| Message ID | 200908280431.04960.max@laiers.net (mailing list archive) |
|---|---|
| State | New, archived |
On 08/28/2009 05:31 AM, Max Laier wrote:
> Hello,
>
> it seems to me that the reclaim mechanism for shadow page table pages is
> sub-optimal. The arch.active_mmu_pages list that is used for reclaiming does
> not move up parent shadow page tables when a child is added so when we need
> a new shadow page we zap the oldest - which can well be a directory level
> page holding a just added table level page.
>
> Attached is a proof-of-concept diff and two plots before and after. The
> plots show referenced guest pages over time.

What do you mean by referenced guest pages? Total number of populated sptes?

> As you can see there is less saw-toothing in the after plot and also fewer
> changes overall (because we don't zap mappings that are still in use as
> often). This is with a limit of 64 for the shadow page table to increase the
> effect and vmx/ept.
>
> I realize that the list_move and parent walk are quite expensive and that
> kvm_mmu_alloc_page is only half the story. It should really be done every
> time a new guest page table is mapped - maybe via rmap_add. This would
> obviously completely kill performance-wise, though.
>
> Another idea would be to improve the reclaim logic in a way that it prefers
> "old" PT_PAGE_TABLE_LEVEL over directories. Though I'm not sure how to code
> that up sensibly, either.
>
> As I said, this is proof-of-concept and RFC. So any comments welcome. For my
> use case the proof-of-concept diff seems to do well enough, though.

Given that reclaim is fairly rare, we should try to move the cost there. So
how about this:

- add an 'accessed' flag to struct kvm_mmu_page
- when reclaiming, try to evict pages that were not recently accessed
(but don't overscan - if you scan many recently accessed pages, evict
some of them anyway)
- when scanning, update the accessed flag with the accessed bit of all
parent_ptes
- when dropping an spte, update the accessed flag of the kvm_mmu_page it
points to
- when reloading cr3, mark the page as accessed (since it has no parent_ptes)

This should introduce some LRU-ness that depends not only on fault behaviour
but also on long-term guest access behaviour (which is important for
long-running processes and kernel pages).
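The list above describes the policy in prose; the following is a minimal,
standalone C model of it (not kernel code): scan from the cold end of the
list, skip recently accessed pages while clearing their flag, and stop
skipping after a small overscan budget. The names page_model, pick_victim
and OVERSCAN_LIMIT are invented for the sketch.

/* build: cc -o reclaim_model reclaim_model.c */
#include <stdbool.h>
#include <stdio.h>

struct page_model {
        int id;
        bool accessed;  /* OR of the accessed bits of all parent sptes */
};

#define NPAGES         8
#define OVERSCAN_LIMIT 3        /* arbitrary cap for the sketch */

/*
 * pages[] is ordered oldest first, like walking active_mmu_pages from the
 * tail.  Prefer a page whose accessed flag is clear; clear the flag of
 * pages we skip (second chance); after OVERSCAN_LIMIT recently accessed
 * pages, evict the oldest anyway.
 */
static int pick_victim(struct page_model *pages, int n)
{
        int i, scanned = 0;

        for (i = 0; i < n && scanned < OVERSCAN_LIMIT; i++) {
                if (!pages[i].accessed)
                        return i;               /* cold page: evict it */
                pages[i].accessed = false;      /* second chance */
                scanned++;
        }
        return 0;                               /* overscanned: evict oldest */
}

int main(void)
{
        struct page_model pages[NPAGES];
        int i, victim;

        for (i = 0; i < NPAGES; i++) {
                pages[i].id = i;
                pages[i].accessed = (i % 2 == 0);       /* fake recency data */
        }

        victim = pick_victim(pages, NPAGES);
        printf("would evict page %d\n", pages[victim].id);      /* page 1 */
        return 0;
}

Clearing the flag while skipping gives each hot page a second chance, which
is the usual clock-style approximation of LRU and keeps the scan bounded.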
On Monday 31 August 2009 11:55:24 Avi Kivity wrote:
> On 08/28/2009 05:31 AM, Max Laier wrote:
> > Hello,
> >
> > it seems to me that the reclaim mechanism for shadow page table pages is
> > sub-optimal. The arch.active_mmu_pages list that is used for reclaiming
> > does not move up parent shadow page tables when a child is added so when
> > we need a new shadow page we zap the oldest - which can well be a
> > directory level page holding a just added table level page.
> >
> > Attached is a proof-of-concept diff and two plots before and after. The
> > plots show referenced guest pages over time.
>
> What do you mean by referenced guest pages? Total number of populated
> sptes?

Yes.

> > As you can see there is less saw-toothing in the after plot and also
> > fewer changes overall (because we don't zap mappings that are still in
> > use as often). This is with a limit of 64 for the shadow page table to
> > increase the effect and vmx/ept.
> >
> > I realize that the list_move and parent walk are quite expensive and
> > that kvm_mmu_alloc_page is only half the story. It should really be done
> > every time a new guest page table is mapped - maybe via rmap_add. This
> > would obviously completely kill performance-wise, though.
> >
> > Another idea would be to improve the reclaim logic in a way that it
> > prefers "old" PT_PAGE_TABLE_LEVEL over directories. Though I'm not sure
> > how to code that up sensibly, either.
> >
> > As I said, this is proof-of-concept and RFC. So any comments welcome.
> > For my use case the proof-of-concept diff seems to do well enough,
> > though.
>
> Given that reclaim is fairly rare, we should try to move the cost
> there. So how about this:
>
> - add an 'accessed' flag to struct kvm_mmu_page
> - when reclaiming, try to evict pages that were not recently accessed
> (but don't overscan - if you scan many recently accessed pages, evict
> some of them anyway)

- prefer page table level pages over directory level pages in the face of
overscan.

> - when scanning, update the accessed flag with the accessed bit of all
> parent_ptes

I might be misunderstanding, but I think it should be the other way 'round,
i.e. a page is accessed if any of its children have been accessed.

> - when dropping an spte, update the accessed flag of the kvm_mmu_page it
> points to
> - when reloading cr3, mark the page as accessed (since it has no
> parent_ptes)
>
> This should introduce some LRU-ness that depends not only on fault
> behaviour but also on long-term guest access behaviour (which is
> important for long-running processes and kernel pages).

I'll try to come up with a patch for this, later tonight. Unless you already
have something in the making. Thanks.
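Max's extra bullet ("prefer page table level pages over directory level
pages in the face of overscan") could slot into the same toy model as a
variant of the pick_victim() sketch above: when every scanned page turns out
to be hot, fall back to the oldest last-level page before evicting a
directory. The level field (1 == last-level table, mirroring
PT_PAGE_TABLE_LEVEL) and all other names are again illustrative only.

#include <stdbool.h>
#include <stdio.h>

struct leveled_page {
        int id;
        int level;      /* 1 == last-level page table, >1 == directory */
        bool accessed;
};

#define OVERSCAN_LIMIT 3

static int pick_victim_prefer_leaf(struct leveled_page *pages, int n)
{
        int i, scanned = 0, fallback = 0;

        for (i = 0; i < n && scanned < OVERSCAN_LIMIT; i++) {
                if (!pages[i].accessed)
                        return i;               /* cold page wins outright */
                if (pages[i].level == 1 && pages[fallback].level != 1)
                        fallback = i;           /* remember the oldest hot leaf */
                pages[i].accessed = false;      /* second chance, as before */
                scanned++;
        }
        return fallback;        /* oldest scanned leaf if any, else oldest page */
}

int main(void)
{
        struct leveled_page pages[] = {         /* oldest first: {id, level, accessed} */
                { 0, 2, true },                 /* old directory, recently used */
                { 1, 1, true },                 /* old leaf, recently used */
                { 2, 1, true },
                { 3, 2, true },
        };

        /* every page is hot, so the oldest leaf (id 1) is picked */
        printf("would evict page %d\n", pick_victim_prefer_leaf(pages, 4));
        return 0;
}

It is self-contained, so it can be exercised the same way as the earlier
sketch; only the added level field distinguishes leaves from directories.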
On 08/31/2009 03:09 PM, Max Laier wrote:
>
>>> As you can see there is less saw-toothing in the after plot and also
>>> fewer changes overall (because we don't zap mappings that are still in
>>> use as often). This is with a limit of 64 for the shadow page table to
>>> increase the effect and vmx/ept.
>>>
>>> I realize that the list_move and parent walk are quite expensive and
>>> that kvm_mmu_alloc_page is only half the story. It should really be done
>>> every time a new guest page table is mapped - maybe via rmap_add. This
>>> would obviously completely kill performance-wise, though.
>>>
>>> Another idea would be to improve the reclaim logic in a way that it
>>> prefers "old" PT_PAGE_TABLE_LEVEL over directories. Though I'm not sure
>>> how to code that up sensibly, either.
>>>
>>> As I said, this is proof-of-concept and RFC. So any comments welcome.
>>> For my use case the proof-of-concept diff seems to do well enough,
>>> though.
>>>
>> Given that reclaim is fairly rare, we should try to move the cost
>> there. So how about this:
>>
>> - add an 'accessed' flag to struct kvm_mmu_page
>> - when reclaiming, try to evict pages that were not recently accessed
>> (but don't overscan - if you scan many recently accessed pages, evict
>> some of them anyway)
>>
> - prefer page table level pages over directory level pages in the face of
> overscan.
>

I'm hoping that overscan will only occur when we start to feel memory
pressure, and that once we do a full scan we'll get accurate recency
information.

>> - when scanning, update the accessed flag with the accessed bit of all
>> parent_ptes
>>
> I might be misunderstanding, but I think it should be the other way 'round,
> i.e. a page is accessed if any of its children have been accessed.
>

They're both true, but looking at the parents is much more efficient. Note
we need to look at the accessed bit of the parent_ptes, not parent
kvm_mmu_pages.

>> - when dropping an spte, update the accessed flag of the kvm_mmu_page it
>> points to
>> - when reloading cr3, mark the page as accessed (since it has no
>> parent_ptes)
>>
>> This should introduce some LRU-ness that depends not only on fault
>> behaviour but also on long-term guest access behaviour (which is
>> important for long-running processes and kernel pages).
>>
> I'll try to come up with a patch for this, later tonight. Unless you
> already have something in the making. Thanks.
>

Please do, it's an area that needs attention.
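Avi's point that the scan should look at the accessed bit of the parent_ptes
(the hardware bit in the sptes that map the shadow page) rather than at
parent kvm_mmu_pages can be modelled in a few lines. ACCESSED_BIT below
stands in for the accessed-bit mask KVM keeps for sptes (shadow_accessed_mask
in mmu.c); the struct and function names are hypothetical.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ACCESSED_BIT    (1ULL << 5)     /* stand-in for shadow_accessed_mask */
#define MAX_PARENTS     4

struct shadow_page_model {
        bool accessed;                          /* the proposed kvm_mmu_page flag */
        uint64_t *parent_sptes[MAX_PARENTS];    /* NULL-terminated for the model */
};

/*
 * A shadow page counts as used iff at least one spte pointing at it has the
 * accessed bit set; clear the hardware bit so the next scan sees fresh data.
 */
static void fold_parent_accessed_bits(struct shadow_page_model *sp)
{
        int i;

        for (i = 0; i < MAX_PARENTS && sp->parent_sptes[i]; i++) {
                if (*sp->parent_sptes[i] & ACCESSED_BIT) {
                        sp->accessed = true;
                        *sp->parent_sptes[i] &= ~ACCESSED_BIT;
                }
        }
}

int main(void)
{
        uint64_t spte_a = ACCESSED_BIT, spte_b = 0;
        struct shadow_page_model sp = {
                .accessed = false,
                .parent_sptes = { &spte_a, &spte_b, NULL },
        };

        fold_parent_accessed_bits(&sp);
        printf("accessed=%d spte_a=%#llx\n", sp.accessed,
               (unsigned long long)spte_a);     /* accessed=1 spte_a=0 */
        return 0;
}

In the real MMU the parent sptes are reached through sp->parent_pte, or
through the pte_chain list when the page is multimapped; the model just uses
a flat array to keep the idea visible.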
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95d5329..0a63570 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -190,6 +190,8 @@ struct kvm_unsync_walk {
 };
 
 typedef int (*mmu_parent_walk_fn) (struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+static void mmu_parent_walk(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+                            mmu_parent_walk_fn fn);
 
 static struct kmem_cache *pte_chain_cache;
 static struct kmem_cache *rmap_desc_cache;
@@ -900,6 +902,12 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
         return gfn & ((1 << KVM_MMU_HASH_SHIFT) - 1);
 }
 
+static int move_up_walk_fn(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+        list_move(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
+        return 1;
+}
+
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
                                                u64 *parent_pte)
 {
@@ -918,6 +926,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
         bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
         sp->multimapped = 0;
         sp->parent_pte = parent_pte;
+#if 1
+        if (parent_pte)
+                mmu_parent_walk(vcpu, sp, move_up_walk_fn);
+#endif
         --vcpu->kvm->arch.n_free_mmu_pages;
         return sp;
 }