Message ID | 1369252560-11611-4-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Xiao, On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote: > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to > walk and zap all shadow pages one by one, also it need to zap all guest > page's rmap and all shadow page's parent spte list. Particularly, things > become worse if guest uses more memory or vcpus. It is not good for > scalability > > In this patch, we introduce a faster way to invalidate all shadow pages. > KVM maintains a global mmu invalid generation-number which is stored in > kvm->arch.mmu_valid_gen and every shadow page stores the current global > generation-number into sp->mmu_valid_gen when it is created > > When KVM need zap all shadow pages sptes, it just simply increase the > global generation-number then reload root shadow pages on all vcpus. > Vcpu will create a new shadow page table according to current kvm's > generation-number. It ensures the old pages are not used any more. > Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) > are zapped by using lock-break technique > > Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> > --- > arch/x86/include/asm/kvm_host.h | 2 + > arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++ > arch/x86/kvm/mmu.h | 1 + > 3 files changed, 87 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 3741c65..bff7d46 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -222,6 +222,7 @@ struct kvm_mmu_page { > int root_count; /* Currently serving as active root */ > unsigned int unsync_children; > unsigned long parent_ptes; /* Reverse mapping for parent_pte */ > + unsigned long mmu_valid_gen; > DECLARE_BITMAP(unsync_child_bitmap, 512); > > #ifdef CONFIG_X86_32 > @@ -529,6 +530,7 @@ struct kvm_arch { > unsigned int n_requested_mmu_pages; > unsigned int n_max_mmu_pages; > unsigned int indirect_shadow_pages; > + unsigned long mmu_valid_gen; > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; > /* > * Hash table of struct kvm_mmu_page. > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index f8ca2f3..f302540 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte) > __clear_sp_write_flooding_count(sp); > } > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > +{ > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); > +} > + > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > gfn_t gfn, > gva_t gaddr, > @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > > account_shadowed(vcpu->kvm, gfn); > } > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; > init_shadow_page_table(sp); > trace_kvm_mmu_get_page(sp, true); > return sp; > @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, > ret = mmu_zap_unsync_children(kvm, sp, invalid_list); > kvm_mmu_page_unlink_children(kvm, sp); > kvm_mmu_unlink_parents(kvm, sp); > + > if (!sp->role.invalid && !sp->role.direct) > unaccount_shadowed(kvm, sp->gfn); > + > if (sp->unsync) > kvm_unlink_unsync_page(kvm, sp); > if (!sp->root_count) { > @@ -4195,6 +4203,82 @@ restart: > spin_unlock(&kvm->mmu_lock); > } > > +static void kvm_zap_obsolete_pages(struct kvm *kvm) > +{ > + struct kvm_mmu_page *sp, *node; > + LIST_HEAD(invalid_list); > + > +restart: > + list_for_each_entry_safe_reverse(sp, node, > + &kvm->arch.active_mmu_pages, link) { > + /* > + * No obsolete page exists before new created page since > + * active_mmu_pages is the FIFO list. > + */ > + if (!is_obsolete_sp(kvm, sp)) > + break; Can you add a comment to list_add(x, active_mmu_pages) callsites mentioning this case? Because it'll break silently if people do list_add_tail(). > + /* > + * Do not repeatedly zap a root page to avoid unnecessary > + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to > + * progress: > + * vcpu 0 vcpu 1 > + * call vcpu_enter_guest(): > + * 1): handle KVM_REQ_MMU_RELOAD > + * and require mmu-lock to > + * load mmu > + * repeat: > + * 1): zap root page and > + * send KVM_REQ_MMU_RELOAD > + * > + * 2): if (cond_resched_lock(mmu-lock)) > + * > + * 2): hold mmu-lock and load mmu > + * > + * 3): see KVM_REQ_MMU_RELOAD bit > + * on vcpu->requests is set > + * then return 1 to call > + * vcpu_enter_guest() again. > + * goto repeat; > + * > + * Since we are reversely walking the list and the invalid > + * list will be moved to the head, skip the invalid page > + * can help us to avoid the infinity list walking. > + */ > + if (sp->role.invalid) > + continue; But this allows completing (that is returning), with page that should be zapped still present (even though its invalid). Is another pass needed at the end to take care of the invalid pages? Which at that point must have their root_count decreased. > + > + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { > + kvm_mmu_commit_zap_page(kvm, &invalid_list); > + cond_resched_lock(&kvm->mmu_lock); > + goto restart; > + } > + > + if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list)) > + goto restart; > + } > + > + kvm_mmu_commit_zap_page(kvm, &invalid_list); > +} > + > +/* > + * Fast invalidate all shadow pages and use lock-break technique > + * to zap obsolete pages. > + * > + * It's required when memslot is being deleted or VM is being > + * destroyed, in these cases, we should ensure that KVM MMU does > + * not use any resource of the being-deleted slot or all slots > + * after calling the function. > + */ > +void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm) > +{ > + spin_lock(&kvm->mmu_lock); > + kvm->arch.mmu_valid_gen++; > + > + kvm_zap_obsolete_pages(kvm); > + spin_unlock(&kvm->mmu_lock); > +} > + Also this function should be serialized, that is, should not allow simultaneous kvm_mmu_invalidate_zap_all_pages. If thats so assert(mutex_is_locked(kvm->lock)) would help. Probably fine to have simultaneous users, but not necessary AFAICS. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 24, 2013 at 05:23:07PM -0300, Marcelo Tosatti wrote: > Hi Xiao, > > On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote: > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to > > walk and zap all shadow pages one by one, also it need to zap all guest > > page's rmap and all shadow page's parent spte list. Particularly, things > > become worse if guest uses more memory or vcpus. It is not good for > > scalability > > > > In this patch, we introduce a faster way to invalidate all shadow pages. > > KVM maintains a global mmu invalid generation-number which is stored in > > kvm->arch.mmu_valid_gen and every shadow page stores the current global > > generation-number into sp->mmu_valid_gen when it is created > > > > When KVM need zap all shadow pages sptes, it just simply increase the > > global generation-number then reload root shadow pages on all vcpus. > > Vcpu will create a new shadow page table according to current kvm's > > generation-number. It ensures the old pages are not used any more. > > Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) > > are zapped by using lock-break technique > > > > Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> > > --- > > arch/x86/include/asm/kvm_host.h | 2 + > > arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++ > > arch/x86/kvm/mmu.h | 1 + > > 3 files changed, 87 insertions(+), 0 deletions(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > index 3741c65..bff7d46 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -222,6 +222,7 @@ struct kvm_mmu_page { > > int root_count; /* Currently serving as active root */ > > unsigned int unsync_children; > > unsigned long parent_ptes; /* Reverse mapping for parent_pte */ > > + unsigned long mmu_valid_gen; > > DECLARE_BITMAP(unsync_child_bitmap, 512); > > > > #ifdef CONFIG_X86_32 > > @@ -529,6 +530,7 @@ struct kvm_arch { > > unsigned int n_requested_mmu_pages; > > unsigned int n_max_mmu_pages; > > unsigned int indirect_shadow_pages; > > + unsigned long mmu_valid_gen; > > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; > > /* > > * Hash table of struct kvm_mmu_page. > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > > index f8ca2f3..f302540 100644 > > --- a/arch/x86/kvm/mmu.c > > +++ b/arch/x86/kvm/mmu.c > > @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte) > > __clear_sp_write_flooding_count(sp); > > } > > > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > > +{ > > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); > > +} > > + > > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > > gfn_t gfn, > > gva_t gaddr, > > @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > > > > account_shadowed(vcpu->kvm, gfn); > > } > > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; > > init_shadow_page_table(sp); > > trace_kvm_mmu_get_page(sp, true); > > return sp; > > @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, > > ret = mmu_zap_unsync_children(kvm, sp, invalid_list); > > kvm_mmu_page_unlink_children(kvm, sp); > > kvm_mmu_unlink_parents(kvm, sp); > > + > > if (!sp->role.invalid && !sp->role.direct) > > unaccount_shadowed(kvm, sp->gfn); > > + > > if (sp->unsync) > > kvm_unlink_unsync_page(kvm, sp); > > if (!sp->root_count) { > > @@ -4195,6 +4203,82 @@ restart: > > spin_unlock(&kvm->mmu_lock); > > } > > > > +static void kvm_zap_obsolete_pages(struct kvm *kvm) > > +{ > > + struct kvm_mmu_page *sp, *node; > > + LIST_HEAD(invalid_list); > > + > > +restart: > > + list_for_each_entry_safe_reverse(sp, node, > > + &kvm->arch.active_mmu_pages, link) { > > + /* > > + * No obsolete page exists before new created page since > > + * active_mmu_pages is the FIFO list. > > + */ > > + if (!is_obsolete_sp(kvm, sp)) > > + break; > > Can you add a comment to list_add(x, active_mmu_pages) callsites > mentioning this case? > > Because it'll break silently if people do list_add_tail(). > > > + /* > > + * Do not repeatedly zap a root page to avoid unnecessary > > + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to > > + * progress: > > + * vcpu 0 vcpu 1 > > + * call vcpu_enter_guest(): > > + * 1): handle KVM_REQ_MMU_RELOAD > > + * and require mmu-lock to > > + * load mmu > > + * repeat: > > + * 1): zap root page and > > + * send KVM_REQ_MMU_RELOAD > > + * > > + * 2): if (cond_resched_lock(mmu-lock)) > > + * > > + * 2): hold mmu-lock and load mmu > > + * > > + * 3): see KVM_REQ_MMU_RELOAD bit > > + * on vcpu->requests is set > > + * then return 1 to call > > + * vcpu_enter_guest() again. > > + * goto repeat; > > + * > > + * Since we are reversely walking the list and the invalid > > + * list will be moved to the head, skip the invalid page > > + * can help us to avoid the infinity list walking. > > + */ > > + if (sp->role.invalid) > > + continue; > > But this allows completing (that is returning), with page that should > be zapped still present (even though its invalid). > > Is another pass needed at the end to take care of the invalid pages? > Which at that point must have their root_count decreased. > It is not different from how it work now. Invalid page can still be not zapped after zap_all() completes. They are ignored by all relevant code paths. > > + > > + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { > > + kvm_mmu_commit_zap_page(kvm, &invalid_list); > > + cond_resched_lock(&kvm->mmu_lock); > > + goto restart; > > + } > > + > > + if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list)) > > + goto restart; > > + } > > + > > + kvm_mmu_commit_zap_page(kvm, &invalid_list); > > +} > > + > > +/* > > + * Fast invalidate all shadow pages and use lock-break technique > > + * to zap obsolete pages. > > + * > > + * It's required when memslot is being deleted or VM is being > > + * destroyed, in these cases, we should ensure that KVM MMU does > > + * not use any resource of the being-deleted slot or all slots > > + * after calling the function. > > + */ > > +void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm) > > +{ > > + spin_lock(&kvm->mmu_lock); > > + kvm->arch.mmu_valid_gen++; > > + > > + kvm_zap_obsolete_pages(kvm); > > + spin_unlock(&kvm->mmu_lock); > > +} > > + > > Also this function should be serialized, that is, should not allow > simultaneous kvm_mmu_invalidate_zap_all_pages. If thats so > assert(mutex_is_locked(kvm->lock)) would help. > > Probably fine to have simultaneous users, but not necessary AFAICS. I raced this point on previous patch submission. The function can be executed simultaneously only during vmexit when zap_all is called by mm notifiers. During regular work the function is only called by slot manipulation functions and those are serialized by slot lock. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, May 26, 2013 at 11:26:49AM +0300, Gleb Natapov wrote: > On Fri, May 24, 2013 at 05:23:07PM -0300, Marcelo Tosatti wrote: > > Hi Xiao, > > > > On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote: > > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to > > > walk and zap all shadow pages one by one, also it need to zap all guest > > > page's rmap and all shadow page's parent spte list. Particularly, things > > > become worse if guest uses more memory or vcpus. It is not good for > > > scalability > > > > > > In this patch, we introduce a faster way to invalidate all shadow pages. > > > KVM maintains a global mmu invalid generation-number which is stored in > > > kvm->arch.mmu_valid_gen and every shadow page stores the current global > > > generation-number into sp->mmu_valid_gen when it is created > > > > > > When KVM need zap all shadow pages sptes, it just simply increase the > > > global generation-number then reload root shadow pages on all vcpus. > > > Vcpu will create a new shadow page table according to current kvm's > > > generation-number. It ensures the old pages are not used any more. > > > Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) > > > are zapped by using lock-break technique > > > > > > Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> > > > --- > > > arch/x86/include/asm/kvm_host.h | 2 + > > > arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++ > > > arch/x86/kvm/mmu.h | 1 + > > > 3 files changed, 87 insertions(+), 0 deletions(-) > > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > > index 3741c65..bff7d46 100644 > > > --- a/arch/x86/include/asm/kvm_host.h > > > +++ b/arch/x86/include/asm/kvm_host.h > > > @@ -222,6 +222,7 @@ struct kvm_mmu_page { > > > int root_count; /* Currently serving as active root */ > > > unsigned int unsync_children; > > > unsigned long parent_ptes; /* Reverse mapping for parent_pte */ > > > + unsigned long mmu_valid_gen; > > > DECLARE_BITMAP(unsync_child_bitmap, 512); > > > > > > #ifdef CONFIG_X86_32 > > > @@ -529,6 +530,7 @@ struct kvm_arch { > > > unsigned int n_requested_mmu_pages; > > > unsigned int n_max_mmu_pages; > > > unsigned int indirect_shadow_pages; > > > + unsigned long mmu_valid_gen; > > > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; > > > /* > > > * Hash table of struct kvm_mmu_page. > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > > > index f8ca2f3..f302540 100644 > > > --- a/arch/x86/kvm/mmu.c > > > +++ b/arch/x86/kvm/mmu.c > > > @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte) > > > __clear_sp_write_flooding_count(sp); > > > } > > > > > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > > > +{ > > > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); > > > +} > > > + > > > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > > > gfn_t gfn, > > > gva_t gaddr, > > > @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, > > > > > > account_shadowed(vcpu->kvm, gfn); > > > } > > > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; > > > init_shadow_page_table(sp); > > > trace_kvm_mmu_get_page(sp, true); > > > return sp; > > > @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, > > > ret = mmu_zap_unsync_children(kvm, sp, invalid_list); > > > kvm_mmu_page_unlink_children(kvm, sp); > > > kvm_mmu_unlink_parents(kvm, sp); > > > + > > > if (!sp->role.invalid && !sp->role.direct) > > > unaccount_shadowed(kvm, sp->gfn); > > > + > > > if (sp->unsync) > > > kvm_unlink_unsync_page(kvm, sp); > > > if (!sp->root_count) { > > > @@ -4195,6 +4203,82 @@ restart: > > > spin_unlock(&kvm->mmu_lock); > > > } > > > > > > +static void kvm_zap_obsolete_pages(struct kvm *kvm) > > > +{ > > > + struct kvm_mmu_page *sp, *node; > > > + LIST_HEAD(invalid_list); > > > + > > > +restart: > > > + list_for_each_entry_safe_reverse(sp, node, > > > + &kvm->arch.active_mmu_pages, link) { > > > + /* > > > + * No obsolete page exists before new created page since > > > + * active_mmu_pages is the FIFO list. > > > + */ > > > + if (!is_obsolete_sp(kvm, sp)) > > > + break; > > > > Can you add a comment to list_add(x, active_mmu_pages) callsites > > mentioning this case? > > > > Because it'll break silently if people do list_add_tail(). > > > > > + /* > > > + * Do not repeatedly zap a root page to avoid unnecessary > > > + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to > > > + * progress: > > > + * vcpu 0 vcpu 1 > > > + * call vcpu_enter_guest(): > > > + * 1): handle KVM_REQ_MMU_RELOAD > > > + * and require mmu-lock to > > > + * load mmu > > > + * repeat: > > > + * 1): zap root page and > > > + * send KVM_REQ_MMU_RELOAD > > > + * > > > + * 2): if (cond_resched_lock(mmu-lock)) > > > + * > > > + * 2): hold mmu-lock and load mmu > > > + * > > > + * 3): see KVM_REQ_MMU_RELOAD bit > > > + * on vcpu->requests is set > > > + * then return 1 to call > > > + * vcpu_enter_guest() again. > > > + * goto repeat; > > > + * > > > + * Since we are reversely walking the list and the invalid > > > + * list will be moved to the head, skip the invalid page > > > + * can help us to avoid the infinity list walking. > > > + */ > > > + if (sp->role.invalid) > > > + continue; > > > > But this allows completing (that is returning), with page that should > > be zapped still present (even though its invalid). > > > > Is another pass needed at the end to take care of the invalid pages? > > Which at that point must have their root_count decreased. > > > It is not different from how it work now. Invalid page can still be not > zapped after zap_all() completes. They are ignored by all relevant code > paths. It is different. kvm_mmu_zap_all() returns with 0 pages at mmu_active_pages today. It is probably worthwhile to maintain behaviour (instead of every other code path does the right thing). > > Also this function should be serialized, that is, should not allow > > simultaneous kvm_mmu_invalidate_zap_all_pages. If thats so > > assert(mutex_is_locked(kvm->lock)) would help. > > > > Probably fine to have simultaneous users, but not necessary AFAICS. > I raced this point on previous patch submission. The function can be > executed simultaneously only during vmexit when zap_all is called by mm > notifiers. During regular work the function is only called by slot > manipulation functions and those are serialized by slot lock. How about fix hypercall vs kvm_set_memory? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Marcelo, On 05/25/2013 04:23 AM, Marcelo Tosatti wrote: >> +static void kvm_zap_obsolete_pages(struct kvm *kvm) >> +{ >> + struct kvm_mmu_page *sp, *node; >> + LIST_HEAD(invalid_list); >> + >> +restart: >> + list_for_each_entry_safe_reverse(sp, node, >> + &kvm->arch.active_mmu_pages, link) { >> + /* >> + * No obsolete page exists before new created page since >> + * active_mmu_pages is the FIFO list. >> + */ >> + if (!is_obsolete_sp(kvm, sp)) >> + break; > > Can you add a comment to list_add(x, active_mmu_pages) callsites > mentioning this case? > > Because it'll break silently if people do list_add_tail(). Sure, I will do it in the next version. And i totally agree with Gleb's points that reply your other questions in this patch. Thank you all! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/27/2013 04:37 AM, Marcelo Tosatti wrote: > On Sun, May 26, 2013 at 11:26:49AM +0300, Gleb Natapov wrote: >> On Fri, May 24, 2013 at 05:23:07PM -0300, Marcelo Tosatti wrote: >>> Hi Xiao, >>> >>> On Thu, May 23, 2013 at 03:55:52AM +0800, Xiao Guangrong wrote: >>>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to >>>> walk and zap all shadow pages one by one, also it need to zap all guest >>>> page's rmap and all shadow page's parent spte list. Particularly, things >>>> become worse if guest uses more memory or vcpus. It is not good for >>>> scalability >>>> >>>> In this patch, we introduce a faster way to invalidate all shadow pages. >>>> KVM maintains a global mmu invalid generation-number which is stored in >>>> kvm->arch.mmu_valid_gen and every shadow page stores the current global >>>> generation-number into sp->mmu_valid_gen when it is created >>>> >>>> When KVM need zap all shadow pages sptes, it just simply increase the >>>> global generation-number then reload root shadow pages on all vcpus. >>>> Vcpu will create a new shadow page table according to current kvm's >>>> generation-number. It ensures the old pages are not used any more. >>>> Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) >>>> are zapped by using lock-break technique >>>> >>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> >>>> --- >>>> arch/x86/include/asm/kvm_host.h | 2 + >>>> arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++ >>>> arch/x86/kvm/mmu.h | 1 + >>>> 3 files changed, 87 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >>>> index 3741c65..bff7d46 100644 >>>> --- a/arch/x86/include/asm/kvm_host.h >>>> +++ b/arch/x86/include/asm/kvm_host.h >>>> @@ -222,6 +222,7 @@ struct kvm_mmu_page { >>>> int root_count; /* Currently serving as active root */ >>>> unsigned int unsync_children; >>>> unsigned long parent_ptes; /* Reverse mapping for parent_pte */ >>>> + unsigned long mmu_valid_gen; >>>> DECLARE_BITMAP(unsync_child_bitmap, 512); >>>> >>>> #ifdef CONFIG_X86_32 >>>> @@ -529,6 +530,7 @@ struct kvm_arch { >>>> unsigned int n_requested_mmu_pages; >>>> unsigned int n_max_mmu_pages; >>>> unsigned int indirect_shadow_pages; >>>> + unsigned long mmu_valid_gen; >>>> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; >>>> /* >>>> * Hash table of struct kvm_mmu_page. >>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c >>>> index f8ca2f3..f302540 100644 >>>> --- a/arch/x86/kvm/mmu.c >>>> +++ b/arch/x86/kvm/mmu.c >>>> @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte) >>>> __clear_sp_write_flooding_count(sp); >>>> } >>>> >>>> +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) >>>> +{ >>>> + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); >>>> +} >>>> + >>>> static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, >>>> gfn_t gfn, >>>> gva_t gaddr, >>>> @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, >>>> >>>> account_shadowed(vcpu->kvm, gfn); >>>> } >>>> + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; >>>> init_shadow_page_table(sp); >>>> trace_kvm_mmu_get_page(sp, true); >>>> return sp; >>>> @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, >>>> ret = mmu_zap_unsync_children(kvm, sp, invalid_list); >>>> kvm_mmu_page_unlink_children(kvm, sp); >>>> kvm_mmu_unlink_parents(kvm, sp); >>>> + >>>> if (!sp->role.invalid && !sp->role.direct) >>>> unaccount_shadowed(kvm, sp->gfn); >>>> + >>>> if (sp->unsync) >>>> kvm_unlink_unsync_page(kvm, sp); >>>> if (!sp->root_count) { >>>> @@ -4195,6 +4203,82 @@ restart: >>>> spin_unlock(&kvm->mmu_lock); >>>> } >>>> >>>> +static void kvm_zap_obsolete_pages(struct kvm *kvm) >>>> +{ >>>> + struct kvm_mmu_page *sp, *node; >>>> + LIST_HEAD(invalid_list); >>>> + >>>> +restart: >>>> + list_for_each_entry_safe_reverse(sp, node, >>>> + &kvm->arch.active_mmu_pages, link) { >>>> + /* >>>> + * No obsolete page exists before new created page since >>>> + * active_mmu_pages is the FIFO list. >>>> + */ >>>> + if (!is_obsolete_sp(kvm, sp)) >>>> + break; >>> >>> Can you add a comment to list_add(x, active_mmu_pages) callsites >>> mentioning this case? >>> >>> Because it'll break silently if people do list_add_tail(). >>> >>>> + /* >>>> + * Do not repeatedly zap a root page to avoid unnecessary >>>> + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to >>>> + * progress: >>>> + * vcpu 0 vcpu 1 >>>> + * call vcpu_enter_guest(): >>>> + * 1): handle KVM_REQ_MMU_RELOAD >>>> + * and require mmu-lock to >>>> + * load mmu >>>> + * repeat: >>>> + * 1): zap root page and >>>> + * send KVM_REQ_MMU_RELOAD >>>> + * >>>> + * 2): if (cond_resched_lock(mmu-lock)) >>>> + * >>>> + * 2): hold mmu-lock and load mmu >>>> + * >>>> + * 3): see KVM_REQ_MMU_RELOAD bit >>>> + * on vcpu->requests is set >>>> + * then return 1 to call >>>> + * vcpu_enter_guest() again. >>>> + * goto repeat; >>>> + * >>>> + * Since we are reversely walking the list and the invalid >>>> + * list will be moved to the head, skip the invalid page >>>> + * can help us to avoid the infinity list walking. >>>> + */ >>>> + if (sp->role.invalid) >>>> + continue; >>> >>> But this allows completing (that is returning), with page that should >>> be zapped still present (even though its invalid). >>> >>> Is another pass needed at the end to take care of the invalid pages? >>> Which at that point must have their root_count decreased. >>> >> It is not different from how it work now. Invalid page can still be not >> zapped after zap_all() completes. They are ignored by all relevant code >> paths. > > It is different. kvm_mmu_zap_all() returns with 0 pages at > mmu_active_pages today. Do not think so. kvm_mmu_zap_all() also keeps role.root && role.invalid pages on mmu_active_pages. role.root && role.invalid pages should be zapped in vcpu context. > > It is probably worthwhile to maintain behaviour (instead of every other > code path does the right thing). > >>> Also this function should be serialized, that is, should not allow >>> simultaneous kvm_mmu_invalidate_zap_all_pages. If thats so >>> assert(mutex_is_locked(kvm->lock)) would help. >>> >>> Probably fine to have simultaneous users, but not necessary AFAICS. >> I raced this point on previous patch submission. The function can be >> executed simultaneously only during vmexit when zap_all is called by mm >> notifiers. During regular work the function is only called by slot >> manipulation functions and those are serialized by slot lock. > > How about fix hypercall vs kvm_set_memory? In this version, we have dropped call kvm_mmu_zap_all in hypercall since it is useless. (pacthing hypercall is broken for a long time: https://lkml.org/lkml/2013/5/22/388) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3741c65..bff7d46 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -222,6 +222,7 @@ struct kvm_mmu_page { int root_count; /* Currently serving as active root */ unsigned int unsync_children; unsigned long parent_ptes; /* Reverse mapping for parent_pte */ + unsigned long mmu_valid_gen; DECLARE_BITMAP(unsync_child_bitmap, 512); #ifdef CONFIG_X86_32 @@ -529,6 +530,7 @@ struct kvm_arch { unsigned int n_requested_mmu_pages; unsigned int n_max_mmu_pages; unsigned int indirect_shadow_pages; + unsigned long mmu_valid_gen; struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; /* * Hash table of struct kvm_mmu_page. diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index f8ca2f3..f302540 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte) __clear_sp_write_flooding_count(sp); } +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); +} + static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gaddr, @@ -1900,6 +1905,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, account_shadowed(vcpu->kvm, gfn); } + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen; init_shadow_page_table(sp); trace_kvm_mmu_get_page(sp, true); return sp; @@ -2070,8 +2076,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, ret = mmu_zap_unsync_children(kvm, sp, invalid_list); kvm_mmu_page_unlink_children(kvm, sp); kvm_mmu_unlink_parents(kvm, sp); + if (!sp->role.invalid && !sp->role.direct) unaccount_shadowed(kvm, sp->gfn); + if (sp->unsync) kvm_unlink_unsync_page(kvm, sp); if (!sp->root_count) { @@ -4195,6 +4203,82 @@ restart: spin_unlock(&kvm->mmu_lock); } +static void kvm_zap_obsolete_pages(struct kvm *kvm) +{ + struct kvm_mmu_page *sp, *node; + LIST_HEAD(invalid_list); + +restart: + list_for_each_entry_safe_reverse(sp, node, + &kvm->arch.active_mmu_pages, link) { + /* + * No obsolete page exists before new created page since + * active_mmu_pages is the FIFO list. + */ + if (!is_obsolete_sp(kvm, sp)) + break; + + /* + * Do not repeatedly zap a root page to avoid unnecessary + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to + * progress: + * vcpu 0 vcpu 1 + * call vcpu_enter_guest(): + * 1): handle KVM_REQ_MMU_RELOAD + * and require mmu-lock to + * load mmu + * repeat: + * 1): zap root page and + * send KVM_REQ_MMU_RELOAD + * + * 2): if (cond_resched_lock(mmu-lock)) + * + * 2): hold mmu-lock and load mmu + * + * 3): see KVM_REQ_MMU_RELOAD bit + * on vcpu->requests is set + * then return 1 to call + * vcpu_enter_guest() again. + * goto repeat; + * + * Since we are reversely walking the list and the invalid + * list will be moved to the head, skip the invalid page + * can help us to avoid the infinity list walking. + */ + if (sp->role.invalid) + continue; + + if (need_resched() || spin_needbreak(&kvm->mmu_lock)) { + kvm_mmu_commit_zap_page(kvm, &invalid_list); + cond_resched_lock(&kvm->mmu_lock); + goto restart; + } + + if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list)) + goto restart; + } + + kvm_mmu_commit_zap_page(kvm, &invalid_list); +} + +/* + * Fast invalidate all shadow pages and use lock-break technique + * to zap obsolete pages. + * + * It's required when memslot is being deleted or VM is being + * destroyed, in these cases, we should ensure that KVM MMU does + * not use any resource of the being-deleted slot or all slots + * after calling the function. + */ +void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm) +{ + spin_lock(&kvm->mmu_lock); + kvm->arch.mmu_valid_gen++; + + kvm_zap_obsolete_pages(kvm); + spin_unlock(&kvm->mmu_lock); +} + void kvm_mmu_zap_mmio_sptes(struct kvm *kvm) { struct kvm_mmu_page *sp, *node; diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 2adcbc2..922bfae 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -97,4 +97,5 @@ static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access, return (mmu->permissions[pfec >> 1] >> pte_access) & 1; } +void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm); #endif
The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to walk and zap all shadow pages one by one, also it need to zap all guest page's rmap and all shadow page's parent spte list. Particularly, things become worse if guest uses more memory or vcpus. It is not good for scalability In this patch, we introduce a faster way to invalidate all shadow pages. KVM maintains a global mmu invalid generation-number which is stored in kvm->arch.mmu_valid_gen and every shadow page stores the current global generation-number into sp->mmu_valid_gen when it is created When KVM need zap all shadow pages sptes, it just simply increase the global generation-number then reload root shadow pages on all vcpus. Vcpu will create a new shadow page table according to current kvm's generation-number. It ensures the old pages are not used any more. Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen) are zapped by using lock-break technique Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> --- arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/mmu.c | 84 +++++++++++++++++++++++++++++++++++++++ arch/x86/kvm/mmu.h | 1 + 3 files changed, 87 insertions(+), 0 deletions(-)