Message ID | 1375189330-24066-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
---|---
State | New, archived
Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
> Make sure we can see the writable spte before the dirty bitmap is visible
>
> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> on the dirty bitmap, so we should ensure the writable spte can be found in the
> rmap before the dirty bitmap is visible. Otherwise, we would clear the dirty
> bitmap but fail to write-protect the page.
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 35d4b50..0fe56ad 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		}
>  	}
>
> -	if (pte_access & ACC_WRITE_MASK)
> -		mark_page_dirty(vcpu->kvm, gfn);
> -
>  set_pte:
>  	if (mmu_spte_update(sptep, spte))
>  		kvm_flush_remote_tlbs(vcpu->kvm);
> +
> +	if (pte_access & ACC_WRITE_MASK)
> +		mark_page_dirty(vcpu->kvm, gfn);
>  done:
>  	return ret;
>  }

What about this comment above:

	/*
	 * Optimization: for pte sync, if spte was writable the hash
	 * lookup is unnecessary (and expensive). Write protection
	 * is responsibility of mmu_get_page / kvm_sync_page.
	 * Same reasoning can be applied to dirty page accounting.
	 */
	if (!can_unsync && is_writable_pte(*sptep))
		goto set_pte;

	if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {

? Should it be changed to

	if (!can_unsync && is_writable_pte(*sptep))
		pte_access &= ~ACC_WRITE_MASK; /* do not mark dirty */
	else if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {

?

Thanks,

Paolo
On 07/30/2013 09:26 PM, Paolo Bonzini wrote:
> Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
> [full patch quoted above, trimmed]
>
> What about this comment above:
>
> 	/*
> 	 * Optimization: for pte sync, if spte was writable the hash
> 	 * lookup is unnecessary (and expensive). Write protection
> 	 * is responsibility of mmu_get_page / kvm_sync_page.

This comment means that no sync shadow page is created while the spte is
still writable, because adding a sync page requires write-protecting all
sptes that point to it. So we can keep the spte writable.

I think it is better to check the SPTE_MMU_WRITEABLE bit instead of
PT_WRITABLE_MASK, since the latter can be cleared by dirty logging; that
can be a separate patch, I think. (A sketch of this check follows this
message.)

> 	 * Same reasoning can be applied to dirty page accounting.

This comment means that if the spte is writable, the corresponding bit in
the dirty bitmap should already have been set. Thanks to your reminder, I
think this part of the comment should be dropped: now we need to call
mark_page_dirty() whenever the spte is updated to writable. Otherwise this
can happen:

   VCPU 0 (get dirty log)                VCPU 1 (write fault)

   Clear dirty bit in the bitmap
                                         Read the spte, it is writable
                                         Write the spte: update it, keep
                                         it writable, and do not call
                                         mark_page_dirty()
   Flush tlb

Then VCPU 1 can continue to write the page, but the bit is never set in
the bitmap.

> 	 */
> 	if (!can_unsync && is_writable_pte(*sptep))
> 		goto set_pte;
>
> 	if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
>
> ? Should it be changed to
>
> 	if (!can_unsync && is_writable_pte(*sptep))
> 		pte_access &= ~ACC_WRITE_MASK; /* do not mark dirty */

Yes, this avoids the issue above. But there is only a small window between
syncing the spte and locklessly write-protecting it (since the sptep is
already writable); I think we'd better keep the spte writable to speed up
the common case. :)
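To make the distinction Xiao draws above concrete, here is a minimal sketch of the suggested check. It assumes the software-bit layout used by arch/x86/kvm/mmu.c of that era; the exact bit positions and the helper name spte_can_be_writable are illustrative, not taken from the series:

	/* Sketch only: bit positions mirror mmu.c of that era, illustratively. */
	#define PT_WRITABLE_SHIFT		1
	#define PT_WRITABLE_MASK		(1ULL << PT_WRITABLE_SHIFT)
	#define PT_FIRST_AVAIL_BITS_SHIFT	52
	#define SPTE_HOST_WRITEABLE		(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
	#define SPTE_MMU_WRITEABLE		(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))

	static bool spte_can_be_writable(u64 spte)
	{
		/*
		 * The software bit survives dirty-log write protection;
		 * the hardware PT_WRITABLE_MASK bit does not.  Testing the
		 * software bit keeps the "skip the sync-page hash lookup"
		 * optimization valid even after GET_DIRTY_LOG has made the
		 * spte read-only.
		 */
		return spte & SPTE_MMU_WRITEABLE;
	}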
On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> Make sure we can see the writable spte before the dirty bitmap is visible
>
> [commit message trimmed]
>
>  arch/x86/kvm/mmu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Can you explain why this is safe, with regard to the rule at
edde99ce05290e50?

"The rule is that all pages are either dirty in the current bitmap,
or write-protected, which is violated here."

Overall, please document what the required order of operations is for both
set_spte and get_dirty_log and why this order is safe (right on top of the
code).

> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		}
>  	}
>
> -	if (pte_access & ACC_WRITE_MASK)
> -		mark_page_dirty(vcpu->kvm, gfn);
> -
>  set_pte:
>  	if (mmu_spte_update(sptep, spte))
>  		kvm_flush_remote_tlbs(vcpu->kvm);

Here, there is a modified guest page without the dirty log bit set (think
of another vcpu writing to the page via this spte).

> +
> +	if (pte_access & ACC_WRITE_MASK)
> +		mark_page_dirty(vcpu->kvm, gfn);
>  done:
>  	return ret;
>  }
> --
> 1.8.1.4
On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> [patch quoted above, trimmed]
>
> Can you explain why this is safe, with regard to the rule at
> edde99ce05290e50?

BTW, the commit log there fixed this case:

   VCPU 0                                KVM migration control

                                         write-protects all pages
   #PF happens, then the page
   becomes writable; set the dirty
   bit in the bitmap
                                         swap the bitmap; the current
                                         bitmap is now empty
   write the page (no dirty log)
                                         stop the guest and push the
                                         remaining dirty pages
   Stopped
                                         See that the current bitmap is
                                         empty, meaning no page is dirty.

> "The rule is that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."

Actually, this rule is not completely true; there is a third case: the
window between writing the guest page and setting the dirty bitmap is
valid. In that window, the page is freely writable and not dirty-logged.

This case is based on the fact that, at the final step of live migration,
kvm should stop the guest and push the remaining dirty pages to the
destination.

There are some examples in the current code.

Example 1, in fast_pf_fix_direct_spte():

	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
		/* The window is here... */
		mark_page_dirty(vcpu->kvm, gfn);

Example 2, in kvm_write_guest_page():

	r = __copy_to_user((void __user *)addr + offset, data, len);
	if (r)
		return -EFAULT;
	/*
	 * The window is here: the page is dirty but not logged in
	 * the bitmap.
	 */
	mark_page_dirty(kvm, gfn);
	return 0;

> Overall, please document what the required order of operations is for
> both set_spte and get_dirty_log and why this order is safe (right on top
> of the code).

Okay.

The order required here is: we should 1) set the spte writable __before__
setting the dirty bitmap, and 2) add the spte into the rmap __before__
setting the dirty bitmap.

Point 1) is the same as in fast_pf_fix_direct_spte(), which I explained
above. Points 1) and 2) ensure that we can find the spte in the rmap and
see that it is writable when we do the lockless write protection;
otherwise these cases can happen:

   VCPU 0                                kvm ioctl doing get-dirty-pages

   mark_page_dirty(gfn), which sets
   the gfn in the dirty bitmap
                                         mask = xchg(dirty_bitmap, 0)

                                         walk all gfns set in "mask" and
                                         locklessly write-protect each
                                         gfn; walk the rmap and see no
                                         spte in that rmap
   add the spte into the rmap

   !!!!!! Then the page can be freely written but is not recorded in the
   dirty bitmap.

Or:

   VCPU 0                                kvm ioctl doing get-dirty-pages

   mark_page_dirty(gfn), which sets
   the gfn in the dirty bitmap

   add the spte into the rmap
                                         mask = xchg(dirty_bitmap, 0)

                                         walk all gfns set in "mask" and
                                         locklessly write-protect each
                                         gfn; walk the rmap and see the
                                         spte in the rmap, but it is
                                         read-only or non-present
   mark the spte writable

   !!!!!! Then the page can be freely written but is not recorded in the
   dirty bitmap.

Hopefully I have clarified it; if you have any questions, please let me
know.

> [diff quoted above, trimmed]
>
> Here, there is a modified guest page without the dirty log bit set (think
> of another vcpu writing to the page via this spte).

This is okay since we call mark_page_dirty(vcpu->kvm, gfn) after that; it
is the same case as fast_pf_fix_direct_spte(), and I have explained above
why it is safe.
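The ordering argument above can be made concrete with a small standalone C11 model. Everything below is a sketch under assumed names (fault_path, get_dirty_log_side, the bit layout); the kernel gets the same ordering from the implicit barriers of xchg and friends rather than from explicit memory_order arguments:

	#include <stdatomic.h>
	#include <stdbool.h>

	#define PT_WRITABLE_MASK (1UL << 1)	/* illustrative bit layout */

	static _Atomic unsigned long spte;		/* stands in for the spte + rmap insert */
	static _Atomic unsigned long dirty_bitmap;	/* one word of the dirty bitmap */

	/* vCPU side: publish the writable spte (and rmap entry) first,
	 * then set the dirty bit. */
	static void fault_path(unsigned long gfn_bit)
	{
		atomic_fetch_or_explicit(&spte, PT_WRITABLE_MASK,
					 memory_order_release);
		atomic_fetch_or_explicit(&dirty_bitmap, gfn_bit,
					 memory_order_release);
	}

	/* GET_DIRTY_LOG side: harvest the dirty bits with an atomic
	 * exchange, then write-protect.  If the dirty bit is seen, the
	 * release/acquire pairing guarantees the writable spte is visible
	 * too, so the lockless write protection cannot miss it. */
	static bool get_dirty_log_side(unsigned long gfn_bit)
	{
		unsigned long mask = atomic_exchange_explicit(&dirty_bitmap, 0,
							      memory_order_acquire);
		if (!(mask & gfn_bit))
			return false;	/* page not reported dirty this round */

		/* write-protect: guaranteed to observe PT_WRITABLE_MASK set */
		atomic_fetch_and_explicit(&spte, ~PT_WRITABLE_MASK,
					  memory_order_relaxed);
		return true;
	}

Reversing the two stores in fault_path() reintroduces exactly the first race diagram above: the harvest can observe the dirty bit while the writable spte is still invisible.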
On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> [explanation of the required ordering and the two race diagrams, trimmed]
>
> Hopefully I have clarified it; if you have any questions, please let me
> know.

Yes, partially. On top of the explanation above, please have something
along the lines of:

  the "The flush IPI assumes that a thread switch happens in this order"
  comment at arch/x86/mm/tlb.c

  the "With get_user_pages_fast, we walk down the pagetables without taking"
  comment at arch/x86/mm/gup.c

What about the relation between a read-only spte and the respective TLB
flush? That is:

   vcpu0                                 vcpu1

   lockless write protect
   mark spte read-only
                                         either some mmu-lockless path or
                                         not write protect page:
                                             find read-only page
                                             assumes tlb is flushed
   kvm_flush_remote_tlbs

In general, I think write protection is a good candidate for lockless
operation. However, I would also be concerned about the larger problem of
mmu-lock contention and how to solve it. Did you ever consider splitting
it?

> This is okay since we call mark_page_dirty(vcpu->kvm, gfn) after that; it
> is the same case as fast_pf_fix_direct_spte(), and I have explained above
> why it is safe.

Right.
[ Posting again after adjusting the format, since the mailing list refused
to deliver my previous one. ]

On Aug 8, 2013, at 11:06 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> [previous discussion of the ordering rules and race diagrams, trimmed]
>
> Yes, partially. On top of the explanation above, please have something
> along the lines of:
>
>   the "The flush IPI assumes that a thread switch happens in this order"
>   comment at arch/x86/mm/tlb.c
>
>   the "With get_user_pages_fast, we walk down the pagetables without taking"
>   comment at arch/x86/mm/gup.c

Marcelo, thanks for your suggestion; I will improve both the changelog and
the comments in the next version.

> What about the relation between a read-only spte and the respective TLB
> flush? That is:
>
>    vcpu0                                 vcpu1
>
>    lockless write protect
>    mark spte read-only
>                                          either some mmu-lockless path or
>                                          not write protect page:
>                                              find read-only page
>                                              assumes tlb is flushed
>    kvm_flush_remote_tlbs

The tlb flush on the mmu-lockless paths does not depend on the spte. There
are two lockless paths: one is kvm_mmu_slot_remove_write_access(), which
unconditionally flushes the tlb, and the other is
kvm_vm_ioctl_get_dirty_log(), which flushes the tlb based on the dirty
bitmap (it flushes if there is any dirty page).

Under the protection of mmu-lock, since we flush the tlb whenever a spte
is zapped, we only need to care about the case of spte updates, which is
fixed in "[PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly
modified". That patch changes the flush condition to:

	- if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
	+ if (spte_is_locklessly_modifiable(old_spte) &&
	+     !is_writable_pte(new_spte))

That means whenever a spte that can potentially be locklessly modified
becomes read-only, we do the tlb flush.

> In general, I think write protection is a good candidate for lockless
> operation. However, I would also be concerned about the larger problem
> of mmu-lock contention and how to solve it. Did you ever consider
> splitting it?

Yes, I did, and actually I am still working on that. :)
On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> [discussion of the third case and the fast_pf_fix_direct_spte window,
> trimmed]
>
> Example 2, in kvm_write_guest_page():
>
> 	r = __copy_to_user((void __user *)addr + offset, data, len);
> 	if (r)
> 		return -EFAULT;
> 	/*
> 	 * The window is here: the page is dirty but not logged in
> 	 * the bitmap.
> 	 */
> 	mark_page_dirty(kvm, gfn);
> 	return 0;

Why is this valid? That is, the obviously correct rule is

"that all pages are either dirty in the current bitmap,
or write-protected, which is violated here."

With the window above, GET_DIRTY_LOG can be called 100 times while the
page is dirty, but the corresponding bit is not set in the dirty bitmap.

It violates the documentation:

	/* for KVM_GET_DIRTY_LOG */
	struct kvm_dirty_log {
		__u32 slot;
		__u32 padding;
		union {
			void __user *dirty_bitmap; /* one bit per page */
			__u64 padding;
		};
	};

	Given a memory slot, return a bitmap containing any pages dirtied
	since the last call to this ioctl. Bit 0 is the first page in the
	memory slot. Ensure the entire structure is cleared to avoid
	padding issues.

The point about migration is that GET_DIRTY_LOG is strictly correct there
because it stops the vcpus. But what guarantee does userspace require from
GET_DIRTY_LOG while vcpus are executing?

With fast page fault:

	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
		/* The window in here... */
		mark_page_dirty(vcpu->kvm, gfn);

and the $SUBJECT set_spte reordering, the rule becomes:

  A call to GET_DIRTY_LOG guarantees to return correct information about
  dirty pages before invocation of the previous GET_DIRTY_LOG call.

(See example 1: the next GET_DIRTY_LOG will return the dirty information
there.)

That is the rule for sptes, at least, because kvm_write_guest does not
match the documentation at all.

So before example 1 and this patch, the rule (well, for sptes at least)
was:

  "Given a memory slot, return a bitmap containing any pages dirtied
  since the last call to this ioctl. Bit 0 is the first page in the
  memory slot. Ensure the entire structure is cleared to avoid padding
  issues."

Can you explain why it is OK to relax this rule?

Thanks (reviewing the nulls desc patch...).
On Tue, Nov 19, 2013 at 10:29:20PM -0200, Marcelo Tosatti wrote:
> A call to GET_DIRTY_LOG guarantees to return correct information about
> dirty pages before invocation of the previous GET_DIRTY_LOG call.
>
> Can you explain why it is OK to relax this rule?

That is, this might be OK, but we had better understand why it is.
On Nov 20, 2013, at 8:29 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> [earlier discussion trimmed]

Hi Marcelo,

> Why is this valid? That is, the obviously correct rule is
>
> "that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."
>
> With the window above, GET_DIRTY_LOG can be called 100 times while the
> page is dirty, but the corresponding bit is not set in the dirty bitmap.
>
> [KVM_GET_DIRTY_LOG documentation quoted, trimmed]
>
> The point about migration is that GET_DIRTY_LOG is strictly correct
> there because it stops the vcpus.
>
> But what guarantee does userspace require from GET_DIRTY_LOG while vcpus
> are executing?

Aha. A single GET_DIRTY_LOG call is useless on its own, since new dirty
pages can be generated while GET_DIRTY_LOG is returning. If userspace
wants the exact set of dirty pages, the vcpus should be stopped.

> With fast page fault:
>
> 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> 		/* The window in here... */
> 		mark_page_dirty(vcpu->kvm, gfn);
>
> and the $SUBJECT set_spte reordering, the rule becomes:
>
>   A call to GET_DIRTY_LOG guarantees to return correct information about
>   dirty pages before invocation of the previous GET_DIRTY_LOG call.
>
> (See example 1: the next GET_DIRTY_LOG will return the dirty information
> there.)

It seems not. The first GET_DIRTY_LOG can happen before the fast page
fault; the second GET_DIRTY_LOG happens in the window between cmpxchg()
and mark_page_dirty(). For the second one, the information is still
"incorrect".

> That is the rule for sptes, at least, because kvm_write_guest does not
> match the documentation at all.

You mean the case of "kvm_write_guest" is valid (I do not know why it
is)? Or anything else?

> So before example 1 and this patch, the rule (well, for sptes at least)
> was:
>
>   "Given a memory slot, return a bitmap containing any pages dirtied
>   since the last call to this ioctl. Bit 0 is the first page in the
>   memory slot. Ensure the entire structure is cleared to avoid padding
>   issues."
>
> Can you explain why it is OK to relax this rule?

It's because:

1) it doesn't break the current use cases, i.e. live migration and
   FB-flushing;
2) the current code, like kvm_write_guest, has already broken the
   documentation (the guest page has been written but is missed in the
   dirty bitmap);
3) it is needless to implement an exact get-dirty-pages, since the dirty
   pages cannot be obtained exactly without stopping the vcpus.

So I think we had better document this case instead. No?
On Wed, Nov 20, 2013 at 10:20:09PM +0800, Xiao Guangrong wrote:
> Aha. A single GET_DIRTY_LOG call is useless on its own, since new dirty
> pages can be generated while GET_DIRTY_LOG is returning. If userspace
> wants the exact set of dirty pages, the vcpus should be stopped.
>
> [...]
>
> The first GET_DIRTY_LOG can happen before the fast page fault; the
> second GET_DIRTY_LOG happens in the window between cmpxchg() and
> mark_page_dirty(). For the second one, the information is still
> "incorrect".

It's correct for the previous GET_DIRTY_LOG call.

> It's because:
>
> 1) it doesn't break the current use cases, i.e. live migration and
>    FB-flushing;
> 2) the current code, like kvm_write_guest, has already broken the
>    documentation (the guest page has been written but is missed in the
>    dirty bitmap);
> 3) it is needless to implement an exact get-dirty-pages, since the dirty
>    pages cannot be obtained exactly without stopping the vcpus.
>
> So I think we had better document this case instead. No?

Let's figure out the requirements, then. I don't understand why
FB-flushing is safe (think kvm-autotest: one pixel off and the entire test
fails).

Before fast page fault: pages are either write-protected or the
corresponding dirty bitmap bit is set. Any write faults to dirty-logged
sptes while GET_DIRTY_LOG is executing in the protected section are
allowed to instantiate a writeable spte after GET_DIRTY_LOG is finished.

After fast page fault: pages can be writeable without the dirty bitmap bit
set. Therefore data can be dirty before GET_DIRTY_LOG executes and still
fail to be returned in the bitmap.

Since this patchset does not introduce a change in behaviour (fast page
fault did), there is no reason to avoid merging it.

BTW, since GET_DIRTY_LOG is not required to report write faults that are
concurrent with GET_DIRTY_LOG, it is not necessary to rewalk the spte
list, is it?
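For context, the lockless GET_DIRTY_LOG harvest that both sides of this discussion assume looks roughly like the sketch below. It is a simplification of kvm_vm_ioctl_get_dirty_log() with the series applied; the function name, the loop shape, and the omission of error handling and the userspace copy are liberties taken here:

	/*
	 * Sketch: grab each dirty-bitmap word with xchg() (atomic and
	 * fully ordered on x86), then write-protect exactly the gfns
	 * whose bits were harvested.  Because writable sptes reach the
	 * rmap before their dirty bits become visible, a harvested bit
	 * can never refer to a writable spte the rmap walk would miss.
	 */
	static void harvest_and_protect(struct kvm *kvm,
					struct kvm_memory_slot *memslot,
					unsigned long *buffer)
	{
		unsigned long i, mask;
		unsigned long n = kvm_dirty_bitmap_bytes(memslot) / sizeof(long);

		for (i = 0; i < n; i++) {
			if (!memslot->dirty_bitmap[i])
				continue;

			mask = xchg(&memslot->dirty_bitmap[i], 0);
			buffer[i] = mask;	/* reported to userspace */

			kvm_mmu_write_protect_pt_masked(kvm, memslot,
							i * BITS_PER_LONG,
							mask);
		}
	}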
On Nov 21, 2013, at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Wed, Nov 20, 2013 at 10:20:09PM +0800, Xiao Guangrong wrote:
> [...]
>
> It's correct for the previous GET_DIRTY_LOG call.

Oh, yes.

> Let's figure out the requirements, then. I don't understand why
> FB-flushing is safe (think kvm-autotest: one pixel off and the entire
> test fails).

I have not read the FB-flushing code; I guess the reason it works is that
FB-flushing does a periodic get-dirty-page and flushes the result. After
the guest writes the page, the page will be flushed in the next
GET_DIRTY_LOG cycle.

> Before fast page fault: pages are either write-protected or the
> corresponding dirty bitmap bit is set. Any write faults to dirty-logged
> sptes while GET_DIRTY_LOG is executing in the protected section are
> allowed to instantiate a writeable spte after GET_DIRTY_LOG is finished.
>
> After fast page fault: pages can be writeable without the dirty bitmap
> bit set. Therefore data can be dirty before GET_DIRTY_LOG executes and
> still fail to be returned in the bitmap.

That's right. The current GET_DIRTY_LOG may fail to report the dirty page,
but the next GET_DIRTY_LOG will report it properly, since the current one
needs to flush all TLBs, which waits for the fast page fault to finish. I
do not think this is a big problem, since a single GET_DIRTY_LOG is
useless on its own, as I explained in the previous mail.

> Since this patchset does not introduce a change in behaviour (fast page
> fault did), there is no reason to avoid merging it.

Yes, thank you, Marcelo! :)

> BTW, since GET_DIRTY_LOG is not required to report write faults that are
> concurrent with GET_DIRTY_LOG, it is not necessary to rewalk the spte
> list, is it?

You mean the "rewalk" we introduced in pte_list_walk_lockless() in this
patchset? I think this rewalk is needed: it is triggered by meeting an
unexpected nulls value, which means we are walking an unexpected rmap. If
we did not rewalk, some writable sptes would be missed.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 35d4b50..0fe56ad 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 
-	if (pte_access & ACC_WRITE_MASK)
-		mark_page_dirty(vcpu->kvm, gfn);
-
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		kvm_flush_remote_tlbs(vcpu->kvm);
+
+	if (pte_access & ACC_WRITE_MASK)
+		mark_page_dirty(vcpu->kvm, gfn);
 done:
 	return ret;
 }
Make sure we can see the writable spte before the dirty bitmap is visible

We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
on the dirty bitmap, so we should ensure the writable spte can be found in the
rmap before the dirty bitmap is visible. Otherwise, we would clear the dirty
bitmap but fail to write-protect the page.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)