
[04/12] KVM: MMU: log dirty page after marking spte writable

Message ID 1375189330-24066-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong July 30, 2013, 1:02 p.m. UTC
Make sure we can see the writable spte before the dirty bitmap is visible

We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
on the dirty bitmap, so we should ensure the writable spte can be found in the
rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
but fail to write-protect the page.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Paolo Bonzini July 30, 2013, 1:26 p.m. UTC | #1
On 30/07/2013 15:02, Xiao Guangrong wrote:
> Make sure we can see the writable spte before the dirty bitmap is visible
> 
> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> on the dirty bitmap, so we should ensure the writable spte can be found in the
> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
> but fail to write-protect the page.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 35d4b50..0fe56ad 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		}
>  	}
>  
> -	if (pte_access & ACC_WRITE_MASK)
> -		mark_page_dirty(vcpu->kvm, gfn);
> -
>  set_pte:
>  	if (mmu_spte_update(sptep, spte))
>  		kvm_flush_remote_tlbs(vcpu->kvm);
> +
> +	if (pte_access & ACC_WRITE_MASK)
> +		mark_page_dirty(vcpu->kvm, gfn);
>  done:
>  	return ret;
>  }
> 

What about this comment above:

                /*
                 * Optimization: for pte sync, if spte was writable the hash
                 * lookup is unnecessary (and expensive). Write protection
                 * is responsibility of mmu_get_page / kvm_sync_page.
                 * Same reasoning can be applied to dirty page accounting.
                 */
                if (!can_unsync && is_writable_pte(*sptep))
                        goto set_pte;

                if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {


?

Should it be changed to

                if (!can_unsync && is_writable_pte(*sptep))
                        pte_access &= ~ACC_WRITE_MASK; /* do not mark dirty */

                else if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {

?

Thanks,

Paolo
Xiao Guangrong July 31, 2013, 7:25 a.m. UTC | #2
On 07/30/2013 09:26 PM, Paolo Bonzini wrote:
> On 30/07/2013 15:02, Xiao Guangrong wrote:
>> Make sure we can see the writable spte before the dirty bitmap is visible
>>
>> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>> on the dirty bitmap, so we should ensure the writable spte can be found in the
>> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
>> but fail to write-protect the page.
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kvm/mmu.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 35d4b50..0fe56ad 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>>  		}
>>  	}
>>  
>> -	if (pte_access & ACC_WRITE_MASK)
>> -		mark_page_dirty(vcpu->kvm, gfn);
>> -
>>  set_pte:
>>  	if (mmu_spte_update(sptep, spte))
>>  		kvm_flush_remote_tlbs(vcpu->kvm);
>> +
>> +	if (pte_access & ACC_WRITE_MASK)
>> +		mark_page_dirty(vcpu->kvm, gfn);
>>  done:
>>  	return ret;
>>  }
>>
> 
> What about this comment above:
> 
>                 /*
>                  * Optimization: for pte sync, if spte was writable the hash
>                  * lookup is unnecessary (and expensive). Write protection
>                  * is responsibility of mmu_get_page / kvm_sync_page.

This comment means that no sync shadow page is created while the spte is still
writable, because adding a sync page would require write-protecting all sptes
that point to that page. So we can keep the spte writable.

I think it is better to check the SPTE_MMU_WRITEABLE bit instead of
PT_WRITABLE_MASK, since the latter bit can be cleared by dirty logging; that
can be a separate patch, I think.
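
Something like this, perhaps (just a sketch of the idea, not a tested change):

	if (!can_unsync && (*sptep & SPTE_MMU_WRITEABLE))
		goto set_pte;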

>                  * Same reasoning can be applied to dirty page accounting.

This comment means that if the spte is writable, the corresponding bit in the
dirty bitmap should already have been set.

Thanks for the reminder; I think this comment should be dropped. Now we need to
call mark_page_dirty() whenever the spte is updated to writable. Otherwise this
can happen:

   VCPU 0                         VCPU 1
Clear dirty bit on the bitmap
                               Read the spte, it is writable
Write the spte
                               Update the spte, keep it as writable
                               and do not call mark_page_dirty().
Flush tlb

Then vcpu 1 can keep writing the page, but the bit is never set in the bitmap.

>                  */
>                 if (!can_unsync && is_writable_pte(*sptep))
>                         goto set_pte;
> 
>                 if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
> 
> 
> ?
> 
> Should it be changed to
> 
>                 if (!can_unsync && is_writable_pte(*sptep))
>                         pte_access &= ~ACC_WRITE_MASK; /* do not mark dirty */

Yes, this can avoid the issue above.

But there is only a small window between syncing the spte and locklessly
write-protecting it (since the sptep is already writable), so I think we'd
better keep the spte writable to speed up the common case. :)

Marcelo Tosatti Aug. 7, 2013, 1:48 a.m. UTC | #3
On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> Make sure we can see the writable spte before the dirty bitmap is visible
> 
> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> on the dirty bitmap, so we should ensure the writable spte can be found in the
> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
> but fail to write-protect the page.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Can you explain why this is safe, with regard to the rule
at edde99ce05290e50?

"The rule is that all pages are either dirty in the current bitmap,
or write-protected, which is violated here."

Overall, please document what is the required order of operations for
both set_spte and get_dirty_log and why this order is safe (right on top
of the code).

> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 35d4b50..0fe56ad 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>  		}
>  	}
>  
> -	if (pte_access & ACC_WRITE_MASK)
> -		mark_page_dirty(vcpu->kvm, gfn);
> -
>  set_pte:
>  	if (mmu_spte_update(sptep, spte))
>  		kvm_flush_remote_tlbs(vcpu->kvm);

Here, there is a modified guest page without dirty log bit set (think
another vcpu writing to the page via this spte).

> +
> +	if (pte_access & ACC_WRITE_MASK)
> +		mark_page_dirty(vcpu->kvm, gfn);
>  done:
>  	return ret;
>  }
> -- 
> 1.8.1.4
Xiao Guangrong Aug. 7, 2013, 4:06 a.m. UTC | #4
On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>> Make sure we can see the writable spte before the dirty bitmap is visible
>>
>> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>> on the dirty bitmap, so we should ensure the writable spte can be found in the
>> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
>> but fail to write-protect the page.
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kvm/mmu.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Can you explain why this is safe, with regard to the rule 
> at edde99ce05290e50 ?

BTW, this patch fixes this case:

     VCPU 0                KVM migration control

                           write-protects all pages
#PF happens, then the page
becomes writable; set dirty
bit on the bitmap

                          swap the bitmap, current bitmap is empty

write the page (no dirty log)

                          stop the guest and push
                          the remaining dirty pages
Stopped
                          See the current bitmap is empty, which
                          means no page is dirty.
> 
> "The rule is that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."

Actually, this rule is not completely true; there is a third case:
the window between writing the guest page and setting the dirty bitmap.
In that window, the page is freely writable but not dirty logged.

This case is based on the fact that at the final step of live migration,
kvm stops the guest and pushes the remaining dirty pages to the
destination.

There are some examples in the current code:
example 1, in fast_pf_fix_direct_spte():
	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
		/* The window in here... */
		mark_page_dirty(vcpu->kvm, gfn);

example 2, in kvm_write_guest_page():
	r = __copy_to_user((void __user *)addr + offset, data, len);
	if (r)
		return -EFAULT;
	/*
	 * The window is here: the page is dirty but not logged in
	 * the bitmap.
	 */
	mark_page_dirty(kvm, gfn);
	return 0;

> 
> Overall, please document what is the required order of operations for
> both set_spte and get_dirty_log and why this order is safe (right on top
> of the code).

Okay.

The order we require here is: we should 1) set the spte to writable __before__
setting the dirty bitmap and 2) add the spte into the rmap __before__ setting
the dirty bitmap.

Point 1) is the same as in fast_pf_fix_direct_spte(), which I explained above.
Points 1) and 2) together ensure that we can find the spte in the rmap and see
that it is writable when we do the lockless write-protection; otherwise these
cases can happen:

VCPU 0					kvm ioctl doing get-dirty-pages

mark_page_dirty(gfn) which
sets the gfn in the dirty bitmap
                                      mask = xchg(dirty_bitmap, 0)

                                      walk all gfns which are set in "mask" and
                                      locklessly write-protect the gfn,
                                      then walk the rmap, see no spte in that rmap

add the spte into the rmap

!!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.

Or:

VCPU 0					kvm ioctl doing get-dirty-pages

mark_page_dirty(gfn) which
sets the gfn in the dirty bitmap

add the spte into the rmap
                                      mask = xchg(dirty_bitmap, 0)

                                      walk all gfns which are set in "mask" and
                                      locklessly write-protect the gfn,
                                      then walk the rmap, see the spte is in the
                                      rmap but it is readonly or nonpresent.

Mark the spte writable

!!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.

Hopefully I have clarified it; if you have any questions, please let me know.
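
To make the required order concrete, here is a minimal sketch of the two sides
(only an illustration, not the exact code; the helper names are made up):

	/* vcpu side (set_spte / fast page fault): */
	make_spte_writable_and_visible_in_rmap(sptep);	/* points 1) and 2) */
	smp_wmb();	/* order the spte/rmap stores before the bitmap store */
	mark_page_dirty(vcpu->kvm, gfn);

	/* get-dirty-pages side: */
	mask = xchg(dirty_bitmap, 0);	/* xchg implies a full memory barrier */
	write_protect_gfns_locklessly(kvm, mask);

If the get-dirty-pages side observes a gfn's dirty bit in "mask", the barrier
implied by xchg() pairs with the store ordering on the vcpu side, so the
writable spte is guaranteed to already be visible in the rmap.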

> 
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 35d4b50..0fe56ad 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>>  		}
>>  	}
>>  
>> -	if (pte_access & ACC_WRITE_MASK)
>> -		mark_page_dirty(vcpu->kvm, gfn);
>> -
>>  set_pte:
>>  	if (mmu_spte_update(sptep, spte))
>>  		kvm_flush_remote_tlbs(vcpu->kvm);
> 
> Here, there is a modified guest page without dirty log bit set (think
> another vcpu writing to the page via this spte).

This is okay since we call mark_page_dirty(vcpu->kvm, gfn) after that; this is
the same case as fast_pf_fix_direct_spte(), and I have explained above why it
is safe.



Marcelo Tosatti Aug. 8, 2013, 3:06 p.m. UTC | #5
On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> > On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> >> Make sure we can see the writable spte before the dirty bitmap is visible
> >>
> >> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> >> on the dirty bitmap, so we should ensure the writable spte can be found in the
> >> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
> >> but fail to write-protect the page.
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >> ---
> >>  arch/x86/kvm/mmu.c | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > Can you explain why this is safe, with regard to the rule 
> > at edde99ce05290e50 ?
> 
> BTW, this patch fixes this case:
> 
>      VCPU 0                KVM migration control
> 
>                            write-protects all pages
> #PF happens, then the page
> becomes writable; set dirty
> bit on the bitmap
> 
>                           swap the bitmap, current bitmap is empty
> 
> write the page (no dirty log)
> 
>                           stop the guest and push
>                           the remaining dirty pages
> Stopped
>                           See the current bitmap is empty, which
>                           means no page is dirty.
> > 
> > "The rule is that all pages are either dirty in the current bitmap,
> > or write-protected, which is violated here."
> 
> Actually, this rule is not completely true; there is a third case:
> the window between writing the guest page and setting the dirty bitmap.
> In that window, the page is freely writable but not dirty logged.
> 
> This case is based on the fact that at the final step of live migration,
> kvm stops the guest and pushes the remaining dirty pages to the
> destination.
> 
> There are some examples in the current code:
> example 1, in fast_pf_fix_direct_spte():
> 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> 		/* The window in here... */
> 		mark_page_dirty(vcpu->kvm, gfn);
> 
> example 2, in kvm_write_guest_page():
> 	r = __copy_to_user((void __user *)addr + offset, data, len);
> 	if (r)
> 		return -EFAULT;
> 	/*
> 	 * The window is here: the page is dirty but not logged in
> 	 * the bitmap.
> 	 */
> 	mark_page_dirty(kvm, gfn);
> 	return 0;
> 
> > 
> > Overall, please document what is the required order of operations for
> > both set_spte and get_dirty_log and why this order is safe (right on top
> > of the code).
> 
> Okay.
> 
> The order we require here is: we should 1) set the spte to writable __before__
> setting the dirty bitmap and 2) add the spte into the rmap __before__ setting
> the dirty bitmap.
> 
> Point 1) is the same as in fast_pf_fix_direct_spte(), which I explained above.
> Points 1) and 2) together ensure that we can find the spte in the rmap and see
> that it is writable when we do the lockless write-protection; otherwise these
> cases can happen:
> 
> VCPU 0					kvm ioctl doing get-dirty-pages
> 
> mark_page_dirty(gfn) which
> sets the gfn in the dirty bitmap
>                                       mask = xchg(dirty_bitmap, 0)
> 
>                                       walk all gfns which are set in "mask" and
>                                       locklessly write-protect the gfn,
>                                       then walk the rmap, see no spte in that rmap
> 
> add the spte into the rmap
> 
> !!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.
> 
> Or:
> 
> VCPU 0					kvm ioctl doing get-dirty-pages
> 
> mark_page_dirty(gfn) which
> sets the gfn in the dirty bitmap
> 
> add the spte into the rmap
>                                       mask = xchg(dirty_bitmap, 0)
> 
>                                       walk all gfns which are set in "mask" and
>                                       locklessly write-protect the gfn,
>                                       then walk the rmap, see the spte is in the
>                                       rmap but it is readonly or nonpresent.
> 
> Mark the spte writable
> 
> !!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.
> 
> Hopefully I have clarified it; if you have any questions, please let me know.

Yes, partially. Please, on top of the explanation above, add something along
the lines of:

"The flush IPI assumes that a thread switch happens in this order"
comment at arch/x86/mm/tlb.c

"With get_user_pages_fast, we walk down the pagetables without taking"
comment at arch/x86/mm/gup.c


What about the relation between read-only spte and respective TLB flush?
That is:

vcpu0					vcpu1

lockless write protect
mark spte read-only
					either some mmu-lockless path or not
					write protect page:
					find read-only page
					assumes tlb is flushed
				
kvm_flush_remote_tlbs


In general, I think write protection is a good candidate for lockless operation.
However, I would also be concerned about the larger problem of mmu-lock
contention and how to solve it. Did you ever consider splitting it?

> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 35d4b50..0fe56ad 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> >>  		}
> >>  	}
> >>  
> >> -	if (pte_access & ACC_WRITE_MASK)
> >> -		mark_page_dirty(vcpu->kvm, gfn);
> >> -
> >>  set_pte:
> >>  	if (mmu_spte_update(sptep, spte))
> >>  		kvm_flush_remote_tlbs(vcpu->kvm);
> > 
> > Here, there is a modified guest page without dirty log bit set (think
> > another vcpu writing to the page via this spte).
> 
> This is okay since we call mark_page_dirty(vcpu->kvm, gfn) after that; this is
> the same case as fast_pf_fix_direct_spte(), and I have explained above why it
> is safe.

Right. 
Xiao Guangrong Aug. 8, 2013, 4:26 p.m. UTC | #6
[ Posting again after adjusting the format, since the mailing list rejected my previous one. ]

On Aug 8, 2013, at 11:06 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
>> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
>>> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>>>> Make sure we can see the writable spte before the dirty bitmap is visible
>>>> 
>>>> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>>>> on the dirty bitmap, so we should ensure the writable spte can be found in the
>>>> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
>>>> but fail to write-protect the page.
>>>> 
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>>> ---
>>>> arch/x86/kvm/mmu.c | 6 +++---
>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>> 
>>> Can you explain why this is safe, with regard to the rule 
>>> at edde99ce05290e50 ?
>> 
>> BTW, this patch fixes this case:
>> 
>>     VCPU 0                KVM migration control
>> 
>>                           write-protects all pages
>> #PF happens, then the page
>> becomes writable; set dirty
>> bit on the bitmap
>> 
>>                          swap the bitmap, current bitmap is empty
>> 
>> write the page (no dirty log)
>> 
>>                          stop the guest and push
>>                          the remaining dirty pages
>> Stopped
>>                          See the current bitmap is empty, which
>>                          means no page is dirty.
>>> 
>>> "The rule is that all pages are either dirty in the current bitmap,
>>> or write-protected, which is violated here."
>> 
>> Actually, this rule is not completely true; there is a third case:
>> the window between writing the guest page and setting the dirty bitmap.
>> In that window, the page is freely writable but not dirty logged.
>> 
>> This case is based on the fact that at the final step of live migration,
>> kvm stops the guest and pushes the remaining dirty pages to the
>> destination.
>> 
>> There are some examples in the current code:
>> example 1, in fast_pf_fix_direct_spte():
>> 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>> 		/* The window in here... */
>> 		mark_page_dirty(vcpu->kvm, gfn);
>> 
>> example 2, in kvm_write_guest_page():
>> 	r = __copy_to_user((void __user *)addr + offset, data, len);
>> 	if (r)
>> 		return -EFAULT;
>> 	/*
>> 	 * The window is here: the page is dirty but not logged in
>> 	 * the bitmap.
>> 	 */
>> 	mark_page_dirty(kvm, gfn);
>> 	return 0;
>> 
>>> 
>>> Overall, please document what is the required order of operations for
>>> both set_spte and get_dirty_log and why this order is safe (right on top
>>> of the code).
>> 
>> Okay.
>> 
>> The order we require here is: we should 1) set the spte to writable __before__
>> setting the dirty bitmap and 2) add the spte into the rmap __before__ setting
>> the dirty bitmap.
>> 
>> Point 1) is the same as in fast_pf_fix_direct_spte(), which I explained above.
>> Points 1) and 2) together ensure that we can find the spte in the rmap and see
>> that it is writable when we do the lockless write-protection; otherwise these
>> cases can happen:
>> 
>> VCPU 0					kvm ioctl doing get-dirty-pages
>> 
>> mark_page_dirty(gfn) which
>> sets the gfn in the dirty bitmap
>>                                      mask = xchg(dirty_bitmap, 0)
>> 
>>                                      walk all gfns which are set in "mask" and
>>                                      locklessly write-protect the gfn,
>>                                      then walk the rmap, see no spte in that rmap
>> 
>> add the spte into the rmap
>> 
>> !!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.
>> 
>> Or:
>> 
>> VCPU 0					kvm ioctl doing get-dirty-pages
>> 
>> mark_page_dirty(gfn) which
>> sets the gfn in the dirty bitmap
>> 
>> add the spte into the rmap
>>                                      mask = xchg(dirty_bitmap, 0)
>> 
>>                                      walk all gfns which are set in "mask" and
>>                                      locklessly write-protect the gfn,
>>                                      then walk the rmap, see the spte is in the
>>                                      rmap but it is readonly or nonpresent.
>> 
>> Mark the spte writable
>> 
>> !!!!!! Then the page can be freely written but is not recorded in the dirty bitmap.
>> 
>> Hopefully I have clarified it; if you have any questions, please let me know.
> 
> Yes, partially. Please, on top of the explanation above, add something along
> the lines of:
> 
> "The flush IPI assumes that a thread switch happens in this order"
> comment at arch/x86/mm/tlb.c
> 
> "With get_user_pages_fast, we walk down the pagetables without taking"
> comment at arch/x86/mm/gup.c

Marcelo, thanks for your suggestion; I will improve both the changelog and
the comments in the next version.

> 
> 
> What about the relation between read-only spte and respective TLB flush?
> That is:
> 
> vcpu0					vcpu1
> 
> lockless write protect
> mark spte read-only
> 					either some mmu-lockless path or not
> 					write protect page:
> 					find read-only page
> 					assumes tlb is flushed

The tlb flush on the mmu-lockless paths does not depend on the spte. There are
two lockless paths: one is kvm_mmu_slot_remove_write_access(), which
unconditionally flushes the tlb; the other is kvm_vm_ioctl_get_dirty_log(),
which flushes the tlb based on the dirty bitmap (flush the tlb if there is a
dirty page).

Under the protection of mmu-lock, since we flush the tlb whenever a spte is
zapped, we only need to care about the case of spte updating, which is fixed
in "[PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified";
that patch changes the flush condition to

-	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+	if (spte_is_locklessly_modifiable(old_spte) &&
+	      !is_writable_pte(new_spte))

That means that whenever a spte which can potentially be locklessly modified
becomes readonly, we do the tlb flush.
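
For reference, spte_is_locklessly_modifiable() is essentially a check of the
two software bits (a sketch along the lines of this series):

	static bool spte_is_locklessly_modifiable(u64 spte)
	{
		/* Both software bits must be set for a lockless write fixup. */
		return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
			(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
	}

If either bit is clear, the fast page fault path will not restore write access
locklessly, so no extra flush is needed when such a spte becomes readonly.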

> 				
> kvm_flush_remote_tlbs
> 
> 
> In general, I think write protection is a good candidate for lockless operation.
> However, I would also be concerned about the larger problem of mmu-lock
> contention and how to solve it. Did you ever consider splitting it?

Yes, I did, and actually I am still working on that. :)

Marcelo Tosatti Nov. 20, 2013, 12:29 a.m. UTC | #7
On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> > On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
> >> Make sure we can see the writable spte before the dirty bitmap is visible
> >>
> >> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
> >> on the dirty bitmap, so we should ensure the writable spte can be found in the
> >> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
> >> but fail to write-protect the page.
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >> ---
> >>  arch/x86/kvm/mmu.c | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > Can you explain why this is safe, with regard to the rule 
> > at edde99ce05290e50 ?
> 
> BTW, this patch fixes this case:
> 
>      VCPU 0                KVM migration control
> 
>                            write-protects all pages
> #PF happens, then the page
> becomes writable; set dirty
> bit on the bitmap
> 
>                           swap the bitmap, current bitmap is empty
> 
> write the page (no dirty log)
> 
>                           stop the guest and push
>                           the remaining dirty pages
> Stopped
>                           See the current bitmap is empty, which
>                           means no page is dirty.
> > 
> > "The rule is that all pages are either dirty in the current bitmap,
> > or write-protected, which is violated here."
> 
> Actually, this rule is not completely true; there is a third case:
> the window between writing the guest page and setting the dirty bitmap.
> In that window, the page is freely writable but not dirty logged.
> 
> This case is based on the fact that at the final step of live migration,
> kvm stops the guest and pushes the remaining dirty pages to the
> destination.
> 
> There are some examples in the current code:
> example 1, in fast_pf_fix_direct_spte():
> 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> 		/* The window in here... */
> 		mark_page_dirty(vcpu->kvm, gfn);
> 
> example 2, in kvm_write_guest_page():
> 	r = __copy_to_user((void __user *)addr + offset, data, len);
> 	if (r)
> 		return -EFAULT;
> 	/*
> 	 * The window is here: the page is dirty but not logged in
> 	 * the bitmap.
> 	 */
> 	mark_page_dirty(kvm, gfn);
> 	return 0;

Why is this valid? That is, the obviously correct rule is

"that all pages are either dirty in the current bitmap,
or write-protected, which is violated here."

With the window above, GET_DIRTY_LOG can be called 100 times while the
page is dirty, but the corresponding bit is not set in the dirty bitmap.

It violates the documentation:

/* for KVM_GET_DIRTY_LOG */
struct kvm_dirty_log {
        __u32 slot;
        __u32 padding;
        union {
                void __user *dirty_bitmap; /* one bit per page */
                __u64 padding;
        };
};

Given a memory slot, return a bitmap containing any pages dirtied
since the last call to this ioctl.  Bit 0 is the first page in the
memory slot.  Ensure the entire structure is cleared to avoid padding
issues.

The point about migration is that GET_DIRTY_LOG is strictly correct
because it stops vcpus.

But what guarantee does userspace require from GET_DIRTY_LOG while vcpus are
executing?

With fast page fault:

       if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
               /* The window in here... */
               mark_page_dirty(vcpu->kvm, gfn);
 
And with the $SUBJECT set_spte reordering, the rule becomes:

A call to GET_DIRTY_LOG guarantees to return correct information about 
dirty pages before invocation of the previous GET_DIRTY_LOG call.

(see example 1: the next GET_DIRTY_LOG will return the dirty information
there).

That is the rule for sptes, at least, because kvm_write_guest does not match
the documentation at all.

So before example 1 and this patch, the rule (well, for sptes at least) was

"Given a memory slot, return a bitmap containing any pages dirtied
since the last call to this ioctl.  Bit 0 is the first page in the
memory slot.  Ensure the entire structure is cleared to avoid padding
issues."

Can you explain why it is OK to relax this rule?

Thanks

(reviewing the nulls desc patch...).


Marcelo Tosatti Nov. 20, 2013, 12:35 a.m. UTC | #8
On Tue, Nov 19, 2013 at 10:29:20PM -0200, Marcelo Tosatti wrote:
> A call to GET_DIRTY_LOG guarantees to return correct information about 
> dirty pages before invocation of the previous GET_DIRTY_LOG call.

> Can you explain why it is OK to relax this rule?

That is, this might be OK, but it is better to understand why it is.

Xiao Guangrong Nov. 20, 2013, 2:20 p.m. UTC | #9
On Nov 20, 2013, at 8:29 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
>> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
>>> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>>>> Make sure we can see the writable spte before the dirty bitmap is visible
>>>> 
>>>> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>>>> on the dirty bitmap, so we should ensure the writable spte can be found in the
>>>> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
>>>> but fail to write-protect the page.
>>>> 
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>>> ---
>>>> arch/x86/kvm/mmu.c | 6 +++---
>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>> 
>>> Can you explain why this is safe, with regard to the rule 
>>> at edde99ce05290e50 ?
>> 
>> BTW, this patch fixes this case:
>> 
>>    VCPU 0                KVM migration control
>> 
>>                          write-protects all pages
>> #PF happens, then the page
>> becomes writable; set dirty
>> bit on the bitmap
>> 
>>                         swap the bitmap, current bitmap is empty
>> 
>> write the page (no dirty log)
>> 
>>                         stop the guest and push
>>                         the remaining dirty pages
>> Stopped
>>                         See the current bitmap is empty, which
>>                         means no page is dirty.
>>> 
>>> "The rule is that all pages are either dirty in the current bitmap,
>>> or write-protected, which is violated here."
>> 
>> Actually, this rule is not completely true; there is a third case:
>> the window between writing the guest page and setting the dirty bitmap.
>> In that window, the page is freely writable but not dirty logged.
>> 
>> This case is based on the fact that at the final step of live migration,
>> kvm stops the guest and pushes the remaining dirty pages to the
>> destination.
>> 
>> There are some examples in the current code:
>> example 1, in fast_pf_fix_direct_spte():
>> 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>> 		/* The window in here... */
>> 		mark_page_dirty(vcpu->kvm, gfn);
>> 
>> example 2, in kvm_write_guest_page():
>> 	r = __copy_to_user((void __user *)addr + offset, data, len);
>> 	if (r)
>> 		return -EFAULT;
>> 	/*
>> 	 * The window is here: the page is dirty but not logged in
>> 	 * the bitmap.
>> 	 */
>> 	mark_page_dirty(kvm, gfn);
>> 	return 0;
> 

Hi Marcelo,

> Why is this valid? That is, the obviously correct rule is
> 
> "that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."
> 
> With the window above, GET_DIRTY_LOG can be called 100 times while the
> page is dirty, but the corresponding bit is not set in the dirty bitmap.
> 
> It violates the documentation:
> 
> /* for KVM_GET_DIRTY_LOG */
> struct kvm_dirty_log {
>       __u32 slot;
>       __u32 padding;
>       union {
>               void __user *dirty_bitmap; /* one bit per page */
>               __u64 padding;
>       };
> };
> 
> Given a memory slot, return a bitmap containing any pages dirtied
> since the last call to this ioctl.  Bit 0 is the first page in the
> memory slot.  Ensure the entire structure is cleared to avoid padding
> issues.
> 
> The point about migration is that GET_DIRTY_LOG is strictly correct
> because it stops vcpus.
> 
> But what guarantee does userspace require from GET_DIRTY_LOG while vcpus are
> executing?

Aha. A single call to GET_DIRTY_LOG is useless, since new dirty pages can be
generated while GET_DIRTY_LOG is returning. If the user wants to get the exact
dirty pages, the vcpus should be stopped.

> 
> With fast page fault:
> 
>      if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>              /* The window in here... */
>              mark_page_dirty(vcpu->kvm, gfn);
> 
> And with the $SUBJECT set_spte reordering, the rule becomes:
> 
> A call to GET_DIRTY_LOG guarantees to return correct information about 
> dirty pages before invocation of the previous GET_DIRTY_LOG call.
> 
> (see example 1: the next GET_DIRTY_LOG will return the dirty information
> there).
> 

It seems not.

The first GET_DIRTY_LOG can happen before the fast page fault, and
the second GET_DIRTY_LOG can happen in the window between cmpxchg()
and mark_page_dirty(); for the second one, the information is still "incorrect".

> That is the rule for sptes, at least, because kvm_write_guest does not match
> the documentation at all.

Do you mean that the "kvm_write_guest" case is valid (I do not know why it is)?
Or something else?

> 
> So before example 1 and this patch, the rule (well, for sptes at least) was
> 
> "Given a memory slot, return a bitmap containing any pages dirtied
> since the last call to this ioctl.  Bit 0 is the first page in the
> memory slot.  Ensure the entire structure is cleared to avoid padding
> issues."
> 
> Can you explain why it is OK to relax this rule?

It's because:
1) it doesn't break the current use cases, i.e. live migration and FB-flushing.
2) the current code, like kvm_write_guest, has already broken the documentation
   (the guest page has been written but is missing from the dirty bitmap).
3) it's needless to implement an exact get-dirty-pages, since the dirty pages
   cannot be exactly obtained without stopping the vcpus.

So I think we'd better document this case instead. No?


Marcelo Tosatti Nov. 20, 2013, 7:47 p.m. UTC | #10
On Wed, Nov 20, 2013 at 10:20:09PM +0800, Xiao Guangrong wrote:
> > But what guarantee does userspace require from GET_DIRTY_LOG while vcpus are
> > executing?
> 
> Aha. A single call to GET_DIRTY_LOG is useless, since new dirty pages can be
> generated while GET_DIRTY_LOG is returning. If the user wants to get the exact
> dirty pages, the vcpus should be stopped.
> 
> > 
> > With fast page fault:
> > 
> >      if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> >              /* The window in here... */
> >              mark_page_dirty(vcpu->kvm, gfn);
> > 
> > And with the $SUBJECT set_spte reordering, the rule becomes:
> > 
> > A call to GET_DIRTY_LOG guarantees to return correct information about 
> > dirty pages before invocation of the previous GET_DIRTY_LOG call.
> > 
> > (see example 1: the next GET_DIRTY_LOG will return the dirty information
> > there).
> > 
> 
> It seems not.
> 
> The first GET_DIRTY_LOG can happen before the fast page fault, and
> the second GET_DIRTY_LOG can happen in the window between cmpxchg()
> and mark_page_dirty(); for the second one, the information is still "incorrect".

It's correct for the previous GET_DIRTY_LOG call.

> > That is the rule for sptes, at least, because kvm_write_guest does not match
> > the documentation at all.
> 
> Do you mean that the "kvm_write_guest" case is valid (I do not know why it is)?
> Or something else?
> 
> > 
> > So before example 1 and this patch, the rule (well, for sptes at least) was
> > 
> > "Given a memory slot, return a bitmap containing any pages dirtied
> > since the last call to this ioctl.  Bit 0 is the first page in the
> > memory slot.  Ensure the entire structure is cleared to avoid padding
> > issues."
> > 
> > Can you explain why it is OK to relax this rule?
> 
> It's because:
> 1) it doesn't break the current use cases, i.e. live migration and FB-flushing.
> 2) the current code, like kvm_write_guest, has already broken the documentation
>    (the guest page has been written but is missing from the dirty bitmap).
> 3) it's needless to implement an exact get-dirty-pages, since the dirty pages
>    cannot be exactly obtained without stopping the vcpus.
> 
> So I think we'd better document this case instead. No?

Let's figure out the requirements, then. I don't understand why
FB-flushing is safe (think kvm-autotest: one pixel off and the entire
test fails).

Before fast page fault: pages are either write-protected or the
corresponding dirty bitmap bit is set. Any write faults to dirty-logged
sptes while GET_DIRTY_LOG is executing in the protected section are
allowed to instantiate a writable spte after GET_DIRTY_LOG is finished.

After fast page fault: pages can be writable with the dirty bitmap bit not
set. Therefore data can be dirty before GET_DIRTY_LOG executes and still
fail to be returned in the bitmap.

Since this patchset does not introduce a change in behaviour (fast pf
did), there is no reason to avoid merging it.

BTW, since GET_DIRTY_LOG is not required to report write faults concurrent
with GET_DIRTY_LOG, it is not necessary to rewalk the spte list, is it?



Xiao Guangrong Nov. 21, 2013, 4:26 a.m. UTC | #11
On Nov 21, 2013, at 3:47 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:

> On Wed, Nov 20, 2013 at 10:20:09PM +0800, Xiao Guangrong wrote:
>>> But what guarantee does userspace require from GET_DIRTY_LOG while vcpus are
>>> executing?
>> 
>> Aha. A single call to GET_DIRTY_LOG is useless, since new dirty pages can be
>> generated while GET_DIRTY_LOG is returning. If the user wants to get the exact
>> dirty pages, the vcpus should be stopped.
>> 
>>> 
>>> With fast page fault:
>>> 
>>>     if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>>>             /* The window in here... */
>>>             mark_page_dirty(vcpu->kvm, gfn);
>>> 
>>> And with the $SUBJECT set_spte reordering, the rule becomes:
>>> 
>>> A call to GET_DIRTY_LOG guarantees to return correct information about 
>>> dirty pages before invocation of the previous GET_DIRTY_LOG call.
>>> 
>>> (see example 1: the next GET_DIRTY_LOG will return the dirty information
>>> there).
>>> 
>> 
>> It seems not.
>> 
>> The first GET_DIRTY_LOG can happen before the fast page fault, and
>> the second GET_DIRTY_LOG can happen in the window between cmpxchg()
>> and mark_page_dirty(); for the second one, the information is still "incorrect".
> 
> It's correct for the previous GET_DIRTY_LOG call.

Oh, yes.

> 
>>> That is the rule for sptes, at least, because kvm_write_guest does not match
>>> the documentation at all.
>> 
>> Do you mean that the "kvm_write_guest" case is valid (I do not know why it is)?
>> Or something else?
>> 
>>> 
>>> So before example 1 and this patch, the rule (well, for sptes at least) was
>>> 
>>> "Given a memory slot, return a bitmap containing any pages dirtied
>>> since the last call to this ioctl.  Bit 0 is the first page in the
>>> memory slot.  Ensure the entire structure is cleared to avoid padding
>>> issues."
>>> 
>>> Can you explain why it is OK to relax this rule?
>> 
>> It's because:
>> 1) it doesn't break the current use cases, i.e. live migration and FB-flushing.
>> 2) the current code, like kvm_write_guest, has already broken the documentation
>>    (the guest page has been written but is missing from the dirty bitmap).
>> 3) it's needless to implement an exact get-dirty-pages, since the dirty pages
>>    cannot be exactly obtained without stopping the vcpus.
>> 
>> So I think we'd better document this case instead. No?
> 
> Let's figure out the requirements, then. I don't understand why
> FB-flushing is safe (think kvm-autotest: one pixel off and the entire
> test fails).

I did not read the FB-flushing code; I guess the reason it can work is that
FB-flushing does periodic get-dirty-pages and flushes them. After the guest
writes the page, the page will be flushed in the next GET_DIRTY_LOG.

> 
> Before fast page fault: pages are either write-protected or the
> corresponding dirty bitmap bit is set. Any write faults to dirty-logged
> sptes while GET_DIRTY_LOG is executing in the protected section are
> allowed to instantiate a writable spte after GET_DIRTY_LOG is finished.
> 
> After fast page fault: pages can be writable with the dirty bitmap bit not
> set. Therefore data can be dirty before GET_DIRTY_LOG executes and still
> fail to be returned in the bitmap.

That's right. The current GET_DIRTY_LOG can fail to get the dirty page, but the
next GET_DIRTY_LOG will get it properly, since the current one needs to flush
all TLBs, which waits for the fast page fault to finish.

I do not think it's a big problem, since a single GET_DIRTY_LOG is useless, as
I explained in the previous mail.

> 
> Since this patchset does not introduce a change in behaviour (fast pf
> did), there is no reason to avoid merging it.

Yes, thank you, Marcelo! :)

> 
> BTW, since GET_DIRTY_LOG is not required to report write faults concurrent
> with GET_DIRTY_LOG, it is not necessary to rewalk the spte list, is it?

Do you mean the "rewalk" we introduced in pte_list_walk_lockless() in this
patchset? I think this rewalk is needed: it is triggered by meeting an
unexpected nulls value, which means we are walking an unexpected rmap. If we
do not do this, some writable sptes will be missed.
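
The shape of the walk is roughly the following (a simplified sketch in the
style of other list_nulls users, not the exact code from the patch;
handle_spte() and expected_nulls_value() are made-up names):

	rcu_read_lock();
restart:
	for (pos = ACCESS_ONCE(head->first); !is_a_nulls(pos);
	     pos = ACCESS_ONCE(pos->next))
		handle_spte(pos);
	/*
	 * If the nulls value we reached does not match the rmap we started
	 * from, the desc was moved to another rmap during the walk; restart
	 * so that no writable spte is missed.
	 */
	if (get_nulls_value(pos) != expected_nulls_value(head))
		goto restart;
	rcu_read_unlock();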





Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 35d4b50..0fe56ad 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2486,12 +2486,12 @@  static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 
-	if (pte_access & ACC_WRITE_MASK)
-		mark_page_dirty(vcpu->kvm, gfn);
-
 set_pte:
 	if (mmu_spte_update(sptep, spte))
 		kvm_flush_remote_tlbs(vcpu->kvm);
+
+	if (pte_access & ACC_WRITE_MASK)
+		mark_page_dirty(vcpu->kvm, gfn);
 done:
 	return ret;
 }