[v3,04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

Message ID 1382534973-13197-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong Oct. 23, 2013, 1:29 p.m. UTC
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
  only change the spte from writable to readonly, so the only case we need to
  care about is changing a spte from present to present (changing a spte from
  present to nonpresent flushes all the TLBs immediately); in other words,
  the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we have checked
  SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
  means it no longer depends on PT_WRITABLE_MASK (see the sketch below)
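
A minimal sketch of the check that bullet refers to (the same lines are
quoted from mmu_spte_update() later in this thread; names as used in this
series):

	/*
	 * The flush decision is keyed off the lockless-modifiable bits,
	 * not the hardware writable bit, so a spte that went from
	 * writable to readonly still triggers a (deferred) flush.
	 */
	if (spte_is_locklessly_modifiable(old_spte) &&
	    !is_writable_pte(new_spte))
		ret = true;	/* caller flushes TLBs, possibly outside mmu_lock */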

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 18 ++++++++++++++----
 arch/x86/kvm/x86.c |  9 +++++++--
 2 files changed, 21 insertions(+), 6 deletions(-)

Comments

Marcelo Tosatti Nov. 14, 2013, 12:36 a.m. UTC | #1
On Wed, Oct 23, 2013 at 09:29:22PM +0800, Xiao Guangrong wrote:
> Now we can flush all the TLBs out of the mmu lock without TLB corruption when
> write-protecting the sptes, because:
> - we have marked large sptes readonly instead of dropping them, which means we
>   only change the spte from writable to readonly, so the only case we need to
>   care about is changing a spte from present to present (changing a spte from
>   present to nonpresent flushes all the TLBs immediately); in other words,
>   the only case we need to care about is mmu_spte_update()

Xiao,

Any code location which reads the writable bit in the spte and assumes that,
if it's not set, the translation the spte refers to is not cached in a
remote CPU's TLB can become buggy. (*)
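
To spell out the subtlety, the race looks like this (a sketch, not code
from the patch):

	/*
	 *  CPU 0 (lockless write-protect)    CPU 1 (spte reader)
	 *  clear PT_WRITABLE_MASK in spte
	 *                                    sees !is_writable_pte(spte),
	 *                                    assumes no writable TLB entry;
	 *                                    wrong until the deferred
	 *  kvm_flush_remote_tlbs()           flush actually runs
	 */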

It might be the case that it's not an issue now, but it is so subtle that
it should be improved.

Can you add a fat comment on top of is_writable_pte describing this?
(And explain why is_writable_pte users do not make the assumption (*).)

"Writeable bit of locklessly modifiable sptes might be cleared
but TLBs not flushed: so whenever reading locklessly modifiable sptes
you cannot assume TLBs are flushed".
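
A sketch of how that could look (the is_writable_pte() body is the existing
one-liner in mmu.c; the comment text is the above):

	/*
	 * Writeable bit of locklessly modifiable sptes might be cleared
	 * but TLBs not flushed: so whenever reading locklessly modifiable
	 * sptes you cannot assume TLBs are flushed.
	 */
	static int is_writable_pte(unsigned long pte)
	{
		return pte & PT_WRITABLE_MASK;
	}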

For example, this one is unclear:

                if (!can_unsync && is_writable_pte(*sptep))
                        goto set_pte;
And:

        if (!is_writable_pte(spte) &&
              !(pt_protect && spte_is_locklessly_modifiable(spte)))
                return false;

This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
serialized by a single mutex (if there were two mutexes, it would not be
safe). Can you add an assert to both
kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log
that (slots_lock) is locked, and explain? (See the sketch below.)
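
A minimal sketch of such an assert (assuming lockdep is enabled;
lockdep_assert_held() also works on mutexes such as slots_lock):

	void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
	{
		/*
		 * The deferred TLB flush is only safe because this path and
		 * kvm_vm_ioctl_get_dirty_log() are serialized by slots_lock.
		 */
		lockdep_assert_held(&kvm->slots_lock);
		...
	}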

So just improve the comments please, thanks (no need to resend whole
series).

Xiao Guangrong Nov. 14, 2013, 5:15 a.m. UTC | #2
Hi Marcelo,

On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:

> [...]
> 
> So just improve the comments please, thanks (no need to resend whole
> series).

Thank you very much for taking the time to review it; I really appreciate
how clearly you detailed the issue for me.

I will do it on top of this patchset or after it is merged
(if that's possible).
Marcelo Tosatti Nov. 14, 2013, 6:39 p.m. UTC | #3
On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
> 
> Hi Marcelo,
> 
> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
> 
> > [...]
> 
> Thank you very much for taking the time to review it; I really appreciate
> how clearly you detailed the issue for me.
> 
> I will do it on top of this patchset or after it is merged
> (if that's possible).

Ok, can you explain why no individual caller of is_writable_pte makes such
an assumption now? (The one mentioned above is not clear to me, for example;
please explain all of them.)

OK to improve comments later.

Xiao Guangrong Nov. 15, 2013, 7:09 a.m. UTC | #4
On 11/15/2013 02:39 AM, Marcelo Tosatti wrote:
> On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
>>
>>>
>>> Any code location which reads the writable bit in the spte and assumes if its not
>>> set, that the translation which the spte refers to is not cached in a
>>> remote CPU's TLB can become buggy. (*)
>>>
>>> It might be the case that now its not an issue, but its so subtle that
>>> it should be improved.
>>>
>>> Can you add a fat comment on top of is_writeable_bit describing this?
>>> (and explain why is_writable_pte users do not make an assumption
>>> about (*). 
>>>
>>> "Writeable bit of locklessly modifiable sptes might be cleared
>>> but TLBs not flushed: so whenever reading locklessly modifiable sptes
>>> you cannot assume TLBs are flushed".
>>>
>>> For example this one is unclear:
>>>
>>>                 if (!can_unsync && is_writable_pte(*sptep))
>>>                         goto set_pte;
>>> And:
>>>
>>>         if (!is_writable_pte(spte) &&
>>>               !(pt_protect && spte_is_locklessly_modifiable(spte)))
>>>                 return false;
>>>
>>> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
>>> serialized by a single mutex (if there were two mutexes, it would not be
>>> safe). Can you add an assert to both
>>> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log 
>>> for (slots_lock) is locked, and explain?
>>>
>>> So just improve the comments please, thanks (no need to resend whole
>>> series).
>>
>> Thank you very much for your time to review it and really appreciate
>> for you detailed the issue so clearly to me.
>>
>> I will do it on the top of this patchset or after it is merged
>> (if it's possiable).
> 
> Ok, can you explain why no individual caller of is_writable_pte makes such
> an assumption now? (The one mentioned above is not clear to me, for example;
> please explain all of them.)

Okay.

Generally speaking, 1) we needn't care much about readonly sptes, since they
cannot be locklessly write-protected, and 2) where is_writable_pte() is used
to check the mmu-mode's state, we can check SPTE_MMU_WRITEABLE instead.
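
For reference, a sketch of the helper the reasoning below leans on (body as
it looks in this series; is_writable_pte() only tests the hardware
PT_WRITABLE_MASK bit):

	static bool spte_is_locklessly_modifiable(u64 spte)
	{
		/* host- and mmu-writable: may regain PT_WRITABLE_MASK locklessly */
		return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
			(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
	}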

These are the places where is_writable_pte() is used:
1) in spte_has_volatile_bits():
static bool spte_has_volatile_bits(u64 spte)
{
	/*
	 * Always atomicly update spte if it can be updated
	 * out of mmu-lock, it can ensure dirty bit is not lost,
	 * also, it can help us to get a stable is_writable_pte()
	 * to ensure tlb flush is not missed.
	 */
	if (spte_is_locklessly_modifiable(spte))
		return true;

	if (!shadow_accessed_mask)
		return false;

	if (!is_shadow_present_pte(spte))
		return false;

	if ((spte & shadow_accessed_mask) &&
	      (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
		return false;

	return true;
}

this path is not broken, since any spte that can be locklessly modified takes
the lockless-update path (the spte_is_locklessly_modifiable() check above
always returns 'true' for it).

2) in mmu_spte_update():
	/*
	 * For the spte updated out of mmu-lock is safe, since
	 * we always atomicly update it, see the comments in
	 * spte_has_volatile_bits().
	 */
	if (spte_is_locklessly_modifiable(old_spte) &&
	      !is_writable_pte(new_spte))
		ret = true;

new_spte is a temporary value that cannot be observed by the lockless
write-protection path, and !is_writable_pte() on it is stable enough (it
cannot be locklessly write-protected).

3) in spte_write_protect():
	if (!is_writable_pte(spte) &&
	      !spte_is_locklessly_modifiable(spte))
		return false;


It always does the write-protection if the spte is locklessly modifiable.
(This is how the code looks after applying the whole patchset; the code is
also safe before patch "[PATCH v3 14/15] KVM: MMU: clean up
spte_write_protect", since the lockless write-protection path is serialized
by a single lock.)

4) in set_spte():
	/*
	 * Optimization: for pte sync, if spte was writable the hash
	 * lookup is unnecessary (and expensive). Write protection
	 * is responsibility of mmu_get_page / kvm_sync_page.
	 * Same reasoning can be applied to dirty page accounting.
	 */
	if (!can_unsync && is_writable_pte(*sptep))
		goto set_pte;

It is used as an optimization, and the worst case is that the optimization is
disabled (we walk the shadow pages in the hash table) when the spte has been
locklessly write-protected. That does not hurt anything, since it is a rare
event. The optimization could be restored by checking SPTE_MMU_WRITEABLE
instead, as the sketch below shows.
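
A sketch of that alternative check (not part of this patchset, just
illustrating the idea):

	if (!can_unsync && (*sptep & SPTE_MMU_WRITEABLE))
		goto set_pte;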

5) in fast_page_fault():
	/*
	 * Check if it is a spurious fault caused by TLB lazily flushed.
	 *
	 * Need not check the access of upper level table entries since
	 * they are always ACC_ALL.
	 */
	if (is_writable_pte(spte)) {
		ret = true;
		goto exit;
	}

Since kvm_vm_ioctl_get_dirty_log() first gets and clears the dirty bitmap
before doing the write-protection, the dirty bitmap will be properly set
again when fast_page_fault() fixes a spte that was write-protected by the
lockless write-protection, as sketched below.
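
The fix is applied with a cmpxchg, which is what keeps the spte update and
the dirty-log update consistent with the lockless write-protector (a sketch
of the shape of fast_pf_fix_direct_spte(), as I recall it in this series):

	/* only mark the page dirty if we win the race to restore the bit */
	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
		mark_page_dirty(vcpu->kvm, gfn);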

6) in fast_page_fault's tracepoint:
#define __spte_satisfied(__spte)				\
	(__entry->retry && is_writable_pte(__entry->__spte))
This can cause the tracepoint to report a wrong result when fast_page_fault
and tdp_page_fault/lockless write-protection run concurrently; I guess that's
okay since it is only used for tracing.

7) in audit_write_protection():
	if (is_writable_pte(*sptep))
		audit_printk(kvm, "shadow page has writable "
			     "mappings: gfn %llx role %x\n",
			     sp->gfn, sp->role.word);
It's okay, since the lockless write-protection does not update readonly sptes.

> 
> OK to improve comments later.

Thank you, Marcelo!


Marcelo Tosatti Nov. 19, 2013, 12:19 a.m. UTC | #5
On Fri, Nov 15, 2013 at 03:09:13PM +0800, Xiao Guangrong wrote:
> [...]
> 
> Okay.

OK, thanks for checking.


Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62f18ec..337d173 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4273,15 +4273,25 @@  void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 			if (*rmapp)
 				__rmap_write_protect(kvm, rmapp, false);
 
-			if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-				kvm_flush_remote_tlbs(kvm);
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
 				cond_resched_lock(&kvm->mmu_lock);
-			}
 		}
 	}
 
-	kvm_flush_remote_tlbs(kvm);
 	spin_unlock(&kvm->mmu_lock);
+
+	/*
+	 * We can flush all the TLBs out of the mmu lock without TLB
+	 * corruption since we just change the spte from writable to
+	 * readonly, so we only need to care about the case of changing
+	 * the spte from present to present (changing the spte from present
+	 * to nonpresent will flush all the TLBs immediately); in other
+	 * words, the only case we care about is mmu_spte_update(), where we
+	 * have checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+	 * instead of PT_WRITABLE_MASK, which means it does not depend
+	 * on PT_WRITABLE_MASK anymore.
+	 */
+	kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES	10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3aa650..573c6b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3571,11 +3571,16 @@  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 		offset = i * BITS_PER_LONG;
 		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
 	}
-	if (is_dirty)
-		kvm_flush_remote_tlbs(kvm);
 
 	spin_unlock(&kvm->mmu_lock);
 
+	/*
+	 * All the TLBs can be flushed out of mmu lock, see the comments in
+	 * kvm_mmu_slot_remove_write_access().
+	 */
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
 	r = -EFAULT;
 	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
 		goto out;