Message ID | 1382534973-13197-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
---|---
State | New, archived
On Wed, Oct 23, 2013 at 09:29:22PM +0800, Xiao Guangrong wrote:
> Now we can flush all the TLBs out of the mmu lock without TLB corruption when
> write-protecting the sptes, because:
> - we have marked large sptes readonly instead of dropping them, which means we
>   just change the spte from writable to readonly, so we only need to care about
>   the case of changing a spte from present to present (changing the spte from
>   present to nonpresent will flush all the TLBs immediately); in other words,
>   the only case we need to care about is mmu_spte_update()

Xiao,

Any code location which reads the writable bit in the spte and assumes that, if it is not set, the translation the spte refers to is not cached in a remote CPU's TLB, can become buggy. (*)

It might be the case that this is not an issue now, but it is so subtle that it should be improved.

Can you add a fat comment on top of is_writeable_bit describing this? (And explain why is_writable_pte users do not make an assumption about (*).)

"Writeable bit of locklessly modifiable sptes might be cleared but TLBs not flushed: so whenever reading locklessly modifiable sptes you cannot assume TLBs are flushed."

For example this one is unclear:

	if (!can_unsync && is_writable_pte(*sptep))
		goto set_pte;

And:

	if (!is_writable_pte(spte) &&
	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
		return false;

This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are serialized by a single mutex (if there were two mutexes, it would not be safe). Can you add an assert to both kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log that (slots_lock) is locked, and explain why?

So just improve the comments please, thanks (no need to resend the whole series).
> - in mmu_spte_update(), we have checked
>   SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
>   means it does not depend on PT_WRITABLE_MASK anymore
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 18 ++++++++++++++----
>  arch/x86/kvm/x86.c |  9 +++++++--
>  2 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 62f18ec..337d173 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4273,15 +4273,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>  			if (*rmapp)
>  				__rmap_write_protect(kvm, rmapp, false);
>
> -			if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> -				kvm_flush_remote_tlbs(kvm);
> +			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>  				cond_resched_lock(&kvm->mmu_lock);
> -			}
>  		}
>  	}
>
> -	kvm_flush_remote_tlbs(kvm);
>  	spin_unlock(&kvm->mmu_lock);
> +
> +	/*
> +	 * We can flush all the TLBs out of the mmu lock without TLB
> +	 * corruption since we just change the spte from writable to
> +	 * readonly so that we only need to care the case of changing
> +	 * spte from present to present (changing the spte from present
> +	 * to nonpresent will flush all the TLBs immediately), in other
> +	 * words, the only case we care is mmu_spte_update() where we
> +	 * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
> +	 * instead of PT_WRITABLE_MASK, that means it does not depend
> +	 * on PT_WRITABLE_MASK anymore.
> +	 */
> +	kvm_flush_remote_tlbs(kvm);
>  }
>
>  #define BATCH_ZAP_PAGES			10
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b3aa650..573c6b3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3571,11 +3571,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  		offset = i * BITS_PER_LONG;
>  		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
>  	}
> -	if (is_dirty)
> -		kvm_flush_remote_tlbs(kvm);
>
>  	spin_unlock(&kvm->mmu_lock);
>
> +	/*
> +	 * All the TLBs can be flushed out of mmu lock, see the comments in
> +	 * kvm_mmu_slot_remove_write_access().
> +	 */
> +	if (is_dirty)
> +		kvm_flush_remote_tlbs(kvm);
> +
>  	r = -EFAULT;
>  	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
>  		goto out;
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Marcelo,

On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
>
> Any code location which reads the writable bit in the spte and assumes if its not
> set, that the translation which the spte refers to is not cached in a
> remote CPU's TLB can become buggy. (*)
>
> It might be the case that now its not an issue, but its so subtle that
> it should be improved.
>
> Can you add a fat comment on top of is_writeable_bit describing this?
> (and explain why is_writable_pte users do not make an assumption
> about (*).
>
> "Writeable bit of locklessly modifiable sptes might be cleared
> but TLBs not flushed: so whenever reading locklessly modifiable sptes
> you cannot assume TLBs are flushed".
>
> For example this one is unclear:
>
> 	if (!can_unsync && is_writable_pte(*sptep))
> 		goto set_pte;
>
> And:
>
> 	if (!is_writable_pte(spte) &&
> 	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
> 		return false;
>
> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
> serialized by a single mutex (if there were two mutexes, it would not be
> safe). Can you add an assert to both
> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log
> for (slots_lock) is locked, and explain?
>
> So just improve the comments please, thanks (no need to resend whole
> series).

Thank you very much for taking the time to review this; I really appreciate how clearly you detailed the issue for me.

I will do it on top of this patchset, or after it is merged (if that's possible).
On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
>
> Hi Marcelo,
>
> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
>
> >
> > Any code location which reads the writable bit in the spte and assumes if its not
> > set, that the translation which the spte refers to is not cached in a
> > remote CPU's TLB can become buggy. (*)
> >
> > It might be the case that now its not an issue, but its so subtle that
> > it should be improved.
> >
> > Can you add a fat comment on top of is_writeable_bit describing this?
> > (and explain why is_writable_pte users do not make an assumption
> > about (*).
> >
> > "Writeable bit of locklessly modifiable sptes might be cleared
> > but TLBs not flushed: so whenever reading locklessly modifiable sptes
> > you cannot assume TLBs are flushed".
> >
> > For example this one is unclear:
> >
> > 	if (!can_unsync && is_writable_pte(*sptep))
> > 		goto set_pte;
> >
> > And:
> >
> > 	if (!is_writable_pte(spte) &&
> > 	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
> > 		return false;
> >
> > This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
> > serialized by a single mutex (if there were two mutexes, it would not be
> > safe). Can you add an assert to both
> > kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log
> > for (slots_lock) is locked, and explain?
> >
> > So just improve the comments please, thanks (no need to resend whole
> > series).
>
> Thank you very much for your time to review it and really appreciate
> for you detailed the issue so clearly to me.
>
> I will do it on the top of this patchset or after it is merged
> (if it's possiable).

OK, can you explain why every individual caller of is_writable_pte has no such assumption now? (The one mentioned above is not clear to me, for example; please explain all of them.)

OK to improve the comments later.
On 11/15/2013 02:39 AM, Marcelo Tosatti wrote:
> On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
>>
>>>
>>> Any code location which reads the writable bit in the spte and assumes if its not
>>> set, that the translation which the spte refers to is not cached in a
>>> remote CPU's TLB can become buggy. (*)
>>>
>>> It might be the case that now its not an issue, but its so subtle that
>>> it should be improved.
>>>
>>> Can you add a fat comment on top of is_writeable_bit describing this?
>>> (and explain why is_writable_pte users do not make an assumption
>>> about (*).
>>>
>>> "Writeable bit of locklessly modifiable sptes might be cleared
>>> but TLBs not flushed: so whenever reading locklessly modifiable sptes
>>> you cannot assume TLBs are flushed".
>>>
>>> For example this one is unclear:
>>>
>>> 	if (!can_unsync && is_writable_pte(*sptep))
>>> 		goto set_pte;
>>>
>>> And:
>>>
>>> 	if (!is_writable_pte(spte) &&
>>> 	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
>>> 		return false;
>>>
>>> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
>>> serialized by a single mutex (if there were two mutexes, it would not be
>>> safe). Can you add an assert to both
>>> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log
>>> for (slots_lock) is locked, and explain?
>>>
>>> So just improve the comments please, thanks (no need to resend whole
>>> series).
>>
>> Thank you very much for your time to review it and really appreciate
>> for you detailed the issue so clearly to me.
>>
>> I will do it on the top of this patchset or after it is merged
>> (if it's possiable).
>
> Ok, can you explain why every individual caller of is_writable_pte have
> no such assumption now? (the one mentioned above is not clear to me for
> example, should explain all of them).

Okay.
Generally speaking, we 1) needn't care about readonly sptes too much, since they cannot be locklessly write-protected, and 2) if is_writable_pte() is used to check the mmu-mode's state, we can check SPTE_MMU_WRITEABLE instead.

These are the places where is_writable_pte is used:

1) in spte_has_volatile_bits():

527 static bool spte_has_volatile_bits(u64 spte)
528 {
529         /*
530          * Always atomicly update spte if it can be updated
531          * out of mmu-lock, it can ensure dirty bit is not lost,
532          * also, it can help us to get a stable is_writable_pte()
533          * to ensure tlb flush is not missed.
534          */
535         if (spte_is_locklessly_modifiable(spte))
536                 return true;
537
538         if (!shadow_accessed_mask)
539                 return false;
540
541         if (!is_shadow_present_pte(spte))
542                 return false;
543
544         if ((spte & shadow_accessed_mask) &&
545               (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
546                 return false;
547
548         return true;
549 }

This path is not broken, since any spte that can be locklessly modified will do a lockless update (it will always return 'true' at line 536).

2) in mmu_spte_update():

594         /*
595          * For the spte updated out of mmu-lock is safe, since
596          * we always atomicly update it, see the comments in
597          * spte_has_volatile_bits().
598          */
599         if (spte_is_locklessly_modifiable(old_spte) &&
600               !is_writable_pte(new_spte))
601                 ret = true;

The new_spte is a temp value that cannot be fetched by the lockless write-protection, and !is_writable_pte() is stable enough (it cannot be locklessly write-protected).

3) in spte_write_protect():

1368         if (!is_writable_pte(spte) &&
1369               !spte_is_locklessly_modifiable(spte))
1370                 return false;
1371

It always does write-protection if the spte is locklessly modifiable.
(This code is its state after applying the whole patchset; the code is safe before the patch "[PATCH v3 14/15] KVM: MMU: clean up spte_write_protect" too, since the lockless write-protection path is serialized by a single lock.)

4) in set_spte():

2690         /*
2691          * Optimization: for pte sync, if spte was writable the hash
2692          * lookup is unnecessary (and expensive). Write protection
2693          * is responsibility of mmu_get_page / kvm_sync_page.
2694          * Same reasoning can be applied to dirty page accounting.
2695          */
2696         if (!can_unsync && is_writable_pte(*sptep))
2697                 goto set_pte;

It is used for an optimization, and the worst case is that the optimization is disabled (walking the shadow pages in the hash table) when the spte has been locklessly write-protected. It does not hurt anything since it is a rare event, and the optimization can be brought back if we check SPTE_MMU_WRITEABLE instead.

5) in fast_page_fault():

3110         /*
3111          * Check if it is a spurious fault caused by TLB lazily flushed.
3112          *
3113          * Need not check the access of upper level table entries since
3114          * they are always ACC_ALL.
3115          */
3116         if (is_writable_pte(spte)) {
3117                 ret = true;
3118                 goto exit;
3119         }

Since kvm_vm_ioctl_get_dirty_log() first gets-and-clears the dirty bitmap before doing the write-protection, the dirty bitmap will be properly set again when fast_page_fault fixes a spte that was write-protected by the lockless write-protection.

6) in fast_page_fault's tracepoint:

244 #define __spte_satisfied(__spte)                                \
245         (__entry->retry && is_writable_pte(__entry->__spte))

It causes the tracepoint to report a wrong result when fast_page_fault and tdp_page_fault/lockless-write-protect run concurrently; I guess that's okay since it's only used for tracing.

7) in audit_write_protection():

202                 if (is_writable_pte(*sptep))
203                         audit_printk(kvm, "shadow page has writable "
204                                      "mappings: gfn %llx role %x\n",
205                                      sp->gfn, sp->role.word);

It's okay since the lockless write-protection does not update readonly sptes.

>
> OK to improve comments later.

Thank you, Marcelo!
On Fri, Nov 15, 2013 at 03:09:13PM +0800, Xiao Guangrong wrote:
> On 11/15/2013 02:39 AM, Marcelo Tosatti wrote:
> > On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
> >>
> >> Hi Marcelo,
> >>
> >> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
> >>
> >>>
> >>> Any code location which reads the writable bit in the spte and assumes if its not
> >>> set, that the translation which the spte refers to is not cached in a
> >>> remote CPU's TLB can become buggy. (*)
> >>>
> >>> It might be the case that now its not an issue, but its so subtle that
> >>> it should be improved.
> >>>
> >>> Can you add a fat comment on top of is_writeable_bit describing this?
> >>> (and explain why is_writable_pte users do not make an assumption
> >>> about (*).
> >>>
> >>> "Writeable bit of locklessly modifiable sptes might be cleared
> >>> but TLBs not flushed: so whenever reading locklessly modifiable sptes
> >>> you cannot assume TLBs are flushed".
> >>>
> >>> For example this one is unclear:
> >>>
> >>> 	if (!can_unsync && is_writable_pte(*sptep))
> >>> 		goto set_pte;
> >>>
> >>> And:
> >>>
> >>> 	if (!is_writable_pte(spte) &&
> >>> 	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
> >>> 		return false;
> >>>
> >>> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
> >>> serialized by a single mutex (if there were two mutexes, it would not be
> >>> safe). Can you add an assert to both
> >>> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log
> >>> for (slots_lock) is locked, and explain?
> >>>
> >>> So just improve the comments please, thanks (no need to resend whole
> >>> series).
> >>
> >> Thank you very much for your time to review it and really appreciate
> >> for you detailed the issue so clearly to me.
> >>
> >> I will do it on the top of this patchset or after it is merged
> >> (if it's possiable).
> >
> > Ok, can you explain why every individual caller of is_writable_pte have
> > no such assumption now?
> > (the one mentioned above is not clear to me for
> > example, should explain all of them).
>
> Okay.

OK, thanks for checking.

> Generally speak, we 1) needn't care readonly spte too much since it
> can not be locklessly write-protected and 2) if is_writable_pte() is used
> to check mmu-mode's state we can check SPTE_MMU_WRITEABLE instead.
>
> There are the places is_writable_pte is used:
>
> 1) in spte_has_volatile_bits():
>
> 527 static bool spte_has_volatile_bits(u64 spte)
> 528 {
> 529         /*
> 530          * Always atomicly update spte if it can be updated
> 531          * out of mmu-lock, it can ensure dirty bit is not lost,
> 532          * also, it can help us to get a stable is_writable_pte()
> 533          * to ensure tlb flush is not missed.
> 534          */
> 535         if (spte_is_locklessly_modifiable(spte))
> 536                 return true;
> 537
> 538         if (!shadow_accessed_mask)
> 539                 return false;
> 540
> 541         if (!is_shadow_present_pte(spte))
> 542                 return false;
> 543
> 544         if ((spte & shadow_accessed_mask) &&
> 545               (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
> 546                 return false;
> 547
> 548         return true;
> 549 }
>
> this path is not broken since any spte can be lockless modifiable will do
> lockless update (will always return 'true' in the line 536).
>
> 2): in mmu_spte_update()
>
> 594         /*
> 595          * For the spte updated out of mmu-lock is safe, since
> 596          * we always atomicly update it, see the comments in
> 597          * spte_has_volatile_bits().
> 598          */
> 599         if (spte_is_locklessly_modifiable(old_spte) &&
> 600               !is_writable_pte(new_spte))
> 601                 ret = true;
>
> The new_spte is a temp value that can not be fetched by lockless
> write-protection and !is_writable_pte() is stable enough (can not be
> locklessly write-protected).
>
> 3) in spte_write_protect()
>
> 1368         if (!is_writable_pte(spte) &&
> 1369               !spte_is_locklessly_modifiable(spte))
> 1370                 return false;
> 1371
>
> It always do write-protection if the spte is lockelss modifiable.
> (This code is the aspect after applying the whole pachset, the code is safe too
> before patch "[PATCH v3 14/15] KVM: MMU: clean up spte_write_protect" since
> the lockless write-protection path is serialized by a single lock.).
>
> 4) in set_spte()
>
> 2690         /*
> 2691          * Optimization: for pte sync, if spte was writable the hash
> 2692          * lookup is unnecessary (and expensive). Write protection
> 2693          * is responsibility of mmu_get_page / kvm_sync_page.
> 2694          * Same reasoning can be applied to dirty page accounting.
> 2695          */
> 2696         if (!can_unsync && is_writable_pte(*sptep))
> 2697                 goto set_pte;
>
> It is used for a optimization and the worst case is the optimization is disabled
> (walking the shadow pages in the hast table) when the spte has been locklessly
> write-protected. It does not hurt anything since it is a rare event. And the
> optimization can be back if we check SPTE_MMU_WRITEABLE instead.
>
> 5) fast_page_fault()
>
> 3110         /*
> 3111          * Check if it is a spurious fault caused by TLB lazily flushed.
> 3112          *
> 3113          * Need not check the access of upper level table entries since
> 3114          * they are always ACC_ALL.
> 3115          */
> 3116         if (is_writable_pte(spte)) {
> 3117                 ret = true;
> 3118                 goto exit;
> 3119         }
>
> Since kvm_vm_ioctl_get_dirty_log() firstly get-and-clear dirty-bitmap before
> do write-protect, the dirty-bitmap will be properly set again when fast_page_fault
> fix the spte who is write-protected by lockless write-protection.
>
> 6) in fast_page_fault's tracepoint:
>
> 244 #define __spte_satisfied(__spte)                                \
> 245         (__entry->retry && is_writable_pte(__entry->__spte))
>
> It causes the tracepoint reports the wrong result when fast_page_fault
> and tdp_page_fault/lockless-write-protect run concurrently, i guess it's
> okay since it's only used for trace.
>
> 7) in audit_write_protection():
>
> 202                 if (is_writable_pte(*sptep))
> 203                         audit_printk(kvm, "shadow page has writable "
> 204                                      "mappings: gfn %llx role %x\n",
> 205                                      sp->gfn, sp->role.word);
>
> It's okay since lockless-write-protection does not update the readonly sptes.
>
> >
> > OK to improve comments later.
>
> Thank you, Marcelo!
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62f18ec..337d173 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4273,15 +4273,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 			if (*rmapp)
 				__rmap_write_protect(kvm, rmapp, false);
 
-			if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-				kvm_flush_remote_tlbs(kvm);
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
 				cond_resched_lock(&kvm->mmu_lock);
-			}
 		}
 	}
 
-	kvm_flush_remote_tlbs(kvm);
 	spin_unlock(&kvm->mmu_lock);
+
+	/*
+	 * We can flush all the TLBs out of the mmu lock without TLB
+	 * corruption since we just change the spte from writable to
+	 * readonly so that we only need to care the case of changing
+	 * spte from present to present (changing the spte from present
+	 * to nonpresent will flush all the TLBs immediately), in other
+	 * words, the only case we care is mmu_spte_update() where we
+	 * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+	 * instead of PT_WRITABLE_MASK, that means it does not depend
+	 * on PT_WRITABLE_MASK anymore.
+	 */
+	kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES			10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3aa650..573c6b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3571,11 +3571,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 		offset = i * BITS_PER_LONG;
 		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
 	}
-	if (is_dirty)
-		kvm_flush_remote_tlbs(kvm);
 
 	spin_unlock(&kvm->mmu_lock);
 
+	/*
+	 * All the TLBs can be flushed out of mmu lock, see the comments in
+	 * kvm_mmu_slot_remove_write_access().
+	 */
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
 	r = -EFAULT;
 	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
 		goto out;
Now we can flush all the TLBs out of the mmu lock without TLB corruption when write-protecting the sptes, because:

- we have marked large sptes readonly instead of dropping them, which means we just change the spte from writable to readonly, so we only need to care about the case of changing a spte from present to present (changing the spte from present to nonpresent will flush all the TLBs immediately); in other words, the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we have checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which means it does not depend on PT_WRITABLE_MASK anymore

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 18 ++++++++++++++----
 arch/x86/kvm/x86.c |  9 +++++++--
 2 files changed, 21 insertions(+), 6 deletions(-)