Message ID | 20240802191617.312752-1-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: Use precise range-based flush in mmu_notifier hooks when possible | expand |
Hi Sean, On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote: > Do arch-specific range-based TLB flushes (if they're supported) when > flushing in response to mmu_notifier events, as a single range-based flush > is almost always more performant. This is especially true in the case of > mmu_notifier events, as the majority of events that hit a running VM > operate on a relatively small range of memory. > > Cc: Marc Zyngier <maz@kernel.org> > Cc: Will Deacon <will@kernel.org> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > > This is *very* lightly tested, a thumbs up from the ARM world would be much > appreciated. > > virt/kvm/kvm_main.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index d0788d0a72cc..46bb95d58d53 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > struct kvm_gfn_range gfn_range; > struct kvm_memory_slot *slot; > struct kvm_memslots *slots; > + bool need_flush = false; > int i, idx; > > if (WARN_ON_ONCE(range->end <= range->start)) > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > goto mmu_unlock; > } > r.ret |= range->handler(kvm, &gfn_range); > + > + /* > + * Use a precise gfn-based TLB flush when possible, as > + * most mmu_notifier events affect a small-ish range. > + * Fall back to a full TLB flush if the gfn-based flush > + * fails, and don't bother trying the gfn-based flush > + * if a full flush is already pending. > + */ > + if (range->flush_on_ret && !need_flush && r.ret && > + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, > + gfn_range.end - gfn_range.start)) > + need_flush = true; Thanks for having a crack at this. We could still do better in the ->clear_flush_young() case if the handler could do the invalidation as part of its page-table walk (for example, it could use information about the page-table structure such as the level of the leaves to optimise the invalidation further), but this does at least avoid zapping the whole VMID on CPUs with range support. My only slight concern is that, should clear_flush_young() be extended to operate on more than a single page-at-a-time in future, this will silently end up invalidating the entire VMID for each memslot unless we teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case. Will
On Tue, Aug 20, 2024, Will Deacon wrote: > Hi Sean, > > On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote: > > Do arch-specific range-based TLB flushes (if they're supported) when > > flushing in response to mmu_notifier events, as a single range-based flush > > is almost always more performant. This is especially true in the case of > > mmu_notifier events, as the majority of events that hit a running VM > > operate on a relatively small range of memory. > > > > Cc: Marc Zyngier <maz@kernel.org> > > Cc: Will Deacon <will@kernel.org> > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > --- > > > > This is *very* lightly tested, a thumbs up from the ARM world would be much > > appreciated. > > > > virt/kvm/kvm_main.c | 15 ++++++++++++++- > > 1 file changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index d0788d0a72cc..46bb95d58d53 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > > struct kvm_gfn_range gfn_range; > > struct kvm_memory_slot *slot; > > struct kvm_memslots *slots; > > + bool need_flush = false; > > int i, idx; > > > > if (WARN_ON_ONCE(range->end <= range->start)) > > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > > goto mmu_unlock; > > } > > r.ret |= range->handler(kvm, &gfn_range); > > + > > + /* > > + * Use a precise gfn-based TLB flush when possible, as > > + * most mmu_notifier events affect a small-ish range. > > + * Fall back to a full TLB flush if the gfn-based flush > > + * fails, and don't bother trying the gfn-based flush > > + * if a full flush is already pending. > > + */ > > + if (range->flush_on_ret && !need_flush && r.ret && > > + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, > > + gfn_range.end - gfn_range.start)) > > + need_flush = true; > > Thanks for having a crack at this. > > We could still do better in the ->clear_flush_young() case if the For clear_flush_young(), I 100% think we should let architectures opt out of the flush. For architectures where it's safe, the primary MMU doesn't do a TLB flush, and hasn't for years. Sending patches for this (for at least x86 and arm64) is on my todo list. Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g. if all KVM architectures can elide the flush. And even better than that would be to kill pxxx_clear_flush_young_notify() in the kernel, but I suspect that's not feasible as there are architectures that require a TLB flush for correctness. > handler could do the invalidation as part of its page-table walk (for > example, it could use information about the page-table structure such > as the level of the leaves to optimise the invalidation further), but > this does at least avoid zapping the whole VMID on CPUs with range > support. > > My only slight concern is that, should clear_flush_young() be extended > to operate on more than a single page-at-a-time in future, this will > silently end up invalidating the entire VMID for each memslot unless we > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case. I'm not sure I follow the "entire VMID for each memslot" concern. Are you worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide flush?
On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote: > On Tue, Aug 20, 2024, Will Deacon wrote: > > On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote: > > > Do arch-specific range-based TLB flushes (if they're supported) when > > > flushing in response to mmu_notifier events, as a single range-based flush > > > is almost always more performant. This is especially true in the case of > > > mmu_notifier events, as the majority of events that hit a running VM > > > operate on a relatively small range of memory. > > > > > > Cc: Marc Zyngier <maz@kernel.org> > > > Cc: Will Deacon <will@kernel.org> > > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > > --- > > > > > > This is *very* lightly tested, a thumbs up from the ARM world would be much > > > appreciated. > > > > > > virt/kvm/kvm_main.c | 15 ++++++++++++++- > > > 1 file changed, 14 insertions(+), 1 deletion(-) > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index d0788d0a72cc..46bb95d58d53 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > > > struct kvm_gfn_range gfn_range; > > > struct kvm_memory_slot *slot; > > > struct kvm_memslots *slots; > > > + bool need_flush = false; > > > int i, idx; > > > > > > if (WARN_ON_ONCE(range->end <= range->start)) > > > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > > > goto mmu_unlock; > > > } > > > r.ret |= range->handler(kvm, &gfn_range); > > > + > > > + /* > > > + * Use a precise gfn-based TLB flush when possible, as > > > + * most mmu_notifier events affect a small-ish range. > > > + * Fall back to a full TLB flush if the gfn-based flush > > > + * fails, and don't bother trying the gfn-based flush > > > + * if a full flush is already pending. > > > + */ > > > + if (range->flush_on_ret && !need_flush && r.ret && > > > + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, > > > + gfn_range.end - gfn_range.start)) > > > + need_flush = true; > > > > Thanks for having a crack at this. > > > > We could still do better in the ->clear_flush_young() case if the > > For clear_flush_young(), I 100% think we should let architectures opt out of the > flush. For architectures where it's safe, the primary MMU doesn't do a TLB flush, > and hasn't for years. Sending patches for this (for at least x86 and arm64) is > on my todo list. I can see the appeal of dropping the invalidation altogether, although with the zoo of different micro-architectures we have on arm64 I do worry that it could potentially make the AF information a lot useful on some parts. Does x86 make any guarantees about when an old pte becomes visible to the CPU in the absence of explicit TLB invalidation? (e.g. I'm wondering if it's bounded by the next context switch or something like that). > Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g. > if all KVM architectures can elide the flush. I, for one, would love to see fewer MMU notifiers :) > And even better than that would be to kill pxxx_clear_flush_young_notify() in > the kernel, but I suspect that's not feasible as there are architectures that > require a TLB flush for correctness. I think you might want it for IOMMUs as well. > > handler could do the invalidation as part of its page-table walk (for > > example, it could use information about the page-table structure such > > as the level of the leaves to optimise the invalidation further), but > > this does at least avoid zapping the whole VMID on CPUs with range > > support. > > > > My only slight concern is that, should clear_flush_young() be extended > > to operate on more than a single page-at-a-time in future, this will > > silently end up invalidating the entire VMID for each memslot unless we > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case. > > I'm not sure I follow the "entire VMID for each memslot" concern. Are you > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide > flush? The arm64 implementation of kvm_arch_flush_remote_tlbs_range() unconditionally returns 0, so we could end up over-invalidating pretty badly if that doesn't change. It should be straightforward to fix, but I just wanted to point it out because it would be easy to miss too! Will
On Tue, Aug 20, 2024, Will Deacon wrote: > On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote: > > On Tue, Aug 20, 2024, Will Deacon wrote: > > > On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote: > > > > Do arch-specific range-based TLB flushes (if they're supported) when > > > > flushing in response to mmu_notifier events, as a single range-based flush > > > > is almost always more performant. This is especially true in the case of > > > > mmu_notifier events, as the majority of events that hit a running VM > > > > operate on a relatively small range of memory. > > > > > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > Cc: Will Deacon <will@kernel.org> > > > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > > > --- > > > > > > > > This is *very* lightly tested, a thumbs up from the ARM world would be much > > > > appreciated. > > > > > > > > virt/kvm/kvm_main.c | 15 ++++++++++++++- > > > > 1 file changed, 14 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > > index d0788d0a72cc..46bb95d58d53 100644 > > > > --- a/virt/kvm/kvm_main.c > > > > +++ b/virt/kvm/kvm_main.c > > > > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > > > > struct kvm_gfn_range gfn_range; > > > > struct kvm_memory_slot *slot; > > > > struct kvm_memslots *slots; > > > > + bool need_flush = false; > > > > int i, idx; > > > > > > > > if (WARN_ON_ONCE(range->end <= range->start)) > > > > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > > > > goto mmu_unlock; > > > > } > > > > r.ret |= range->handler(kvm, &gfn_range); > > > > + > > > > + /* > > > > + * Use a precise gfn-based TLB flush when possible, as > > > > + * most mmu_notifier events affect a small-ish range. > > > > + * Fall back to a full TLB flush if the gfn-based flush > > > > + * fails, and don't bother trying the gfn-based flush > > > > + * if a full flush is already pending. > > > > + */ > > > > + if (range->flush_on_ret && !need_flush && r.ret && > > > > + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, > > > > + gfn_range.end - gfn_range.start)) > > > > + need_flush = true; > > > > > > Thanks for having a crack at this. > > > > > > We could still do better in the ->clear_flush_young() case if the > > > > For clear_flush_young(), I 100% think we should let architectures opt out of the > > flush. For architectures where it's safe, the primary MMU doesn't do a TLB flush, > > and hasn't for years. Sending patches for this (for at least x86 and arm64) is > > on my todo list. > > I can see the appeal of dropping the invalidation altogether, although > with the zoo of different micro-architectures we have on arm64 I do > worry that it could potentially make the AF information a lot useful on > some parts. Does x86 make any guarantees about when an old pte becomes > visible to the CPU in the absence of explicit TLB invalidation? (e.g. > I'm wondering if it's bounded by the next context switch or something > like that). Nope. As pointed out by Yu in the MGLRU series[*], the main argument is that if there is memory pressure, i.e. a need for reclaim, then it's all but guaranteed that there will be even more TLB pressure. And while false negatives are certainly possible, nr_scanned >> nr_reclaimed >> nr_bad_reclaims, i.e. the total cost of false negatives is less than the total cost of the flushes because there are orders of magnitude more pages scanned than there are pages that are incorrectly reclaimed. https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@mail.gmail.com > > > Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g. > > if all KVM architectures can elide the flush. > > I, for one, would love to see fewer MMU notifiers :) > > > And even better than that would be to kill pxxx_clear_flush_young_notify() in > > the kernel, but I suspect that's not feasible as there are architectures that > > require a TLB flush for correctness. > > I think you might want it for IOMMUs as well. > > > > handler could do the invalidation as part of its page-table walk (for > > > example, it could use information about the page-table structure such > > > as the level of the leaves to optimise the invalidation further), but > > > this does at least avoid zapping the whole VMID on CPUs with range > > > support. > > > > > > My only slight concern is that, should clear_flush_young() be extended > > > to operate on more than a single page-at-a-time in future, this will > > > silently end up invalidating the entire VMID for each memslot unless we > > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case. > > > > I'm not sure I follow the "entire VMID for each memslot" concern. Are you > > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide > > flush? > > The arm64 implementation of kvm_arch_flush_remote_tlbs_range() > unconditionally returns 0, so we could end up over-invalidating pretty > badly if that doesn't change. It should be straightforward to fix, but > I just wanted to point it out because it would be easy to miss too! Sorry, I'm still not following. 0==success, and gfn_range.{start,end} is scoped precisely to the overlap between the memslot and hva range. Regardless of the number of pages that are passed into clear_flush_young(), KVM should naturally flush only the exact range being aged. The only hiccup would be if the hva range straddles multiple memslots, but if userspace creates multiple memslots for a single vma, then that's a userspace problem.
On Tue, Aug 20, 2024 at 10:06:00AM -0700, Sean Christopherson wrote: > On Tue, Aug 20, 2024, Will Deacon wrote: > > On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote: > > > On Tue, Aug 20, 2024, Will Deacon wrote: > > > > handler could do the invalidation as part of its page-table walk (for > > > > example, it could use information about the page-table structure such > > > > as the level of the leaves to optimise the invalidation further), but > > > > this does at least avoid zapping the whole VMID on CPUs with range > > > > support. > > > > > > > > My only slight concern is that, should clear_flush_young() be extended > > > > to operate on more than a single page-at-a-time in future, this will > > > > silently end up invalidating the entire VMID for each memslot unless we > > > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case. > > > > > > I'm not sure I follow the "entire VMID for each memslot" concern. Are you > > > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide > > > flush? > > > > The arm64 implementation of kvm_arch_flush_remote_tlbs_range() > > unconditionally returns 0, so we could end up over-invalidating pretty > > badly if that doesn't change. It should be straightforward to fix, but > > I just wanted to point it out because it would be easy to miss too! > > Sorry, I'm still not following. 0==success, and gfn_range.{start,end} is scoped > precisely to the overlap between the memslot and hva range. Regardless of the > number of pages that are passed into clear_flush_young(), KVM should naturally > flush only the exact range being aged. The only hiccup would be if the hva range > straddles multiple memslots, but if userspace creates multiple memslots for a > single vma, then that's a userspace problem. Fair enough, but it's not a lot of effort to fix this (untested diff below) and if the code were to change in future so that __kvm_handle_hva_range() was more commonly used to span multiple memslots we probably wouldn't otherwise notice the silent over-invalidation for a while. Will --->8 diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 6981b1bc0946..1e34127f79b0 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -175,6 +175,9 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm) int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages) { + if (!system_supports_tlb_range()) + return -EOPNOTSUPP; + kvm_tlb_flush_vmid_range(&kvm->arch.mmu, gfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT); return 0;
On Fri, Aug 23, 2024, Will Deacon wrote: > On Tue, Aug 20, 2024 at 10:06:00AM -0700, Sean Christopherson wrote: > > On Tue, Aug 20, 2024, Will Deacon wrote: > > > On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote: > > > > On Tue, Aug 20, 2024, Will Deacon wrote: > > > > > handler could do the invalidation as part of its page-table walk (for > > > > > example, it could use information about the page-table structure such > > > > > as the level of the leaves to optimise the invalidation further), but > > > > > this does at least avoid zapping the whole VMID on CPUs with range > > > > > support. > > > > > > > > > > My only slight concern is that, should clear_flush_young() be extended > > > > > to operate on more than a single page-at-a-time in future, this will > > > > > silently end up invalidating the entire VMID for each memslot unless we > > > > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case. > > > > > > > > I'm not sure I follow the "entire VMID for each memslot" concern. Are you > > > > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide > > > > flush? > > > > > > The arm64 implementation of kvm_arch_flush_remote_tlbs_range() > > > unconditionally returns 0, so we could end up over-invalidating pretty > > > badly if that doesn't change. It should be straightforward to fix, but > > > I just wanted to point it out because it would be easy to miss too! > > > > Sorry, I'm still not following. 0==success, and gfn_range.{start,end} is scoped > > precisely to the overlap between the memslot and hva range. Regardless of the > > number of pages that are passed into clear_flush_young(), KVM should naturally > > flush only the exact range being aged. The only hiccup would be if the hva range > > straddles multiple memslots, but if userspace creates multiple memslots for a > > single vma, then that's a userspace problem. > > Fair enough, but it's not a lot of effort to fix this (untested diff > below) and if the code were to change in future so that > __kvm_handle_hva_range() was more commonly used to span multiple > memslots we probably wouldn't otherwise notice the silent > over-invalidation for a while. > > Will > > --->8 > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 6981b1bc0946..1e34127f79b0 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -175,6 +175,9 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm) > int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, > gfn_t gfn, u64 nr_pages) > { > + if (!system_supports_tlb_range()) > + return -EOPNOTSUPP; Oooh, now your comments make a lot more sense. I didn't catch on that range-based flushing wasn't universally supported. Agreed, not doing the above would be asinine. > + > kvm_tlb_flush_vmid_range(&kvm->arch.mmu, > gfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT); > return 0; >
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d0788d0a72cc..46bb95d58d53 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, struct kvm_gfn_range gfn_range; struct kvm_memory_slot *slot; struct kvm_memslots *slots; + bool need_flush = false; int i, idx; if (WARN_ON_ONCE(range->end <= range->start)) @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, goto mmu_unlock; } r.ret |= range->handler(kvm, &gfn_range); + + /* + * Use a precise gfn-based TLB flush when possible, as + * most mmu_notifier events affect a small-ish range. + * Fall back to a full TLB flush if the gfn-based flush + * fails, and don't bother trying the gfn-based flush + * if a full flush is already pending. + */ + if (range->flush_on_ret && !need_flush && r.ret && + kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start, + gfn_range.end - gfn_range.start)) + need_flush = true; } } - if (range->flush_on_ret && r.ret) + if (need_flush) kvm_flush_remote_tlbs(kvm); mmu_unlock:
Do arch-specific range-based TLB flushes (if they're supported) when flushing in response to mmu_notifier events, as a single range-based flush is almost always more performant. This is especially true in the case of mmu_notifier events, as the majority of events that hit a running VM operate on a relatively small range of memory. Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> --- This is *very* lightly tested, a thumbs up from the ARM world would be much appreciated. virt/kvm/kvm_main.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f