
KVM: Use precise range-based flush in mmu_notifier hooks when possible

Message ID 20240802191617.312752-1-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: Use precise range-based flush in mmu_notifier hooks when possible

Commit Message

Sean Christopherson Aug. 2, 2024, 7:16 p.m. UTC
Do arch-specific range-based TLB flushes (if they're supported) when
flushing in response to mmu_notifier events, as a single range-based flush
is almost always more performant.  This is especially true in the case of
mmu_notifier events, as the majority of events that hit a running VM
operate on a relatively small range of memory.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

This is *very* lightly tested, a thumbs up from the ARM world would be much
appreciated.

 virt/kvm/kvm_main.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)


base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f

Comments

Will Deacon Aug. 20, 2024, 3:41 p.m. UTC | #1
Hi Sean,

On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote:
> Do arch-specific range-based TLB flushes (if they're supported) when
> flushing in response to mmu_notifier events, as a single range-based flush
> is almost always more performant.  This is especially true in the case of
> mmu_notifier events, as the majority of events that hit a running VM
> operate on a relatively small range of memory.
> 
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> 
> This is *very* lightly tested, a thumbs up from the ARM world would be much
> appreciated.
> 
>  virt/kvm/kvm_main.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0788d0a72cc..46bb95d58d53 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  	struct kvm_gfn_range gfn_range;
>  	struct kvm_memory_slot *slot;
>  	struct kvm_memslots *slots;
> +	bool need_flush = false;
>  	int i, idx;
>  
>  	if (WARN_ON_ONCE(range->end <= range->start))
> @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  					goto mmu_unlock;
>  			}
>  			r.ret |= range->handler(kvm, &gfn_range);
> +
> +			/*
> +			 * Use a precise gfn-based TLB flush when possible, as
> +			 * most mmu_notifier events affect a small-ish range.
> +			 * Fall back to a full TLB flush if the gfn-based flush
> +			 * fails, and don't bother trying the gfn-based flush
> +			 * if a full flush is already pending.
> +			 */
> +			if (range->flush_on_ret && !need_flush && r.ret &&
> +			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> +							     gfn_range.end - gfn_range.start))
> +				need_flush = true;

Thanks for having a crack at this.

We could still do better in the ->clear_flush_young() case if the
handler could do the invalidation as part of its page-table walk (for
example, it could use information about the page-table structure such
as the level of the leaves to optimise the invalidation further), but
this does at least avoid zapping the whole VMID on CPUs with range
support.

My only slight concern is that, should clear_flush_young() be extended
to operate on more than a single page-at-a-time in future, this will
silently end up invalidating the entire VMID for each memslot unless we
teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.

Will
Sean Christopherson Aug. 20, 2024, 4:07 p.m. UTC | #2
On Tue, Aug 20, 2024, Will Deacon wrote:
> Hi Sean,
> 
> On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote:
> > Do arch-specific range-based TLB flushes (if they're supported) when
> > flushing in response to mmu_notifier events, as a single range-based flush
> > is almost always more performant.  This is especially true in the case of
> > mmu_notifier events, as the majority of events that hit a running VM
> > operate on a relatively small range of memory.
> > 
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Will Deacon <will@kernel.org>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > 
> > This is *very* lightly tested, a thumbs up from the ARM world would be much
> > appreciated.
> > 
> >  virt/kvm/kvm_main.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d0788d0a72cc..46bb95d58d53 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >  	struct kvm_gfn_range gfn_range;
> >  	struct kvm_memory_slot *slot;
> >  	struct kvm_memslots *slots;
> > +	bool need_flush = false;
> >  	int i, idx;
> >  
> >  	if (WARN_ON_ONCE(range->end <= range->start))
> > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> >  					goto mmu_unlock;
> >  			}
> >  			r.ret |= range->handler(kvm, &gfn_range);
> > +
> > +			/*
> > +			 * Use a precise gfn-based TLB flush when possible, as
> > +			 * most mmu_notifier events affect a small-ish range.
> > +			 * Fall back to a full TLB flush if the gfn-based flush
> > +			 * fails, and don't bother trying the gfn-based flush
> > +			 * if a full flush is already pending.
> > +			 */
> > +			if (range->flush_on_ret && !need_flush && r.ret &&
> > +			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > +							     gfn_range.end - gfn_range.start))
> > +				need_flush = true;
> 
> Thanks for having a crack at this.
> 
> We could still do better in the ->clear_flush_young() case if the

For clear_flush_young(), I 100% think we should let architectures opt out of the
flush.  For architectures where it's safe, the primary MMU doesn't do a TLB flush,
and hasn't for years.  Sending patches for this (for at least x86 and arm64) is
on my todo list.
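
E.g. a bare-bones sketch of the opt-out (completely untested, and the
kvm_arch_aging_needs_flush() name is invented on the spot purely for
illustration):

#ifndef kvm_arch_aging_needs_flush
/*
 * Untested sketch: architectures whose primary MMU already skips the
 * TLB flush when clearing Accessed bits can override this to return
 * false and elide the flush in the mmu_notifier path too.
 */
static inline bool kvm_arch_aging_needs_flush(void)
{
	return true;
}
#endif

/*
 * ...and the clear_flush_young() path would then use
 *
 *	.handler	= kvm_age_gfn,
 *	.flush_on_ret	= kvm_arch_aging_needs_flush(),
 *
 * instead of hardcoding flush_on_ret to true.
 */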

Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g.
if all KVM architectures can elide the flush.

And even better than that would be to kill pxxx_clear_flush_young_notify() in
the kernel, but I suspect that's not feasible as there are architectures that
require a TLB flush for correctness.

> handler could do the invalidation as part of its page-table walk (for
> example, it could use information about the page-table structure such
> as the level of the leaves to optimise the invalidation further), but
> this does at least avoid zapping the whole VMID on CPUs with range
> support.
> 
> My only slight concern is that, should clear_flush_young() be extended
> to operate on more than a single page-at-a-time in future, this will
> silently end up invalidating the entire VMID for each memslot unless we
> teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.

I'm not sure I follow the "entire VMID for each memslot" concern.  Are you
worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide
flush?
Will Deacon Aug. 20, 2024, 4:32 p.m. UTC | #3
On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote:
> On Tue, Aug 20, 2024, Will Deacon wrote:
> > On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote:
> > > Do arch-specific range-based TLB flushes (if they're supported) when
> > > flushing in response to mmu_notifier events, as a single range-based flush
> > > is almost always more performant.  This is especially true in the case of
> > > mmu_notifier events, as the majority of events that hit a running VM
> > > operate on a relatively small range of memory.
> > > 
> > > Cc: Marc Zyngier <maz@kernel.org>
> > > Cc: Will Deacon <will@kernel.org>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > > 
> > > This is *very* lightly tested, a thumbs up from the ARM world would be much
> > > appreciated.
> > > 
> > >  virt/kvm/kvm_main.c | 15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index d0788d0a72cc..46bb95d58d53 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > >  	struct kvm_gfn_range gfn_range;
> > >  	struct kvm_memory_slot *slot;
> > >  	struct kvm_memslots *slots;
> > > +	bool need_flush = false;
> > >  	int i, idx;
> > >  
> > >  	if (WARN_ON_ONCE(range->end <= range->start))
> > > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > >  					goto mmu_unlock;
> > >  			}
> > >  			r.ret |= range->handler(kvm, &gfn_range);
> > > +
> > > +			/*
> > > +			 * Use a precise gfn-based TLB flush when possible, as
> > > +			 * most mmu_notifier events affect a small-ish range.
> > > +			 * Fall back to a full TLB flush if the gfn-based flush
> > > +			 * fails, and don't bother trying the gfn-based flush
> > > +			 * if a full flush is already pending.
> > > +			 */
> > > +			if (range->flush_on_ret && !need_flush && r.ret &&
> > > +			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > > +							     gfn_range.end - gfn_range.start))
> > > +				need_flush = true;
> > 
> > Thanks for having a crack at this.
> > 
> > We could still do better in the ->clear_flush_young() case if the
> 
> For clear_flush_young(), I 100% think we should let architectures opt out of the
> flush.  For architectures where it's safe, the primary MMU doesn't do a TLB flush,
> and hasn't for years.  Sending patches for this (for at least x86 and arm64) is
> on my todo list.

I can see the appeal of dropping the invalidation altogether, although
with the zoo of different micro-architectures we have on arm64 I do
worry that it could potentially make the AF information a lot less useful on
some parts. Does x86 make any guarantees about when an old pte becomes
visible to the CPU in the absence of explicit TLB invalidation? (e.g.
I'm wondering if it's bounded by the next context switch or something
like that).

> Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g.
> if all KVM architectures can elide the flush.

I, for one, would love to see fewer MMU notifiers :)

> And even better than that would be to kill pxxx_clear_flush_young_notify() in
> the kernel, but I suspect that's not feasible as there are architectures that
> require a TLB flush for correctness.

I think you might want it for IOMMUs as well.

> > handler could do the invalidation as part of its page-table walk (for
> > example, it could use information about the page-table structure such
> > as the level of the leaves to optimise the invalidation further), but
> > this does at least avoid zapping the whole VMID on CPUs with range
> > support.
> > 
> > My only slight concern is that, should clear_flush_young() be extended
> > to operate on more than a single page-at-a-time in future, this will
> > silently end up invalidating the entire VMID for each memslot unless we
> > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.
> 
> I'm not sure I follow the "entire VMID for each memslot" concern.  Are you
> worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide
> flush?

The arm64 implementation of kvm_arch_flush_remote_tlbs_range()
unconditionally returns 0, so we could end up over-invalidating pretty
badly if that doesn't change. It should be straightforward to fix, but
I just wanted to point it out because it would be easy to miss too!

Will
Sean Christopherson Aug. 20, 2024, 5:06 p.m. UTC | #4
On Tue, Aug 20, 2024, Will Deacon wrote:
> On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 20, 2024, Will Deacon wrote:
> > > On Fri, Aug 02, 2024 at 12:16:17PM -0700, Sean Christopherson wrote:
> > > > Do arch-specific range-based TLB flushes (if they're supported) when
> > > > flushing in response to mmu_notifier events, as a single range-based flush
> > > > is almost always more performant.  This is especially true in the case of
> > > > mmu_notifier events, as the majority of events that hit a running VM
> > > > operate on a relatively small range of memory.
> > > > 
> > > > Cc: Marc Zyngier <maz@kernel.org>
> > > > Cc: Will Deacon <will@kernel.org>
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > > 
> > > > This is *very* lightly tested, a thumbs up from the ARM world would be much
> > > > appreciated.
> > > > 
> > > >  virt/kvm/kvm_main.c | 15 ++++++++++++++-
> > > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index d0788d0a72cc..46bb95d58d53 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -599,6 +599,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > > >  	struct kvm_gfn_range gfn_range;
> > > >  	struct kvm_memory_slot *slot;
> > > >  	struct kvm_memslots *slots;
> > > > +	bool need_flush = false;
> > > >  	int i, idx;
> > > >  
> > > >  	if (WARN_ON_ONCE(range->end <= range->start))
> > > > @@ -651,10 +652,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > > >  					goto mmu_unlock;
> > > >  			}
> > > >  			r.ret |= range->handler(kvm, &gfn_range);
> > > > +
> > > > +			/*
> > > > +			 * Use a precise gfn-based TLB flush when possible, as
> > > > +			 * most mmu_notifier events affect a small-ish range.
> > > > +			 * Fall back to a full TLB flush if the gfn-based flush
> > > > +			 * fails, and don't bother trying the gfn-based flush
> > > > +			 * if a full flush is already pending.
> > > > +			 */
> > > > +			if (range->flush_on_ret && !need_flush && r.ret &&
> > > > +			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > > > +							     gfn_range.end - gfn_range.start))
> > > > +				need_flush = true;
> > > 
> > > Thanks for having a crack at this.
> > > 
> > > We could still do better in the ->clear_flush_young() case if the
> > 
> > For clear_flush_young(), I 100% think we should let architectures opt out of the
> > flush.  For architectures where it's safe, the primary MMU doesn't do a TLB flush,
> > and hasn't for years.  Sending patches for this (for at least x86 and arm64) is
> > on my todo list.
> 
> I can see the appeal of dropping the invalidation altogether, although
> with the zoo of different micro-architectures we have on arm64 I do
> worry that it could potentially make the AF information a lot less useful on
> some parts. Does x86 make any guarantees about when an old pte becomes
> visible to the CPU in the absence of explicit TLB invalidation? (e.g.
> I'm wondering if it's bounded by the next context switch or something
> like that).

Nope.  As pointed out by Yu in the MGLRU series[*], the main argument is that if
there is memory pressure, i.e. a need for reclaim, then it's all but guaranteed
that there will be even more TLB pressure.  And while false negatives are certainly
possible, nr_scanned >> nr_reclaimed >> nr_bad_reclaims, i.e. the total cost of
false negatives is less than the total cost of the flushes because there are
orders of magnitude more pages scanned than there are pages that are incorrectly
reclaimed.

[*] https://lore.kernel.org/all/CAOUHufYCmYNngmS=rOSAQRB0N9ai+mA0aDrB9RopBvPHEK42Ng@mail.gmail.com

> 
> > Even better would be to kill off mmu_notifier_clear_flush_young() entirely, e.g.
> > if all KVM architectures can elide the flush.
> 
> I, for one, would love to see fewer MMU notifiers :)
> 
> > And even better than that would be to kill pxxx_clear_flush_young_notify() in
> > the kernel, but I suspect that's not feasible as there are architectures that
> > require a TLB flush for correctness.
> 
> I think you might want it for IOMMUs as well.
> 
> > > handler could do the invalidation as part of its page-table walk (for
> > > example, it could use information about the page-table structure such
> > > as the level of the leaves to optimise the invalidation further), but
> > > this does at least avoid zapping the whole VMID on CPUs with range
> > > support.
> > > 
> > > My only slight concern is that, should clear_flush_young() be extended
> > > to operate on more than a single page-at-a-time in future, this will
> > > silently end up invalidating the entire VMID for each memslot unless we
> > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.
> > 
> > I'm not sure I follow the "entire VMID for each memslot" concern.  Are you
> > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide
> > flush?
> 
> The arm64 implementation of kvm_arch_flush_remote_tlbs_range()
> unconditionally returns 0, so we could end up over-invalidating pretty
> badly if that doesn't change. It should be straightforward to fix, but
> I just wanted to point it out because it would be easy to miss too!

Sorry, I'm still not following.  0==success, and gfn_range.{start,end} is scoped
precisely to the overlap between the memslot and hva range.  Regardless of the
number of pages that are passed into clear_flush_young(), KVM should naturally
flush only the exact range being aged.  The only hiccup would be if the hva range
straddles multiple memslots, but if userspace creates multiple memslots for a
single vma, then that's a userspace problem.
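
I.e. the existing per-memslot scoping in __kvm_handle_hva_range() is roughly
the below (paraphrasing from memory, so treat it as a sketch rather than the
verbatim code):

	hva_start = max(range->start, slot->userspace_addr);
	hva_end = min(range->end,
		      slot->userspace_addr + (slot->npages << PAGE_SHIFT));

	/*
	 * Convert the clamped hva range to gfns; the range-based flush added
	 * by this patch operates on exactly this per-memslot overlap.
	 */
	gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
	gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
	gfn_range.slot = slot;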
Will Deacon Aug. 23, 2024, 12:15 p.m. UTC | #5
On Tue, Aug 20, 2024 at 10:06:00AM -0700, Sean Christopherson wrote:
> On Tue, Aug 20, 2024, Will Deacon wrote:
> > On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote:
> > > On Tue, Aug 20, 2024, Will Deacon wrote:
> > > > handler could do the invalidation as part of its page-table walk (for
> > > > example, it could use information about the page-table structure such
> > > > as the level of the leaves to optimise the invalidation further), but
> > > > this does at least avoid zapping the whole VMID on CPUs with range
> > > > support.
> > > > 
> > > > My only slight concern is that, should clear_flush_young() be extended
> > > > to operate on more than a single page-at-a-time in future, this will
> > > > silently end up invalidating the entire VMID for each memslot unless we
> > > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.
> > > 
> > > I'm not sure I follow the "entire VMID for each memslot" concern.  Are you
> > > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide
> > > flush?
> > 
> > The arm64 implementation of kvm_arch_flush_remote_tlbs_range()
> > unconditionally returns 0, so we could end up over-invalidating pretty
> > badly if that doesn't change. It should be straightforward to fix, but
> > I just wanted to point it out because it would be easy to miss too!
> 
> Sorry, I'm still not following.  0==success, and gfn_range.{start,end} is scoped
> precisely to the overlap between the memslot and hva range.  Regardless of the
> number of pages that are passed into clear_flush_young(), KVM should naturally
> flush only the exact range being aged.  The only hiccup would be if the hva range
> straddles multiple memslots, but if userspace creates multiple memslots for a
> single vma, then that's a userspace problem.

Fair enough, but it's not a lot of effort to fix this (untested diff
below) and if the code were to change in future so that
__kvm_handle_hva_range() was more commonly used to span multiple
memslots, we probably wouldn't otherwise notice the silent
over-invalidation for a while.

Will

--->8

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6981b1bc0946..1e34127f79b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -175,6 +175,9 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
 int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm,
                                      gfn_t gfn, u64 nr_pages)
 {
+       if (!system_supports_tlb_range())
+               return -EOPNOTSUPP;
+
        kvm_tlb_flush_vmid_range(&kvm->arch.mmu,
                                gfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
        return 0;
Sean Christopherson Aug. 23, 2024, 1:27 p.m. UTC | #6
On Fri, Aug 23, 2024, Will Deacon wrote:
> On Tue, Aug 20, 2024 at 10:06:00AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 20, 2024, Will Deacon wrote:
> > > On Tue, Aug 20, 2024 at 09:07:22AM -0700, Sean Christopherson wrote:
> > > > On Tue, Aug 20, 2024, Will Deacon wrote:
> > > > > handler could do the invalidation as part of its page-table walk (for
> > > > > example, it could use information about the page-table structure such
> > > > > as the level of the leaves to optimise the invalidation further), but
> > > > > this does at least avoid zapping the whole VMID on CPUs with range
> > > > > support.
> > > > > 
> > > > > My only slight concern is that, should clear_flush_young() be extended
> > > > > to operate on more than a single page-at-a-time in future, this will
> > > > > silently end up invalidating the entire VMID for each memslot unless we
> > > > > teach kvm_arch_flush_remote_tlbs_range() to return !0 in that case.
> > > > 
> > > > I'm not sure I follow the "entire VMID for each memslot" concern.  Are you
> > > > worried about kvm_arch_flush_remote_tlbs_range() failing and triggering a VM-wide
> > > > flush?
> > > 
> > > The arm64 implementation of kvm_arch_flush_remote_tlbs_range()
> > > unconditionally returns 0, so we could end up over-invalidating pretty
> > > badly if that doesn't change. It should be straightforward to fix, but
> > > I just wanted to point it out because it would be easy to miss too!
> > 
> > Sorry, I'm still not following.  0==success, and gfn_range.{start,end} is scoped
> > precisely to the overlap between the memslot and hva range.  Regardless of the
> > number of pages that are passed into clear_flush_young(), KVM should naturally
> > flush only the exact range being aged.  The only hiccup would be if the hva range
> > straddles multiple memslots, but if userspace creates multiple memslots for a
> > single vma, then that's a userspace problem.
> 
> Fair enough, but it's not a lot of effort to fix this (untested diff
> below) and if the code were to change in future so that
> __kvm_handle_hva_range() was more commonly used to span multiple
> memslots, we probably wouldn't otherwise notice the silent
> over-invalidation for a while.
> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 6981b1bc0946..1e34127f79b0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -175,6 +175,9 @@ int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
>  int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm,
>                                       gfn_t gfn, u64 nr_pages)
>  {
> +       if (!system_supports_tlb_range())
> +               return -EOPNOTSUPP;

Oooh, now your comments make a lot more sense.  I didn't catch on that range-based
flushing wasn't universally supported.

Agreed, not doing the above would be asinine.

> +
>         kvm_tlb_flush_vmid_range(&kvm->arch.mmu,
>                                 gfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
>         return 0;
>

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..46bb95d58d53 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -599,6 +599,7 @@  static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 	struct kvm_gfn_range gfn_range;
 	struct kvm_memory_slot *slot;
 	struct kvm_memslots *slots;
+	bool need_flush = false;
 	int i, idx;
 
 	if (WARN_ON_ONCE(range->end <= range->start))
@@ -651,10 +652,22 @@  static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 					goto mmu_unlock;
 			}
 			r.ret |= range->handler(kvm, &gfn_range);
+
+			/*
+			 * Use a precise gfn-based TLB flush when possible, as
+			 * most mmu_notifier events affect a small-ish range.
+			 * Fall back to a full TLB flush if the gfn-based flush
+			 * fails, and don't bother trying the gfn-based flush
+			 * if a full flush is already pending.
+			 */
+			if (range->flush_on_ret && !need_flush && r.ret &&
+			    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
+							     gfn_range.end - gfn_range.start))
+				need_flush = true;
 		}
 	}
 
-	if (range->flush_on_ret && r.ret)
+	if (need_flush)
 		kvm_flush_remote_tlbs(kvm);
 
 mmu_unlock: