[v9,4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

Message ID	20221025151344.3784230-5-chao.p.peng@linux.intel.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Chao Peng <chao.p.peng@linux.intel.com> To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini <pbonzini@redhat.com>, Jonathan Corbet <corbet@lwn.net>, Sean Christopherson <seanjc@google.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Wanpeng Li <wanpengli@tencent.com>, Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>, "J . Bruce Fields" <bfields@fieldses.org>, Andrew Morton <akpm@linux-foundation.org>, Shuah Khan <shuah@kernel.org>, Mike Rapoport <rppt@kernel.org>, Steven Price <steven.price@arm.com>, "Maciej S . Szmigiero" <mail@maciej.szmigiero.name>, Vlastimil Babka <vbabka@suse.cz>, Vishal Annapurve <vannapurve@google.com>, Yu Zhang <yu.c.zhang@linux.intel.com>, Chao Peng <chao.p.peng@linux.intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret <qperret@google.com>, tabba@google.com, Michael Roth <michael.roth@amd.com>, mhocko@suse.com, Muchun Song <songmuchun@bytedance.com>, wei.w.wang@intel.com Subject: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Date: Tue, 25 Oct 2022 23:13:40 +0800 Message-Id: <20221025151344.3784230-5-chao.p.peng@linux.intel.com> In-Reply-To: <20221025151344.3784230-1-chao.p.peng@linux.intel.com> References: <20221025151344.3784230-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	KVM: mm: fd-based approach for supporting KVM \| expand [v9,0/8] KVM: mm: fd-based approach for supporting KVM [v9,1/8] mm: Introduce memfd_restricted system call to create restricted user memory [v9,2/8] KVM: Extend the memslot to support fd-based private memory [v9,3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit [v9,4/8] KVM: Use gfn instead of hva for mmu_notifier_retry [v9,5/8] KVM: Register/unregister the guest private memory regions [v9,6/8] KVM: Update lpage info when private/shared memory are mixed [v9,7/8] KVM: Handle page fault for private memory [v9,8/8] KVM: Enable and expose KVM_MEM_PRIVATE

Chao Peng Oct. 25, 2022, 3:13 p.m. UTC

Currently in mmu_notifier validate path, hva range is recorded and then
checked against in the mmu_notifier_retry_hva() of the page fault path.
However, for the to be introduced private memory, a page fault may not
have a hva associated, checking gfn(gpa) makes more sense.

For existing non private memory case, gfn is expected to continue to
work. The only downside is when aliasing multiple gfns to a single hva,
the current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.

It also fixes a bug in kvm_zap_gfn_range() which has already been using
gfn when calling kvm_mmu_invalidate_begin/end() while these functions
accept hva in current code.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 18 +++++++---------
 virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++--------------
 3 files changed, 39 insertions(+), 26 deletions(-)

Fuad Tabba Oct. 27, 2022, 10:29 a.m. UTC | #1

Hi,

On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Currently in mmu_notifier validate path, hva range is recorded and then
> checked against in the mmu_notifier_retry_hva() of the page fault path.
> However, for the to be introduced private memory, a page fault may not
> have a hva associated, checking gfn(gpa) makes more sense.
>
> For existing non private memory case, gfn is expected to continue to
> work. The only downside is when aliasing multiple gfns to a single hva,
> the current algorithm of checking multiple ranges could result in a much
> larger range being rejected. Such aliasing should be uncommon, so the
> impact is expected small.
>
> It also fixes a bug in kvm_zap_gfn_range() which has already been using

nit: Now it's kvm_unmap_gfn_range().

> gfn when calling kvm_mmu_invalidate_begin/end() while these functions
> accept hva in current code.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---

Based on reading this code and my limited knowledge of the x86 MMU code:
Reviewed-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad


>  arch/x86/kvm/mmu/mmu.c   |  2 +-
>  include/linux/kvm_host.h | 18 +++++++---------
>  virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++--------------
>  3 files changed, 39 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f81539061d6..33b1aec44fb8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4217,7 +4217,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>                 return true;
>
>         return fault->slot &&
> -              mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +              mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
>  }
>
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 739a7562a1f3..79e5cbc35fcf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -775,8 +775,8 @@ struct kvm {
>         struct mmu_notifier mmu_notifier;
>         unsigned long mmu_invalidate_seq;
>         long mmu_invalidate_in_progress;
> -       unsigned long mmu_invalidate_range_start;
> -       unsigned long mmu_invalidate_range_end;
> +       gfn_t mmu_invalidate_range_start;
> +       gfn_t mmu_invalidate_range_end;
>  #endif
>         struct list_head devices;
>         u64 manual_dirty_log_protect;
> @@ -1365,10 +1365,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -                             unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -                           unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
>
>  long kvm_arch_dev_ioctl(struct file *filp,
>                         unsigned int ioctl, unsigned long arg);
> @@ -1937,9 +1935,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>         return 0;
>  }
>
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>                                            unsigned long mmu_seq,
> -                                          unsigned long hva)
> +                                          gfn_t gfn)
>  {
>         lockdep_assert_held(&kvm->mmu_lock);
>         /*
> @@ -1949,8 +1947,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>          * positives, due to shortcuts when handing concurrent invalidations.
>          */
>         if (unlikely(kvm->mmu_invalidate_in_progress) &&
> -           hva >= kvm->mmu_invalidate_range_start &&
> -           hva < kvm->mmu_invalidate_range_end)
> +           gfn >= kvm->mmu_invalidate_range_start &&
> +           gfn < kvm->mmu_invalidate_range_end)
>                 return 1;
>         if (kvm->mmu_invalidate_seq != mmu_seq)
>                 return 1;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8dace78a0278..09c9cdeb773c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>
>  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -                            unsigned long end);
> +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
>
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
> @@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>                                 locked = true;
>                                 KVM_MMU_LOCK(kvm);
>                                 if (!IS_KVM_NULL_FN(range->on_lock))
> -                                       range->on_lock(kvm, range->start, range->end);
> +                                       range->on_lock(kvm, gfn_range.start,
> +                                                           gfn_range.end);
>                                 if (IS_KVM_NULL_FN(range->handler))
>                                         break;
>                         }
> @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -                             unsigned long end)
> +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> +                                                           gfn_t end)
>  {
> -       /*
> -        * The count increase must become visible at unlock time as no
> -        * spte can be established without taking the mmu_lock and
> -        * count is also read inside the mmu_lock critical section.
> -        */
> -       kvm->mmu_invalidate_in_progress++;
>         if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>                 kvm->mmu_invalidate_range_start = start;
>                 kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>         }
>  }
>
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       /*
> +        * The count increase must become visible at unlock time as no
> +        * spte can be established without taking the mmu_lock and
> +        * count is also read inside the mmu_lock critical section.
> +        */
> +       kvm->mmu_invalidate_in_progress++;
> +}
> +
> +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +       update_invalidate_range(kvm, range->start, range->end);
> +       return kvm_unmap_gfn_range(kvm, range);
> +}
> +
> +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +       mark_invalidate_in_progress(kvm, start, end);
> +       update_invalidate_range(kvm, start, end);
> +}
> +
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
>  {
> @@ -752,8 +768,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 .start          = range->start,
>                 .end            = range->end,
>                 .pte            = __pte(0),
> -               .handler        = kvm_unmap_gfn_range,
> -               .on_lock        = kvm_mmu_invalidate_begin,
> +               .handler        = kvm_mmu_handle_gfn_range,
> +               .on_lock        = mark_invalidate_in_progress,
>                 .on_unlock      = kvm_arch_guest_memory_reclaimed,
>                 .flush_on_ret   = true,
>                 .may_block      = mmu_notifier_range_blockable(range),
> @@ -791,8 +807,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>         return 0;
>  }
>
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -                           unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
>  {
>         /*
>          * This sequence increase will notify the kvm page fault that
> --
> 2.25.1
>

Chao Peng Nov. 4, 2022, 2:28 a.m. UTC | #2

On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Currently in mmu_notifier validate path, hva range is recorded and then
> > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > However, for the to be introduced private memory, a page fault may not
> > have a hva associated, checking gfn(gpa) makes more sense.
> >
> > For existing non private memory case, gfn is expected to continue to
> > work. The only downside is when aliasing multiple gfns to a single hva,
> > the current algorithm of checking multiple ranges could result in a much
> > larger range being rejected. Such aliasing should be uncommon, so the
> > impact is expected small.
> >
> > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> 
> nit: Now it's kvm_unmap_gfn_range().

Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
kvm_mmu_invalidate_begin/end with a gfn range but before this series
kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
in the following patch (patch 05).

Thanks,
Chao
> 
> > gfn when calling kvm_mmu_invalidate_begin/end() while these functions
> > accept hva in current code.
> >
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> Based on reading this code and my limited knowledge of the x86 MMU code:
> Reviewed-by: Fuad Tabba <tabba@google.com>
> 
> Cheers,
> /fuad
> 
> 
> >  arch/x86/kvm/mmu/mmu.c   |  2 +-
> >  include/linux/kvm_host.h | 18 +++++++---------
> >  virt/kvm/kvm_main.c      | 45 ++++++++++++++++++++++++++--------------
> >  3 files changed, 39 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f81539061d6..33b1aec44fb8 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4217,7 +4217,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> >                 return true;
> >
> >         return fault->slot &&
> > -              mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > +              mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> >  }
> >
> >  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 739a7562a1f3..79e5cbc35fcf 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -775,8 +775,8 @@ struct kvm {
> >         struct mmu_notifier mmu_notifier;
> >         unsigned long mmu_invalidate_seq;
> >         long mmu_invalidate_in_progress;
> > -       unsigned long mmu_invalidate_range_start;
> > -       unsigned long mmu_invalidate_range_end;
> > +       gfn_t mmu_invalidate_range_start;
> > +       gfn_t mmu_invalidate_range_end;
> >  #endif
> >         struct list_head devices;
> >         u64 manual_dirty_log_protect;
> > @@ -1365,10 +1365,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> >  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  #endif
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -                             unsigned long end);
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -                           unsigned long end);
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
> >
> >  long kvm_arch_dev_ioctl(struct file *filp,
> >                         unsigned int ioctl, unsigned long arg);
> > @@ -1937,9 +1935,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
> >         return 0;
> >  }
> >
> > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> >                                            unsigned long mmu_seq,
> > -                                          unsigned long hva)
> > +                                          gfn_t gfn)
> >  {
> >         lockdep_assert_held(&kvm->mmu_lock);
> >         /*
> > @@ -1949,8 +1947,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> >          * positives, due to shortcuts when handing concurrent invalidations.
> >          */
> >         if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > -           hva >= kvm->mmu_invalidate_range_start &&
> > -           hva < kvm->mmu_invalidate_range_end)
> > +           gfn >= kvm->mmu_invalidate_range_start &&
> > +           gfn < kvm->mmu_invalidate_range_end)
> >                 return 1;
> >         if (kvm->mmu_invalidate_seq != mmu_seq)
> >                 return 1;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 8dace78a0278..09c9cdeb773c 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> >
> >  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
> >
> > -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> > -                            unsigned long end);
> > +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
> >
> >  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
> >
> > @@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> >                                 locked = true;
> >                                 KVM_MMU_LOCK(kvm);
> >                                 if (!IS_KVM_NULL_FN(range->on_lock))
> > -                                       range->on_lock(kvm, range->start, range->end);
> > +                                       range->on_lock(kvm, gfn_range.start,
> > +                                                           gfn_range.end);
> >                                 if (IS_KVM_NULL_FN(range->handler))
> >                                         break;
> >                         }
> > @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >         kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -                             unsigned long end)
> > +static inline void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > +                                                           gfn_t end)
> >  {
> > -       /*
> > -        * The count increase must become visible at unlock time as no
> > -        * spte can be established without taking the mmu_lock and
> > -        * count is also read inside the mmu_lock critical section.
> > -        */
> > -       kvm->mmu_invalidate_in_progress++;
> >         if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> >                 kvm->mmu_invalidate_range_start = start;
> >                 kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> >         }
> >  }
> >
> > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +       /*
> > +        * The count increase must become visible at unlock time as no
> > +        * spte can be established without taking the mmu_lock and
> > +        * count is also read inside the mmu_lock critical section.
> > +        */
> > +       kvm->mmu_invalidate_in_progress++;
> > +}
> > +
> > +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > +       update_invalidate_range(kvm, range->start, range->end);
> > +       return kvm_unmap_gfn_range(kvm, range);
> > +}
> > +
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +       mark_invalidate_in_progress(kvm, start, end);
> > +       update_invalidate_range(kvm, start, end);
> > +}
> > +
> >  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >                                         const struct mmu_notifier_range *range)
> >  {
> > @@ -752,8 +768,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >                 .start          = range->start,
> >                 .end            = range->end,
> >                 .pte            = __pte(0),
> > -               .handler        = kvm_unmap_gfn_range,
> > -               .on_lock        = kvm_mmu_invalidate_begin,
> > +               .handler        = kvm_mmu_handle_gfn_range,
> > +               .on_lock        = mark_invalidate_in_progress,
> >                 .on_unlock      = kvm_arch_guest_memory_reclaimed,
> >                 .flush_on_ret   = true,
> >                 .may_block      = mmu_notifier_range_blockable(range),
> > @@ -791,8 +807,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >         return 0;
> >  }
> >
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -                           unsigned long end)
> > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end)
> >  {
> >         /*
> >          * This sequence increase will notify the kvm page fault that
> > --
> > 2.25.1
> >

Sean Christopherson Nov. 4, 2022, 10:29 p.m. UTC | #3

On Fri, Nov 04, 2022, Chao Peng wrote:
> On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> > Hi,
> > 
> > On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > >
> > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > > However, for the to be introduced private memory, a page fault may not
> > > have a hva associated, checking gfn(gpa) makes more sense.
> > >
> > > For existing non private memory case, gfn is expected to continue to
> > > work. The only downside is when aliasing multiple gfns to a single hva,
> > > the current algorithm of checking multiple ranges could result in a much
> > > larger range being rejected. Such aliasing should be uncommon, so the
> > > impact is expected small.
> > >
> > > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> > 
> > nit: Now it's kvm_unmap_gfn_range().
> 
> Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
> kvm_mmu_invalidate_begin/end with a gfn range but before this series
> kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
> unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
> in the following patch (patch 05).

Grr, in the future, if you find an existing bug, please send a patch.  At the
very least, report the bug.  The APICv case that this was added for could very
well be broken because of this, and the resulting failures would be an absolute
nightmare to debug.

Compile tested only...

--
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 4 Nov 2022 22:20:33 +0000
Subject: [PATCH] KVM: x86/mmu: Block all page faults during
 kvm_zap_gfn_range()

When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated
range to effectively block all page faults while the zap is in-progress.
The invalidation helpers take a host virtual address, whereas zapping a
GFN obviously provides a guest physical address and with the wrong unit
of measurement (frame vs. byte).

Alternatively, KVM could walk all memslots to get the associated HVAs,
but thanks to SMM, that would require multiple lookups.  And practically
speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path,
e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0
during boot, and APICv inhibits are similarly infrequent operations.

Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f81539061d6..1ccb769f62af 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6056,7 +6056,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6070,7 +6070,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
 						   gfn_end - gfn_start);
 
-	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_end(kvm, 0, -1ul);
 
 	write_unlock(&kvm->mmu_lock);
 }

base-commit: c12879206e47730ff5ab255bbf625b28ade4028f
--

Chao Peng Nov. 8, 2022, 7:16 a.m. UTC | #4

On Fri, Nov 04, 2022 at 10:29:48PM +0000, Sean Christopherson wrote:
> On Fri, Nov 04, 2022, Chao Peng wrote:
> > On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> > > Hi,
> > > 
> > > On Tue, Oct 25, 2022 at 4:19 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > > > However, for the to be introduced private memory, a page fault may not
> > > > have a hva associated, checking gfn(gpa) makes more sense.
> > > >
> > > > For existing non private memory case, gfn is expected to continue to
> > > > work. The only downside is when aliasing multiple gfns to a single hva,
> > > > the current algorithm of checking multiple ranges could result in a much
> > > > larger range being rejected. Such aliasing should be uncommon, so the
> > > > impact is expected small.
> > > >
> > > > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> > > 
> > > nit: Now it's kvm_unmap_gfn_range().
> > 
> > Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
> > kvm_mmu_invalidate_begin/end with a gfn range but before this series
> > kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
> > unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
> > in the following patch (patch 05).
> 
> Grr, in the future, if you find an existing bug, please send a patch.  At the
> very least, report the bug.

Agreed, this can be sent out separately from this series.

> The APICv case that this was added for could very
> well be broken because of this, and the resulting failures would be an absolute
> nightmare to debug.

Given the apicv_inhibit should be rare, the change looks good to me.
Just to be clear, your will send out this fix, right?

Chao

> 
> Compile tested only...
> 
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 4 Nov 2022 22:20:33 +0000
> Subject: [PATCH] KVM: x86/mmu: Block all page faults during
>  kvm_zap_gfn_range()
> 
> When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated
> range to effectively block all page faults while the zap is in-progress.
> The invalidation helpers take a host virtual address, whereas zapping a
> GFN obviously provides a guest physical address and with the wrong unit
> of measurement (frame vs. byte).
> 
> Alternatively, KVM could walk all memslots to get the associated HVAs,
> but thanks to SMM, that would require multiple lookups.  And practically
> speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path,
> e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0
> during boot, and APICv inhibits are similarly infrequent operations.
> 
> Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
> Cc: stable@vger.kernel.org
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f81539061d6..1ccb769f62af 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6056,7 +6056,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_begin(kvm, 0, -1ul);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6070,7 +6070,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
>  						   gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_end(kvm, 0, -1ul);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> 
> base-commit: c12879206e47730ff5ab255bbf625b28ade4028f
> --

Sean Christopherson Nov. 10, 2022, 5:53 p.m. UTC | #5

On Tue, Nov 08, 2022, Chao Peng wrote:
> On Fri, Nov 04, 2022 at 10:29:48PM +0000, Sean Christopherson wrote:
> > The APICv case that this was added for could very well be broken because of
> > this, and the resulting failures would be an absolute nightmare to debug.
> 
> Given the apicv_inhibit should be rare, the change looks good to me.
> Just to be clear, your will send out this fix, right?

Ya, I'll post an official patch.

Sean Christopherson Nov. 10, 2022, 8:06 p.m. UTC | #6

On Tue, Oct 25, 2022, Chao Peng wrote:
> @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end)
> +static inline

Don't tag static functions with "inline" unless they're in headers, in which case
the inline is effectively required.  In pretty much every scenario, the compiler
can do a better job of optimizing inline vs. non-inline, i.e. odds are very good
the compiler would inline this helper anyways, and if not, there would likely be
a good reason not to inline it.

It'll be a moot point in this case (more below), but this would also reduce the
line length and avoid the wrap.

> void update_invalidate_range(struct kvm *kvm, gfn_t start,
> +							    gfn_t end)

I appreciate the effort to make this easier to read, but making such a big divergence
from the kernel's preferred formatting is often counter-productive, e.g. I blinked a
few times when first reading this code.

Again, moot point this time (still below ;-) ), but for future reference, better
options are to either let the line poke out or simply wrap early to get the
bundling of parameters that you want, e.g.

  static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, gfn_t end)

or 

  static inline void update_invalidate_range(struct kvm *kvm,
					     gfn_t start, gfn_t end)

>  {
> -	/*
> -	 * The count increase must become visible at unlock time as no
> -	 * spte can be established without taking the mmu_lock and
> -	 * count is also read inside the mmu_lock critical section.
> -	 */
> -	kvm->mmu_invalidate_in_progress++;
>  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>  		kvm->mmu_invalidate_range_start = start;
>  		kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	}
>  }
>  
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)

Splitting the helpers this way yields a weird API overall, e.g. it's possible
(common, actually) to have an "end" without a "begin".

Taking the range in the "end" is also dangerous/misleading/imbalanced, because _if_
there are multiple ranges in a batch, each range would need to be unwound
independently, e.g. the invocation of the "end" helper in
kvm_mmu_notifier_invalidate_range_end() is flat out wrong, it just doesn't cause
problems because KVM doesn't (currently) try to unwind regions (and probably never
will, but that's beside the point).

Rather than shunt what is effectively the "begin" into a separate helper, provide
three separate APIs, e.g. begin, range_add, end.  That way, begin+end don't take a
range and thus are symmetrical, always paired, and can't screw up unwinding since
they don't have a range to unwind.

It'll require three calls in every case, but that's not the end of the world since
none of these flows are super hot paths.

> +{
> +	/*
> +	 * The count increase must become visible at unlock time as no
> +	 * spte can be established without taking the mmu_lock and
> +	 * count is also read inside the mmu_lock critical section.
> +	 */
> +	kvm->mmu_invalidate_in_progress++;

This should invalidate (ha!) mmu_invalidate_range_{start,end}, and then WARN in
mmu_invalidate_retry() if the range isn't valid.  And the "add" helper should
WARN if mmu_invalidate_in_progress == 0.

> +}
> +
> +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)

"handle" is waaaay too generic.  Just match kvm_unmap_gfn_range() and call it
kvm_mmu_unmap_gfn_range().  This is a local function so it's unlikely to collide
with arch code, now or in the future.

> +{
> +	update_invalidate_range(kvm, range->start, range->end);
> +	return kvm_unmap_gfn_range(kvm, range);
> +}

Overall, this?  Compile tested only...

---
 arch/x86/kvm/mmu/mmu.c   |  8 +++++---
 include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
 virt/kvm/kvm_main.c      | 30 +++++++++++++++++++++---------
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 93c389eaf471..d4b373e3e524 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 		return true;
 
 	return fault->slot &&
-	       mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+	       mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_begin(kvm);
+
+	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
 						   gfn_end - gfn_start);
 
-	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+	kvm_mmu_invalidate_end(kvm);
 
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e6e66c5e56f2..29aa6d6827cc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -778,8 +778,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_invalidate_seq;
 	long mmu_invalidate_in_progress;
-	unsigned long mmu_invalidate_range_start;
-	unsigned long mmu_invalidate_range_end;
+	gfn_t mmu_invalidate_range_start;
+	gfn_t mmu_invalidate_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
@@ -1378,10 +1378,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -1952,9 +1951,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 	return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 					   unsigned long mmu_seq,
-					   unsigned long hva)
+					   gfn_t gfn)
 {
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
@@ -1963,10 +1962,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	 * that might be being invalidated. Note that it may include some false
 	 * positives, due to shortcuts when handing concurrent invalidations.
 	 */
-	if (unlikely(kvm->mmu_invalidate_in_progress) &&
-	    hva >= kvm->mmu_invalidate_range_start &&
-	    hva < kvm->mmu_invalidate_range_end)
-		return 1;
+	if (unlikely(kvm->mmu_invalidate_in_progress)) {
+		/*
+		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
+		 * but before updating the range is a KVM bug.
+		 */
+		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
+				 kvm->mmu_invalidate_range_end == INVALID_GPA))
+			return 1;
+
+		if (gfn >= kvm->mmu_invalidate_range_start &&
+		    gfn < kvm->mmu_invalidate_range_end)
+			return 1;
+	}
+
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 43bbe4fde078..e9e03b979f77 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-			     unsigned long end);
-
+typedef void (*on_lock_fn_t)(struct kvm *kvm);
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_hva_range {
@@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 				locked = true;
 				KVM_MMU_LOCK(kvm);
 				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm, range->start, range->end);
+					range->on_lock(kvm);
+
 				if (IS_KVM_NULL_FN(range->handler))
 					break;
 			}
@@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -724,6 +722,15 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_invalidate_in_progress++;
+
+	kvm->mmu_invalidate_range_start = INVALID_GPA;
+	kvm->mmu_invalidate_range_end = INVALID_GPA;
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
 	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
 		kvm->mmu_invalidate_range_start = start;
 		kvm->mmu_invalidate_range_end = end;
@@ -744,6 +751,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 	}
 }
 
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+	return kvm_unmap_gfn_range(kvm, range);
+}
+
 static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
@@ -752,7 +765,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
 		.flush_on_ret	= true,
@@ -791,8 +804,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that

base-commit: d663b8a285986072428a6a145e5994bc275df994
--

Chao Peng Nov. 11, 2022, 8:27 a.m. UTC | #7

On Thu, Nov 10, 2022 at 08:06:33PM +0000, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >  
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > -			      unsigned long end)
> > +static inline
> 
> Don't tag static functions with "inline" unless they're in headers, in which case
> the inline is effectively required.  In pretty much every scenario, the compiler
> can do a better job of optimizing inline vs. non-inline, i.e. odds are very good
> the compiler would inline this helper anyways, and if not, there would likely be
> a good reason not to inline it.

Yep, I know the rationale behind, I made a mistake.

> 
> It'll be a moot point in this case (more below), but this would also reduce the
> line length and avoid the wrap.
> 
> > void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > +							    gfn_t end)
> 
> I appreciate the effort to make this easier to read, but making such a big divergence
> from the kernel's preferred formatting is often counter-productive, e.g. I blinked a
> few times when first reading this code.
> 
> Again, moot point this time (still below ;-) ), but for future reference, better
> options are to either let the line poke out or simply wrap early to get the
> bundling of parameters that you want, e.g.
> 
>   static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, gfn_t end)
> 
> or 
> 
>   static inline void update_invalidate_range(struct kvm *kvm,
> 					     gfn_t start, gfn_t end)

Fully agreed.

> 
> >  {
> > -	/*
> > -	 * The count increase must become visible at unlock time as no
> > -	 * spte can be established without taking the mmu_lock and
> > -	 * count is also read inside the mmu_lock critical section.
> > -	 */
> > -	kvm->mmu_invalidate_in_progress++;
> >  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> >  		kvm->mmu_invalidate_range_start = start;
> >  		kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> >  	}
> >  }
> >  
> > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t end)
> 
> Splitting the helpers this way yields a weird API overall, e.g. it's possible
> (common, actually) to have an "end" without a "begin".
> 
> Taking the range in the "end" is also dangerous/misleading/imbalanced, because _if_
> there are multiple ranges in a batch, each range would need to be unwound
> independently, e.g. the invocation of the "end" helper in
> kvm_mmu_notifier_invalidate_range_end() is flat out wrong, it just doesn't cause
> problems because KVM doesn't (currently) try to unwind regions (and probably never
> will, but that's beside the point).

I actually also don't feel good with existing code (taking range in the
"start" and "end") but didn't go further to find a better solution.

> 
> Rather than shunt what is effectively the "begin" into a separate helper, provide
> three separate APIs, e.g. begin, range_add, end.  That way, begin+end don't take a
> range and thus are symmetrical, always paired, and can't screw up unwinding since
> they don't have a range to unwind.

This looks much better to me.

> 
> It'll require three calls in every case, but that's not the end of the world since
> none of these flows are super hot paths.
> 
> > +{
> > +	/*
> > +	 * The count increase must become visible at unlock time as no
> > +	 * spte can be established without taking the mmu_lock and
> > +	 * count is also read inside the mmu_lock critical section.
> > +	 */
> > +	kvm->mmu_invalidate_in_progress++;
> 
> This should invalidate (ha!) mmu_invalidate_range_{start,end}, and then WARN in
> mmu_invalidate_retry() if the range isn't valid.  And the "add" helper should
> WARN if mmu_invalidate_in_progress == 0.
> 
> > +}
> > +
> > +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> 
> "handle" is waaaay too generic.  Just match kvm_unmap_gfn_range() and call it
> kvm_mmu_unmap_gfn_range().  This is a local function so it's unlikely to collide
> with arch code, now or in the future.

Agreed.

> 
> > +{
> > +	update_invalidate_range(kvm, range->start, range->end);
> > +	return kvm_unmap_gfn_range(kvm, range);
> > +}
> 
> Overall, this?  Compile tested only...

Thanks!
Chao
> 
> ---
>  arch/x86/kvm/mmu/mmu.c   |  8 +++++---
>  include/linux/kvm_host.h | 33 +++++++++++++++++++++------------
>  virt/kvm/kvm_main.c      | 30 +++++++++++++++++++++---------
>  3 files changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 93c389eaf471..d4b373e3e524 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  		return true;
>  
>  	return fault->slot &&
> -	       mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> +	       mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
>  }
>  
>  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	write_lock(&kvm->mmu_lock);
>  
> -	kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_begin(kvm);
> +
> +	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
>  						   gfn_end - gfn_start);
>  
> -	kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> +	kvm_mmu_invalidate_end(kvm);
>  
>  	write_unlock(&kvm->mmu_lock);
>  }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e6e66c5e56f2..29aa6d6827cc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -778,8 +778,8 @@ struct kvm {
>  	struct mmu_notifier mmu_notifier;
>  	unsigned long mmu_invalidate_seq;
>  	long mmu_invalidate_in_progress;
> -	unsigned long mmu_invalidate_range_start;
> -	unsigned long mmu_invalidate_range_end;
> +	gfn_t mmu_invalidate_range_start;
> +	gfn_t mmu_invalidate_range_end;
>  #endif
>  	struct list_head devices;
>  	u64 manual_dirty_log_protect;
> @@ -1378,10 +1378,9 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
>  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  #endif
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end);
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end);
> +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_mmu_invalidate_end(struct kvm *kvm);
>  
>  long kvm_arch_dev_ioctl(struct file *filp,
>  			unsigned int ioctl, unsigned long arg);
> @@ -1952,9 +1951,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
>  	return 0;
>  }
>  
> -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
>  					   unsigned long mmu_seq,
> -					   unsigned long hva)
> +					   gfn_t gfn)
>  {
>  	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
> @@ -1963,10 +1962,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	 * that might be being invalidated. Note that it may include some false
>  	 * positives, due to shortcuts when handing concurrent invalidations.
>  	 */
> -	if (unlikely(kvm->mmu_invalidate_in_progress) &&
> -	    hva >= kvm->mmu_invalidate_range_start &&
> -	    hva < kvm->mmu_invalidate_range_end)
> -		return 1;
> +	if (unlikely(kvm->mmu_invalidate_in_progress)) {
> +		/*
> +		 * Dropping mmu_lock after bumping mmu_invalidate_in_progress
> +		 * but before updating the range is a KVM bug.
> +		 */
> +		if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA ||
> +				 kvm->mmu_invalidate_range_end == INVALID_GPA))
> +			return 1;
> +
> +		if (gfn >= kvm->mmu_invalidate_range_start &&
> +		    gfn < kvm->mmu_invalidate_range_end)
> +			return 1;
> +	}
> +
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 43bbe4fde078..e9e03b979f77 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -540,9 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
>  
>  typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>  
> -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> -			     unsigned long end);
> -
> +typedef void (*on_lock_fn_t)(struct kvm *kvm);
>  typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>  
>  struct kvm_hva_range {
> @@ -628,7 +626,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>  				locked = true;
>  				KVM_MMU_LOCK(kvm);
>  				if (!IS_KVM_NULL_FN(range->on_lock))
> -					range->on_lock(kvm, range->start, range->end);
> +					range->on_lock(kvm);
> +
>  				if (IS_KVM_NULL_FN(range->handler))
>  					break;
>  			}
> @@ -715,8 +714,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -			      unsigned long end)
> +void kvm_mmu_invalidate_begin(struct kvm *kvm)
>  {
>  	/*
>  	 * The count increase must become visible at unlock time as no
> @@ -724,6 +722,15 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	 * count is also read inside the mmu_lock critical section.
>  	 */
>  	kvm->mmu_invalidate_in_progress++;
> +
> +	kvm->mmu_invalidate_range_start = INVALID_GPA;
> +	kvm->mmu_invalidate_range_end = INVALID_GPA;
> +}
> +
> +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> +
>  	if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>  		kvm->mmu_invalidate_range_start = start;
>  		kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +751,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  	}
>  }
>  
> +static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> +	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> +	return kvm_unmap_gfn_range(kvm, range);
> +}
> +
>  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
> @@ -752,7 +765,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  		.start		= range->start,
>  		.end		= range->end,
>  		.pte		= __pte(0),
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= kvm_mmu_unmap_gfn_range,
>  		.on_lock	= kvm_mmu_invalidate_begin,
>  		.on_unlock	= kvm_arch_guest_memory_reclaimed,
>  		.flush_on_ret	= true,
> @@ -791,8 +804,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	return 0;
>  }
>  
> -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> -			    unsigned long end)
> +void kvm_mmu_invalidate_end(struct kvm *kvm)
>  {
>  	/*
>  	 * This sequence increase will notify the kvm page fault that
> 
> base-commit: d663b8a285986072428a6a145e5994bc275df994
> -- 
>

[v9,4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

Commit Message

Comments

Patch