[v2,2/7] KVM: arm64: Add FEAT_TLBIRANGE support

Message ID	20230206172340.2639971-3-rananta@google.com (mailing list archive)
State	New, archived
Headers
	Return-Path: <kvm-owner@vger.kernel.org>
	Date: Mon, 6 Feb 2023 17:23:35 +0000
	In-Reply-To: <20230206172340.2639971-1-rananta@google.com>
	Mime-Version: 1.0
	References: <20230206172340.2639971-1-rananta@google.com>
	Message-ID: <20230206172340.2639971-3-rananta@google.com>
	Subject: [PATCH v2 2/7] KVM: arm64: Add FEAT_TLBIRANGE support
	From: Raghavendra Rao Ananta <rananta@google.com>
	To: Oliver Upton <oupton@google.com>, Marc Zyngier <maz@kernel.org>, Ricardo Koller <ricarkol@google.com>, Reiji Watanabe <reijiw@google.com>, James Morse <james.morse@arm.com>, Alexandru Elisei <alexandru.elisei@arm.com>, Suzuki K Poulose <suzuki.poulose@arm.com>, Will Deacon <will@kernel.org>
	Cc: Paolo Bonzini <pbonzini@redhat.com>, Catalin Marinas <catalin.marinas@arm.com>, Jing Zhang <jingzhangos@google.com>, Colton Lewis <coltonlewis@google.com>, Raghavendra Rao Anata <rananta@google.com>, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
	Content-Type: text/plain; charset="UTF-8"
	Precedence: bulk
Series	KVM: arm64: Add support for FEAT_TLBIRANGE
	[v2,0/7] KVM: arm64: Add support for FEAT_TLBIRANGE
	[v2,1/7] arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range
	[v2,2/7] KVM: arm64: Add FEAT_TLBIRANGE support
	[v2,3/7] KVM: arm64: Implement __kvm_tlb_flush_range_vmid_ipa()
	[v2,4/7] KVM: arm64: Implement kvm_arch_flush_remote_tlbs_range()
	[v2,5/7] KVM: arm64: Flush only the memslot after write-protect
	[v2,6/7] KVM: arm64: Break the table entries using TLBI range instructions
	[v2,7/7] KVM: arm64: Create a fast stage-2 unmap path

Raghavendra Rao Ananta Feb. 6, 2023, 5:23 p.m. UTC

Define a generic function __kvm_tlb_flush_range() to
invalidate the TLBs over a range of addresses. The
implementation accepts 'op' as a generic TLBI operation.
Upcoming patches will use this to implement IPA based
TLB invalidations (ipas2e1is).

If the system doesn't support FEAT_TLBIRANGE, the
implementation falls back to flushing the pages one by one
for the range supplied.

Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
---
 arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Oliver Upton March 30, 2023, 1:19 a.m. UTC | #1

On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> Define a generic function __kvm_tlb_flush_range() to
> invalidate the TLBs over a range of addresses. The
> implementation accepts 'op' as a generic TLBI operation.
> Upcoming patches will use this to implement IPA based
> TLB invalidations (ipas2e1is).
> 
> If the system doesn't support FEAT_TLBIRANGE, the
> implementation falls back to flushing the pages one by one
> for the range supplied.
> 
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> ---
>  arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 43c3bc0f9544d..995ff048e8851 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
>  DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
>  #define __bp_harden_hyp_vecs	CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
>  
> +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do {	\
> +	unsigned long pages, stride;						\
> +										\
> +	stride = kvm_granule_size(level);					\

Hmm... There's a rather subtle and annoying complication here that I
don't believe is handled.

Similar to what I said in the last spin of the series, there is no
guarantee that a range of IPAs is mapped at the exact same level
throughout. Dirty logging and memslots that aren't hugepage aligned
could lead to a mix of mapping levels being used within a range of the
IPA space.

> +	start = round_down(start, stride);					\
> +	end = round_up(end, stride);						\
> +	pages = (end - start) >> PAGE_SHIFT;					\
> +										\
> +	if ((!system_supports_tlb_range() &&					\
> +	     (end - start) >= (MAX_TLBI_OPS * stride)) ||			\

Doesn't checking for TLBIRANGE above eliminate the need to test against
MAX_TLBI_OPS?

> +	    pages >= MAX_TLBI_RANGE_PAGES) {					\
> +		__kvm_tlb_flush_vmid(mmu);					\
> +		break;								\
> +	}									\
> +										\
> +	__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false);	\
> +} while (0)
> +
>  extern void __kvm_flush_vm_context(void);
>  extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
>  extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> -- 
> 2.39.1.519.gcb327c4b5f-goog
> 
>

Raghavendra Rao Ananta April 3, 2023, 5:26 p.m. UTC | #2

Hi Oliver,

On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> > Define a generic function __kvm_tlb_flush_range() to
> > invalidate the TLBs over a range of addresses. The
> > implementation accepts 'op' as a generic TLBI operation.
> > Upcoming patches will use this to implement IPA based
> > TLB invalidations (ipas2e1is).
> >
> > If the system doesn't support FEAT_TLBIRANGE, the
> > implementation falls back to flushing the pages one by one
> > for the range supplied.
> >
> > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > index 43c3bc0f9544d..995ff048e8851 100644
> > --- a/arch/arm64/include/asm/kvm_asm.h
> > +++ b/arch/arm64/include/asm/kvm_asm.h
> > @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
> >  DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
> >  #define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
> >
> > +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do {    \
> > +     unsigned long pages, stride;                                            \
> > +                                                                             \
> > +     stride = kvm_granule_size(level);                                       \
>
> Hmm... There's a rather subtle and annoying complication here that I
> don't believe is handled.
>
> Similar to what I said in the last spin of the series, there is no
> guarantee that a range of IPAs is mapped at the exact same level
> throughout. Dirty logging and memslots that aren't hugepage aligned
> could lead to a mix of mapping levels being used within a range of the
> IPA space.
>
Unlike the comment on v1, the level/stride here is used to jump the
addresses in case the system doesn't support TLBIRANGE. The TTL hint
is 0.
That being said, do you think we can always assume the least possible
stride (say, 4k) and hardcode it?
With respect to alignment, since the function is only called while
breaking the table PTE, do you think it'll still be a problem even if
we go with the smallest-granularity stride?

> > +     start = round_down(start, stride);                                      \
> > +     end = round_up(end, stride);                                            \
> > +     pages = (end - start) >> PAGE_SHIFT;                                    \
> > +                                                                             \
> > +     if ((!system_supports_tlb_range() &&                                    \
> > +          (end - start) >= (MAX_TLBI_OPS * stride)) ||                       \
>
> Doesn't checking for TLBIRANGE above eliminate the need to test against
> MAX_TLBI_OPS?
>
Derived from __flush_tlb_range(), I think the condition is used to
just flush everything if the range is too large to iterate and flush
when the system doesn't support TLBIRANGE. Probably to prevent
soft-lockups?

Thank you.
Raghavendra
> > +         pages >= MAX_TLBI_RANGE_PAGES) {                                    \
> > +             __kvm_tlb_flush_vmid(mmu);                                      \
> > +             break;                                                          \
> > +     }                                                                       \
> > +                                                                             \
> > +     __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false);    \
> > +} while (0)
> > +
> >  extern void __kvm_flush_vm_context(void);
> >  extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
> >  extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
> > --
> > 2.39.1.519.gcb327c4b5f-goog
> >
> >
>
> --
> Thanks,
> Oliver

Oliver Upton April 4, 2023, 6:41 p.m. UTC | #3

On Mon, Apr 03, 2023 at 10:26:01AM -0700, Raghavendra Rao Ananta wrote:
> Hi Oliver,
> 
> On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> > > Define a generic function __kvm_tlb_flush_range() to
> > > invalidate the TLBs over a range of addresses. The
> > > implementation accepts 'op' as a generic TLBI operation.
> > > Upcoming patches will use this to implement IPA based
> > > TLB invalidations (ipas2e1is).
> > >
> > > If the system doesn't support FEAT_TLBIRANGE, the
> > > implementation falls back to flushing the pages one by one
> > > for the range supplied.
> > >
> > > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
> > >  1 file changed, 18 insertions(+)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > > index 43c3bc0f9544d..995ff048e8851 100644
> > > --- a/arch/arm64/include/asm/kvm_asm.h
> > > +++ b/arch/arm64/include/asm/kvm_asm.h
> > > @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
> > >  DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
> > >  #define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
> > >
> > > +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do {    \
> > > +     unsigned long pages, stride;                                            \
> > > +                                                                             \
> > > +     stride = kvm_granule_size(level);                                       \
> >
> > Hmm... There's a rather subtle and annoying complication here that I
> > don't believe is handled.
> >
> > Similar to what I said in the last spin of the series, there is no
> > guarantee that a range of IPAs is mapped at the exact same level
> > throughout. Dirty logging and memslots that aren't hugepage aligned
> > could lead to a mix of mapping levels being used within a range of the
> > IPA space.
> >
> Unlike the comment on v1, the level/stride here is used to jump the
> addresses in case the system doesn't support TLBIRANGE. The TTL hint
> is 0.

Right. So we agree that the level is not uniform throughout the provided
range. The invalidation by IPA is also used if 'pages' is odd, even on
systems with TLBIRANGE. We must assume the worst case here, in that the
TLBI by IPA invalidated a single PTE-level entry. You could wind up
over-invalidating in that case, but you'd still be correct.

> That being said, do you think we can always assume the least possible
> stride (say, 4k) and hardcode it?
> With respect to alignment, since the function is only called while
> breaking the table PTE,  do you think it'll still be a problem even if
> we go with the least granularity stride?

I believe so. If we want to apply the range-based invalidations generally
in KVM then we will not always be dealing with a block-aligned chunk of
the address space.

> > > +     start = round_down(start, stride);                                      \
> > > +     end = round_up(end, stride);                                            \
> > > +     pages = (end - start) >> PAGE_SHIFT;                                    \
> > > +                                                                             \
> > > +     if ((!system_supports_tlb_range() &&                                    \
> > > +          (end - start) >= (MAX_TLBI_OPS * stride)) ||                       \
> >
> > Doesn't checking for TLBIRANGE above eliminate the need to test against
> > MAX_TLBI_OPS?
> >
> Derived from __flush_tlb_range(), I think the condition is used to
> just flush everything if the range is too large to iterate and flush
> when the system doesn't support TLBIRANGE. Probably to prevent
> soft-lockups?

Right, but you test above for system_supports_tlb_range(), meaning that
you'd unconditionally call __kvm_tlb_flush_vmid() below.

> > > +         pages >= MAX_TLBI_RANGE_PAGES) {                                    \
> > > +             __kvm_tlb_flush_vmid(mmu);                                      \
> > > +             break;                                                          \
> > > +     }                                                                       \
> > > +                                                                             \
> > > +     __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false);    \
> > > +} while (0)
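
The point above about odd page counts can be illustrated with a small userspace model. This is only a sketch, loosely modeled on the loop that patch 1/7 factors out of __flush_tlb_range(); it counts operations rather than issuing TLBIs. All sketch_* names are hypothetical, 4K pages are assumed, and the encoding "(num + 1) << (5 * scale + 1)" with a 5-bit num field and scale 0..3 is an assumption about the range-op format, not something stated in this thread. The caller is assumed to pass fewer than 2^21 pages and a stride of at least one page.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4K pages */

struct sketch_tlbi_count {
	unsigned long single; /* by-address invalidations, one stride each */
	unsigned long ranged; /* range invalidations */
};

/* Pages covered by one range op under the assumed encoding. */
static unsigned long sketch_range_pages(int num, int scale)
{
	return (unsigned long)(num + 1) << (5 * scale + 1);
}

/*
 * Count the operations needed for 'pages' pages.
 * Precondition (assumed): pages < 2^21 and stride >= one page.
 */
static struct sketch_tlbi_count
sketch_count_ops(bool range_supported, unsigned long pages, unsigned long stride)
{
	struct sketch_tlbi_count c = { 0, 0 };
	unsigned long step = stride >> SKETCH_PAGE_SHIFT;
	int scale = 0;

	if (step == 0)
		step = 1; /* treat sub-page strides as one page */

	while (pages > 0 && scale <= 3) {
		/* No range support, or an odd page count: invalidate one stride. */
		if (!range_supported || (pages & 1)) {
			c.single++;
			pages -= (step < pages) ? step : pages;
			continue;
		}

		/* Otherwise cover as much as this scale's 5-bit field allows. */
		int num = (int)((pages >> (5 * scale + 1)) & 0x1f) - 1;
		if (num >= 0) {
			c.ranged++;
			pages -= sketch_range_pages(num, scale);
		}
		scale++;
	}
	return c;
}
```

In this model, 513 pages on a range-capable system costs one by-address invalidation plus one range invalidation, which is why the by-IPA op has to be correct on its own even when TLBIRANGE is available.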

Oliver Upton April 4, 2023, 6:50 p.m. UTC | #4

On Tue, Apr 04, 2023 at 06:41:34PM +0000, Oliver Upton wrote:
> On Mon, Apr 03, 2023 at 10:26:01AM -0700, Raghavendra Rao Ananta wrote:
> > On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > +     start = round_down(start, stride);                                      \
> > > > +     end = round_up(end, stride);                                            \
> > > > +     pages = (end - start) >> PAGE_SHIFT;                                    \
> > > > +                                                                             \
> > > > +     if ((!system_supports_tlb_range() &&                                    \
> > > > +          (end - start) >= (MAX_TLBI_OPS * stride)) ||                       \
> > >
> > > Doesn't checking for TLBIRANGE above eliminate the need to test against
> > > MAX_TLBI_OPS?
> > >
> > Derived from __flush_tlb_range(), I think the condition is used to
> > just flush everything if the range is too large to iterate and flush
> > when the system doesn't support TLBIRANGE. Probably to prevent
> > soft-lockups?
> 
> Right, but you test above for system_supports_tlb_range(), meaning that
> you'd unconditionally call __kvm_tlb_flush_vmid() below.

Gah, I misread the parentheses and managed to miss your statement in the
changelog about !TLBIRANGE systems. Apologies.
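
For readers following along, the grouping of the fallback condition can be spelled out as a standalone predicate. This is a minimal sketch, not the kernel code: the sketch_* names are hypothetical, and the constant values (4K pages, 512 for MAX_TLBI_OPS, 32 << 16 for MAX_TLBI_RANGE_PAGES) are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the kernel derives these from its configuration. */
#define SKETCH_PAGE_SHIFT           12               /* assumption: 4K pages */
#define SKETCH_MAX_TLBI_OPS         512ULL           /* assumption */
#define SKETCH_MAX_TLBI_RANGE_PAGES (32ULL << 16)    /* assumption */

/*
 * Returns true when the whole VMID should be flushed instead of walking
 * the range. The grouping mirrors the patch:
 *   (!range_supported && bytes >= MAX_TLBI_OPS * stride) ||
 *   pages >= MAX_TLBI_RANGE_PAGES
 * 'stride' must be a power of two.
 */
static bool sketch_should_flush_all(bool range_supported,
				    uint64_t start, uint64_t end,
				    uint64_t stride)
{
	uint64_t pages;

	start &= ~(stride - 1);                   /* round_down(start, stride) */
	end = (end + stride - 1) & ~(stride - 1); /* round_up(end, stride) */
	pages = (end - start) >> SKETCH_PAGE_SHIFT;

	/* The MAX_TLBI_OPS limit applies only to systems without range ops. */
	if (!range_supported && (end - start) >= SKETCH_MAX_TLBI_OPS * stride)
		return true;

	/* The range-page limit applies to every system. */
	return pages >= SKETCH_MAX_TLBI_RANGE_PAGES;
}
```

Under these assumed values, a 2 MiB range with a 4K stride is walked on a range-capable system but triggers the full flush on a system without FEAT_TLBIRANGE.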

Raghavendra Rao Ananta April 4, 2023, 9:39 p.m. UTC | #5

On Tue, Apr 4, 2023 at 11:41 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Mon, Apr 03, 2023 at 10:26:01AM -0700, Raghavendra Rao Ananta wrote:
> > Hi Oliver,
> >
> > On Wed, Mar 29, 2023 at 6:19 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 05:23:35PM +0000, Raghavendra Rao Ananta wrote:
> > > > Define a generic function __kvm_tlb_flush_range() to
> > > > invalidate the TLBs over a range of addresses. The
> > > > implementation accepts 'op' as a generic TLBI operation.
> > > > Upcoming patches will use this to implement IPA based
> > > > TLB invalidations (ipas2e1is).
> > > >
> > > > If the system doesn't support FEAT_TLBIRANGE, the
> > > > implementation falls back to flushing the pages one by one
> > > > for the range supplied.
> > > >
> > > > Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> > > > ---
> > > >  arch/arm64/include/asm/kvm_asm.h | 18 ++++++++++++++++++
> > > >  1 file changed, 18 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > > > index 43c3bc0f9544d..995ff048e8851 100644
> > > > --- a/arch/arm64/include/asm/kvm_asm.h
> > > > +++ b/arch/arm64/include/asm/kvm_asm.h
> > > > @@ -221,6 +221,24 @@ DECLARE_KVM_NVHE_SYM(__per_cpu_end);
> > > >  DECLARE_KVM_HYP_SYM(__bp_harden_hyp_vecs);
> > > >  #define __bp_harden_hyp_vecs CHOOSE_HYP_SYM(__bp_harden_hyp_vecs)
> > > >
> > > > +#define __kvm_tlb_flush_range(op, mmu, start, end, level, tlb_level) do {    \
> > > > +     unsigned long pages, stride;                                            \
> > > > +                                                                             \
> > > > +     stride = kvm_granule_size(level);                                       \
> > >
> > > Hmm... There's a rather subtle and annoying complication here that I
> > > don't believe is handled.
> > >
> > > Similar to what I said in the last spin of the series, there is no
> > > guarantee that a range of IPAs is mapped at the exact same level
> > > throughout. Dirty logging and memslots that aren't hugepage aligned
> > > could lead to a mix of mapping levels being used within a range of the
> > > IPA space.
> > >
> > Unlike the comment on v1, the level/stride here is used to jump the
> > addresses in case the system doesn't support TLBIRANGE. The TTL hint
> > is 0.
>
> Right. So we agree that the level is not uniform throughout the provided
> range. The invalidation by IPA is also used if 'pages' is odd, even on
> systems with TLBIRANGE. We must assume the worst case here, in that the
> TLBI by IPA invalidated a single PTE-level entry. You could wind up
> over-invalidating in that case, but you'd still be correct.
>
Sure, let's always assume a stride of 4k. But with
over-invalidation, do you think the penalty is acceptable, especially
when invalidating, say, >2M blocks on systems without TLBIRANGE?
In __kvm_tlb_flush_vmid_range(), what if we just rely on the iterative
approach for invalidating an odd number of pages on systems with
TLBIRANGE, and for !TLBIRANGE systems simply invalidate all of the TLB
(like we do today)? Thoughts?

Thank you.
Raghavendra
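
To put rough numbers on the penalty being asked about, the count of by-IPA invalidations for a range at a given stride can be computed as below. This is a hedged sketch: sketch_tlbi_ops() is a hypothetical helper, not a kernel function, and it assumes a power-of-two stride.

```c
#include <stdint.h>

/*
 * Number of by-IPA invalidations needed to cover [start, end) when
 * stepping by 'stride' bytes (stride must be a power of two).
 */
static uint64_t sketch_tlbi_ops(uint64_t start, uint64_t end, uint64_t stride)
{
	start &= ~(stride - 1);                   /* round down to the stride */
	end = (end + stride - 1) & ~(stride - 1); /* round up to the stride */
	return (end - start) / stride;
}
```

For a single 2 MiB block, a 4K stride gives 512 operations where a 2 MiB stride gives one; for 1 GiB the 4K count is 262144. If MAX_TLBI_OPS is 512 (an assumption, matching 4K pages), the 4K-stride case for one 2 MiB block would already reach the full-VMID flush threshold in the macro under review.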


> > That being said, do you think we can always assume the least possible
> > stride (say, 4k) and hardcode it?
> > With respect to alignment, since the function is only called while
> > breaking the table PTE,  do you think it'll still be a problem even if
> > we go with the least granularity stride?
>
> I believe so. If we want to apply the range-based invalidations generally
> in KVM then we will not always be dealing with a block-aligned chunk of
> address.
>
> > > > +     start = round_down(start, stride);                                      \
> > > > +     end = round_up(end, stride);                                            \
> > > > +     pages = (end - start) >> PAGE_SHIFT;                                    \
> > > > +                                                                             \
> > > > +     if ((!system_supports_tlb_range() &&                                    \
> > > > +          (end - start) >= (MAX_TLBI_OPS * stride)) ||                       \
> > >
> > > Doesn't checking for TLBIRANGE above eliminate the need to test against
> > > MAX_TLBI_OPS?
> > >
> > Derived from __flush_tlb_range(), I think the condition is used to
> > just flush everything if the range is too large to iterate and flush
> > when the system doesn't support TLBIRANGE. Probably to prevent
> > soft-lockups?
>
> Right, but you test above for system_supports_tlb_range(), meaning that
> you'd unconditionally call __kvm_tlb_flush_vmid() below.
>
> > > > +         pages >= MAX_TLBI_RANGE_PAGES) {                                    \
> > > > +             __kvm_tlb_flush_vmid(mmu);                                      \
> > > > +             break;                                                          \
> > > > +     }                                                                       \
> > > > +                                                                             \
> > > > +     __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false);    \
> > > > +} while (0)
>
> --
> Thanks,
> Oliver
