| Message ID | 20240325185158.8565-2-will@kernel.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: arm64: TLBI fixes for the pgtable code |
Hey Will,

On Mon, Mar 25, 2024 at 06:51:56PM +0000, Will Deacon wrote:
> Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
> unmap") introduced deferred TLB invalidation for the stage-2 page-table
> so that range-based invalidation can be used for the accumulated
> addresses. This works fine if the structure of the page-tables remains
> unchanged, but if entire tables are zapped and subsequently freed then
> we transiently leave the hardware page-table walker with a reference
> to freed memory thanks to the translation walk caches.

Yikes! Well spotted.

This is rather unfortunate because the unmap path has been found to be a
massive pain w/o aggregating invalidations. But sacrificing correctness
in the name of performance... No thanks :)

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3fae5830f8d2..de0b667ba296 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -896,9 +896,11 @@ static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
>  	if (kvm_pte_valid(ctx->old)) {
>  		kvm_clear_pte(ctx->ptep);
>  
> -		if (!stage2_unmap_defer_tlb_flush(pgt))
> +		if (!stage2_unmap_defer_tlb_flush(pgt) ||
> +		    kvm_pte_table(ctx->old, ctx->level)) {
>  			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
>  				     ctx->addr, ctx->level);
> +		}

I'm not sure this is correct, though. My understanding of TTL is that
we're telling hardware where the *leaf* entry we're invalidating is
found, however here we know that the addressed PTE is a table entry. So
maybe in the case of a table PTE this invalidation should use
TLBI_TTL_UNKNOWN.

>  	}
>  
>  	mm_ops->put_page(ctx->ptep);

At least for the 'normal' MMU where we use RCU, this could be changed to
->free_unlinked_table() which would defer the freeing of memory til
after the invalidation completes. But that still hoses pKVM's stage-2
MMU freeing in-place.
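To make the TTL point concrete, here is a minimal, untested sketch of
what routing table entries to an unknown-level invalidation could look
like. It assumes __kvm_tlb_flush_vmid_ipa() tolerates TLBI_TTL_UNKNOWN
as a level argument (i.e. the TTL hint is simply left unset when the
level is out of range); treat it as an illustration of the suggestion
above, not a reviewed patch:

	if (kvm_pte_valid(ctx->old)) {
		kvm_clear_pte(ctx->ptep);

		if (kvm_pte_table(ctx->old, ctx->level)) {
			/*
			 * A table entry was zapped: the walk caches may
			 * hold translations at any level below this one,
			 * so don't claim a leaf level in the TTL hint.
			 */
			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
				     ctx->addr, TLBI_TTL_UNKNOWN);
		} else if (!stage2_unmap_defer_tlb_flush(pgt)) {
			/* Leaf entry: the level hint is accurate */
			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
				     ctx->addr, ctx->level);
		}
	}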
On Tue, Mar 26, 2024 at 01:34:17AM -0700, Oliver Upton wrote:
> >  	}
> >  
> >  	mm_ops->put_page(ctx->ptep);
> 
> At least for the 'normal' MMU where we use RCU, this could be changed to
> ->free_unlinked_table() which would defer the freeing of memory til
> after the invalidation completes. But that still hoses pKVM's stage-2
> MMU freeing in-place.

How about this (untested) diff? I _think_ it should address the
invalidation issue while leaving the performance optimization in place
for a 'normal' stage-2.

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3fae5830f8d2..896fdc0d157d 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -872,14 +872,19 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
 static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
 {
 	/*
-	 * If FEAT_TLBIRANGE is implemented, defer the individual
-	 * TLB invalidations until the entire walk is finished, and
-	 * then use the range-based TLBI instructions to do the
-	 * invalidations. Condition deferred TLB invalidation on the
-	 * system supporting FWB as the optimization is entirely
-	 * pointless when the unmap walker needs to perform CMOs.
+	 * It is possible to use FEAT_TLBIRANGE to do TLB invalidations at the
+	 * end of the walk if certain conditions are met:
+	 *
+	 * - The stage-2 is for a 'normal' VM (i.e. managed in the kernel
+	 *   context). RCU provides sufficient guarantees to ensure that all
+	 *   hardware and software references on the stage-2 page tables are
+	 *   relinquished before freeing a table page.
+	 *
+	 * - The system supports FEAT_FWB. Otherwise, KVM needs to do CMOs
+	 *   during the page table walk.
 	 */
-	return system_supports_tlb_range() && stage2_has_fwb(pgt);
+	return !is_hyp_code() && system_supports_tlb_range() &&
+		stage2_has_fwb(pgt);
 }
 
 static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
@@ -1163,7 +1168,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 					       kvm_granule_size(ctx->level));
 
 	if (childp)
-		mm_ops->put_page(childp);
+		mm_ops->free_unlinked_table(childp, ctx->level);
 
 	return 0;
 }
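For context on the ->free_unlinked_table() path the diff switches to:
for the 'normal' MMU it defers the actual free to an RCU callback,
roughly along the lines of the sketch below. This is a simplified
rendering of the stage2_free_unlinked_table() helpers in
arch/arm64/kvm/mmu.c from memory; details such as the level type and
helper names may differ from the tree:

static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);
	void *pgtable = page_to_virt(page);
	s8 level = page_private(page);

	kvm_pgtable_stage2_free_unlinked(&kvm_s2_mm_ops, pgtable, level);
}

static void stage2_free_unlinked_table(void *addr, s8 level)
{
	struct page *page = virt_to_page(addr);

	/* Stash the level and free the table only after a grace period */
	set_page_private(page, (unsigned long)level);
	call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb);
}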
On Tue, Mar 26, 2024 at 07:31:27AM -0700, Oliver Upton wrote:
> On Tue, Mar 26, 2024 at 01:34:17AM -0700, Oliver Upton wrote:
> > >  	}
> > >  
> > >  	mm_ops->put_page(ctx->ptep);
> > 
> > At least for the 'normal' MMU where we use RCU, this could be changed to
> > ->free_unlinked_table() which would defer the freeing of memory til
> > after the invalidation completes. But that still hoses pKVM's stage-2
> > MMU freeing in-place.
> 
> How about this (untested) diff? I _think_ it should address the
> invalidation issue while leaving the performance optimization in place
> for a 'normal' stage-2.
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3fae5830f8d2..896fdc0d157d 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -872,14 +872,19 @@ static void stage2_make_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t n
>  static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
>  {
>  	/*
> -	 * If FEAT_TLBIRANGE is implemented, defer the individual
> -	 * TLB invalidations until the entire walk is finished, and
> -	 * then use the range-based TLBI instructions to do the
> -	 * invalidations. Condition deferred TLB invalidation on the
> -	 * system supporting FWB as the optimization is entirely
> -	 * pointless when the unmap walker needs to perform CMOs.
> +	 * It is possible to use FEAT_TLBIRANGE to do TLB invalidations at the
> +	 * end of the walk if certain conditions are met:
> +	 *
> +	 * - The stage-2 is for a 'normal' VM (i.e. managed in the kernel
> +	 *   context). RCU provides sufficient guarantees to ensure that all
> +	 *   hardware and software references on the stage-2 page tables are
> +	 *   relinquished before freeing a table page.
> +	 *
> +	 * - The system supports FEAT_FWB. Otherwise, KVM needs to do CMOs
> +	 *   during the page table walk.
>  	 */
> -	return system_supports_tlb_range() && stage2_has_fwb(pgt);
> +	return !is_hyp_code() && system_supports_tlb_range() &&
> +		stage2_has_fwb(pgt);
>  }
>  
>  static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
> @@ -1163,7 +1168,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  					       kvm_granule_size(ctx->level));
>  
>  	if (childp)
> -		mm_ops->put_page(childp);
> +		mm_ops->free_unlinked_table(childp, ctx->level);

Hmm, but doesn't the deferred TLBI still happen after the RCU critical
section?

I also think I found another bug, so I'll send a v2 with an extra
patch...

Will
On Tue, Mar 26, 2024 at 04:10:01PM +0000, Will Deacon wrote:
> > @@ -1163,7 +1168,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >  					       kvm_granule_size(ctx->level));
> >  
> >  	if (childp)
> > -		mm_ops->put_page(childp);
> > +		mm_ops->free_unlinked_table(childp, ctx->level);
> 
> Hmm, but doesn't the deferred TLBI still happen after the RCU critical
> section?

Right... We'd need to change that too :) Although I suppose table
invalidations are relatively infrequent when compared to leaf
invalidations.

> I also think I found another bug, so I'll send a v2 with an extra
> patch...

And I'm sure there's plenty more to be found too, heh.
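To spell out the ordering hazard raised above, a toy illustration (none
of the names below are real KVM symbols): a callback queued with
call_rcu() may run any time after the last read-side critical section
ends, so TLB maintenance issued after that point races with the free.

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct toy_table {
		struct rcu_head rcu;
		u64 ptes[512];
	};

	static void toy_table_free_cb(struct rcu_head *head)
	{
		kfree(container_of(head, struct toy_table, rcu));
	}

	static void toy_unmap(struct toy_table *table)
	{
		rcu_read_lock();
		/* ...zap entries, then queue the table page for freeing... */
		call_rcu(&table->rcu, toy_table_free_cb);
		rcu_read_unlock();

		/*
		 * Too late: once the grace period completes, the callback
		 * may already have freed 'table', while a deferred range
		 * TLBI issued here has yet to evict the walk-cache entries
		 * that still point at it.
		 */
	}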
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3fae5830f8d2..de0b667ba296 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -896,9 +896,11 @@ static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
 	if (kvm_pte_valid(ctx->old)) {
 		kvm_clear_pte(ctx->ptep);
 
-		if (!stage2_unmap_defer_tlb_flush(pgt))
+		if (!stage2_unmap_defer_tlb_flush(pgt) ||
+		    kvm_pte_table(ctx->old, ctx->level)) {
 			kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
 				     ctx->addr, ctx->level);
+		}
 	}
 
 	mm_ops->put_page(ctx->ptep);
Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
unmap") introduced deferred TLB invalidation for the stage-2 page-table
so that range-based invalidation can be used for the accumulated
addresses. This works fine if the structure of the page-tables remains
unchanged, but if entire tables are zapped and subsequently freed then
we transiently leave the hardware page-table walker with a reference
to freed memory thanks to the translation walk caches.

For example, stage2_unmap_walker() will free page-table pages:

	if (childp)
		mm_ops->put_page(childp);

and issue the TLB invalidation later in kvm_pgtable_stage2_unmap():

	if (stage2_unmap_defer_tlb_flush(pgt))
		/* Perform the deferred TLB invalidations */
		kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);

For now, take the conservative approach and invalidate the TLB eagerly
when we clear a table entry.

Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kvm/hyp/pgtable.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
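For readers applying the hunk mentally, the function ends up looking
roughly like this. It is reassembled from the diff plus its surrounding
pre-patch context, not copied verbatim from the tree:

	static void stage2_unmap_put_pte(const struct kvm_pgtable_visit_ctx *ctx,
					 struct kvm_s2_mmu *mmu,
					 struct kvm_pgtable_mm_ops *mm_ops)
	{
		struct kvm_pgtable *pgt = ctx->arg;

		if (kvm_pte_valid(ctx->old)) {
			kvm_clear_pte(ctx->ptep);

			/*
			 * Table entries are now invalidated eagerly even when
			 * leaf invalidations are deferred, so the hardware
			 * walker can never chase a pointer to a freed table.
			 */
			if (!stage2_unmap_defer_tlb_flush(pgt) ||
			    kvm_pte_table(ctx->old, ctx->level)) {
				kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
					     ctx->addr, ctx->level);
			}
		}

		mm_ops->put_page(ctx->ptep);
	}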