[0/3] KVM: arm64: Load the stage-2 MMU from vcpu_load() for VHE

Message ID 20231006093600.1250986-1-oliver.upton@linux.dev

Message

Oliver Upton Oct. 6, 2023, 9:35 a.m. UTC
Unlike nVHE, there is no need to switch the stage-2 MMU around on guest
entry/exit in VHE mode as the host is running at EL2. Despite this, KVM
reloads the stage-2 MMU on every guest entry, which is needless.

This series moves the setup of the stage-2 MMU context to vcpu_load()
when running in VHE mode. This is likely to be a win across the board,
but also allows us to remove an ISB on the guest entry path for systems
with one of the speculative AT errata.
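
For reference, the rough shape of the change (a sketch only, not the
literal patch; the stage-2 load simply moves out of the entry path and
into the VHE load helper):

  void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
  {
  	/* ... existing VHE sysreg/trap loading ... */

  	/*
  	 * New: program the stage-2 context (VTTBR_EL2/VTCR_EL2) once
  	 * here, rather than on every guest entry in __kvm_vcpu_run_vhe().
  	 */
  	__load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
  }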

None of my machines affected by the AT errata are VHE-capable, so it'd
be appreciated if someone could give this series a go and make sure I
haven't wrecked anything.

Oliver Upton (3):
  KVM: arm64: Don't zero VTTBR in __tlb_switch_to_host()
  KVM: arm64: Rename helpers for VHE vCPU load/put
  KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()

 arch/arm64/include/asm/kvm_host.h  |  4 ++--
 arch/arm64/include/asm/kvm_hyp.h   |  2 ++
 arch/arm64/kvm/arm.c               |  4 ++--
 arch/arm64/kvm/hyp/vhe/switch.c    | 33 ++++++++++++++++++------------
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 11 ++++------
 arch/arm64/kvm/hyp/vhe/tlb.c       |  1 -
 6 files changed, 30 insertions(+), 25 deletions(-)


base-commit: 6465e260f48790807eef06b583b38ca9789b6072

Comments

Marc Zyngier Oct. 6, 2023, 11:13 a.m. UTC | #1
On Fri, 06 Oct 2023 10:35:57 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Unlike nVHE, there is no need to switch the stage-2 MMU around on guest
> entry/exit in VHE mode as the host is running at EL2. Despite this, KVM
> reloads the stage-2 MMU on every guest entry, which is needless.
> 
> This series moves the setup of the stage-2 MMU context to vcpu_load()
> when running in VHE mode. This is likely to be a win across the board,
> but also allows us to remove an ISB on the guest entry path for systems
> with one of the speculative AT errata.
> 
> None of my machines affected by the AT errata are VHE-capable, so it'd
> be appreciated if someone could give this series a go and make sure I
> haven't wrecked anything.

It totally breaks on my A55 board. Running a single guest seems OK,
but running a number of them concurrently makes them explode early on
(faults in EFI...)

I guess we end up running with the wrong VTTBR at times, which would
be interesting...

	M.
Marc Zyngier Oct. 6, 2023, 12:34 p.m. UTC | #2
On Fri, 06 Oct 2023 12:13:00 +0100,
Marc Zyngier <maz@kernel.org> wrote:
> 
> On Fri, 06 Oct 2023 10:35:57 +0100,
> Oliver Upton <oliver.upton@linux.dev> wrote:
> > 
> > Unlike nVHE, there is no need to switch the stage-2 MMU around on guest
> > entry/exit in VHE mode as the host is running at EL2. Despite this, KVM
> > reloads the stage-2 MMU on every guest entry, which is needless.
> > 
> > This series moves the setup of the stage-2 MMU context to vcpu_load()
> > when running in VHE mode. This is likely to be a win across the board,
> > but also allows us to remove an ISB on the guest entry path for systems
> > with one of the speculative AT errata.
> > 
> > None of my machines affected by the AT errata are VHE-capable, so it'd
> > be appreciated if someone could give this series a go and make sure I
> > haven't wrecked anything.
> 
> It totally breaks on my A55 board. Running a single guest seems OK,
> but running a number of them concurrently makes them explode early on
> (faults in EFI...)
> 
> I guess we end up running with the wrong VTTBR at times, which would
> be interesting...

Fun fact:

diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index b0cafd7c5f8f..40c84db5884a 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -195,6 +195,11 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	sysreg_restore_guest_state_vhe(guest_ctxt);
 	__debug_switch_to_guest(vcpu);
 
+	WARN_ONCE(kvm_get_vttbr(vcpu->arch.hw_mmu) != read_sysreg(vttbr_el2),
+		  "Oh crap %llx vs %llx\n",
+		  kvm_get_vttbr(vcpu->arch.hw_mmu),
+		  read_sysreg(vttbr_el2));
+
 	if (is_hyp_ctxt(vcpu))
 		vcpu_set_flag(vcpu, VCPU_HYP_CONTEXT);
 	else

[   36.190355] Oh crap 10000057a6001 vs 57a6001

My bet is that the VMID isn't allocated on first load, and everything
goes downhill from there.
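
FWIW, decoding those two values (assuming the usual VTTBR_EL2 layout,
with the VMID in bits [63:48] and CnP in bit 0):

  kvm_get_vttbr(): 0x10000057a6001  ->  VMID 1, BADDR 0x57a6000, CnP set
  VTTBR_EL2:       0x00000057a6001  ->  VMID 0, BADDR 0x57a6000, CnP set

So the page tables are right, but the stage-2 was loaded before a VMID
had been allocated, and the hardware is still carrying VMID 0. In other
words (illustrative helper, not in the tree):

  static inline u16 vttbr_vmid(u64 vttbr)
  {
  	return vttbr >> VTTBR_VMID_SHIFT;	/* VMID lives at [63:48] */
  }

  vttbr_vmid(0x10000057a6001) == 1, while vttbr_vmid(0x57a6001) == 0.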

	M.
Marc Zyngier Oct. 6, 2023, 1:33 p.m. UTC | #3
On Fri, 06 Oct 2023 13:34:34 +0100,
Marc Zyngier <maz@kernel.org> wrote:
> 
> On Fri, 06 Oct 2023 12:13:00 +0100,
> Marc Zyngier <maz@kernel.org> wrote:
> > 
> > On Fri, 06 Oct 2023 10:35:57 +0100,
> > Oliver Upton <oliver.upton@linux.dev> wrote:
> > > 
> > > Unlike nVHE, there is no need to switch the stage-2 MMU around on guest
> > > entry/exit in VHE mode as the host is running at EL2. Despite this, KVM
> > > reloads the stage-2 MMU on every guest entry, which is needless.
> > > 
> > > This series moves the setup of the stage-2 MMU context to vcpu_load()
> > > when running in VHE mode. This is likely to be a win across the board,
> > > but also allows us to remove an ISB on the guest entry path for systems
> > > with one of the speculative AT errata.
> > > 
> > > None of my machines affected by the AT errata are VHE-capable, so it'd
> > > be appreciated if someone could give this series a go and make sure I
> > > haven't wrecked anything.
> > 
> > It totally breaks on my A55 board. Running a single guest seems OK,
> > but running a number of them concurrently makes them explode early on
> > (faults in EFI...)
> > 
> > I guess we end up running with the wrong VTTBR at times, which would
> > be interesting...
> 
> Fun fact:
> 
> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> index b0cafd7c5f8f..40c84db5884a 100644
> --- a/arch/arm64/kvm/hyp/vhe/switch.c
> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> @@ -195,6 +195,11 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
>  	sysreg_restore_guest_state_vhe(guest_ctxt);
>  	__debug_switch_to_guest(vcpu);
>  
> +	WARN_ONCE(kvm_get_vttbr(vcpu->arch.hw_mmu) != read_sysreg(vttbr_el2),
> +		  "Oh crap %llx vs %llx\n",
> +		  kvm_get_vttbr(vcpu->arch.hw_mmu),
> +		  read_sysreg(vttbr_el2));
> +
>  	if (is_hyp_ctxt(vcpu))
>  		vcpu_set_flag(vcpu, VCPU_HYP_CONTEXT);
>  	else
> 
> [   36.190355] Oh crap 10000057a6001 vs 57a6001
> 
> My bet is that the VMID isn't allocated on first load, and everything
> goes downhill from there.

So I was correct that the VMID isn't allocated on the first run, and
the following patch should address that particular problem:

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e2d38c7d6555..759adee42018 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1025,7 +1025,7 @@ int kvm_arm_pvtime_has_attr(struct kvm_vcpu *vcpu,
 extern unsigned int __ro_after_init kvm_arm_vmid_bits;
 int __init kvm_arm_vmid_alloc_init(void);
 void __init kvm_arm_vmid_alloc_free(void);
-void kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid);
+bool kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid);
 void kvm_arm_vmid_clear_active(void);
 
 static inline void kvm_arm_pvtime_vcpu_init(struct kvm_vcpu_arch *vcpu_arch)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 39c969c05990..584be562b1d4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -950,7 +950,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * making a thread's VMID inactive. So we need to call
 		 * kvm_arm_vmid_update() in non-premptible context.
 		 */
-		kvm_arm_vmid_update(&vcpu->arch.hw_mmu->vmid);
+		if (kvm_arm_vmid_update(&vcpu->arch.hw_mmu->vmid) &&
+		    has_vhe())
+			__load_stage2(vcpu->arch.hw_mmu,
+				      vcpu->arch.hw_mmu->arch);
 
 		kvm_pmu_flush_hwstate(vcpu);
 
diff --git a/arch/arm64/kvm/vmid.c b/arch/arm64/kvm/vmid.c
index 7fe8ba1a2851..281e4f86d9a2 100644
--- a/arch/arm64/kvm/vmid.c
+++ b/arch/arm64/kvm/vmid.c
@@ -135,10 +135,11 @@ void kvm_arm_vmid_clear_active(void)
 	atomic64_set(this_cpu_ptr(&active_vmids), VMID_ACTIVE_INVALID);
 }
 
-void kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid)
+bool kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid)
 {
 	unsigned long flags;
 	u64 vmid, old_active_vmid;
+	bool updated = false;
 
 	vmid = atomic64_read(&kvm_vmid->id);
 
@@ -156,17 +157,21 @@ void kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid)
 	if (old_active_vmid != 0 && vmid_gen_match(vmid) &&
 	    0 != atomic64_cmpxchg_relaxed(this_cpu_ptr(&active_vmids),
 					  old_active_vmid, vmid))
-		return;
+		return false;
 
 	raw_spin_lock_irqsave(&cpu_vmid_lock, flags);
 
 	/* Check that our VMID belongs to the current generation. */
 	vmid = atomic64_read(&kvm_vmid->id);
-	if (!vmid_gen_match(vmid))
+	if (!vmid_gen_match(vmid)) {
 		vmid = new_vmid(kvm_vmid);
+		updated = true;
+	}
 
 	atomic64_set(this_cpu_ptr(&active_vmids), vmid);
 	raw_spin_unlock_irqrestore(&cpu_vmid_lock, flags);
+
+	return updated;
 }
 
 /*

However, this isn't enough.

[   63.450113] Oh crap 400000435c001 vs 3000004430001

So there are situations where we end up with the wrong VTTBR, rather
than the wrong VMID, which is even worse. Haven't worked out the
scenario yet, but it apparently involves being preempted by a vcpu
from a different VM and not doing the right thing.

	M.
Marc Zyngier Oct. 6, 2023, 3:03 p.m. UTC | #4
On Fri, 06 Oct 2023 14:33:04 +0100,
Marc Zyngier <maz@kernel.org> wrote:

Still talking to myself! :D

[...]

>
> However, this isn't enough.
> 
> [   63.450113] Oh crap 400000435c001 vs 3000004430001
> 
> So there are situations where we end up with the wrong VTTBR, rather
> than the wrong VMID, which is even worse. Haven't worked out the
> scenario yet, but it apparently involves being preempted by a vcpu
> from a different VM and not doing the right thing.

Actually, no. It is the MMU notifiers kicking in and performing TLB
invalidation for a guest while we're in the context of another. The
joy of running 4 large VMs on a box with 2GB of RAM, basically running
from swap.

The trace looks like this:

[   66.147484] Call trace:
[   66.149899]  dump_backtrace+0xa0/0x128
[   66.153607]  show_stack+0x20/0x38
[   66.156884]  dump_stack_lvl+0x78/0xc8
[   66.160507]  dump_stack+0x18/0x28
[   66.163784]  __tlb_switch_to_guest+0x50/0x148
[   66.168097]  __kvm_tlb_flush_vmid_ipa+0x3c/0xc8
[   66.172582]  stage2_unmap_put_pte+0xd0/0xe8
[   66.176722]  stage2_unmap_walker+0x160/0x1c0
[   66.180948]  __kvm_pgtable_visit+0x170/0x1f8
[   66.185174]  __kvm_pgtable_walk+0x94/0xc8
[   66.189142]  __kvm_pgtable_visit+0xd8/0x1f8
[   66.193282]  __kvm_pgtable_walk+0x94/0xc8
[   66.197249]  __kvm_pgtable_visit+0xd8/0x1f8
[   66.201389]  __kvm_pgtable_walk+0x94/0xc8
[   66.205357]  kvm_pgtable_walk+0xd4/0x170
[   66.209238]  kvm_pgtable_stage2_unmap+0x54/0xd0
[   66.213723]  stage2_apply_range+0x9c/0x108
[   66.217777]  __unmap_stage2_range+0x34/0x70
[   66.221917]  kvm_unmap_gfn_range+0x38/0x58
[   66.225970]  kvm_mmu_notifier_invalidate_range_start+0xe8/0x310
[   66.231835]  mn_hlist_invalidate_range_start+0x80/0x158
[   66.237010]  __mmu_notifier_invalidate_range_start+0x40/0x78
[   66.242617]  try_to_migrate_one+0x8b0/0xa10
[   66.246757]  rmap_walk_anon+0xec/0x268
[   66.250465]  try_to_migrate+0xc8/0x120
[   66.254174]  migrate_folio_unmap+0x180/0x438
[   66.258401]  migrate_pages_batch+0x14c/0x798
[   66.262627]  migrate_pages_sync+0x8c/0x258
[   66.266680]  migrate_pages+0x4f0/0x690
[   66.270389]  compact_zone+0x1d8/0x6b8
[   66.274012]  compact_zone_order+0xa0/0xf0
[   66.277979]  try_to_compact_pages+0xfc/0x378
[   66.282205]  __alloc_pages_direct_compact+0x80/0x398
[   66.287122]  __alloc_pages_slowpath.constprop.0+0x328/0x868
[   66.292642]  __alloc_pages+0x2cc/0x358
[   66.296350]  __folio_alloc+0x24/0x68
[   66.299887]  vma_alloc_folio+0x2ac/0x340
[   66.303768]  do_huge_pmd_anonymous_page+0xb0/0x3b8
[   66.308512]  __handle_mm_fault+0x31c/0x358
[   66.312566]  handle_mm_fault+0x64/0x270
[   66.316360]  faultin_page+0x74/0x130
[   66.319897]  __get_user_pages+0xc8/0x340
[   66.323778]  get_user_pages_unlocked+0xc8/0x3b8
[   66.328263]  hva_to_pfn+0xfc/0x338
[   66.331627]  __gfn_to_pfn_memslot+0xa8/0x100
[   66.335853]  user_mem_abort+0x17c/0x7c0
[   66.339648]  kvm_handle_guest_abort+0x2f4/0x3d8
[   66.344133]  handle_trap_exceptions+0x44/0xb8
[   66.348445]  handle_exit+0x4c/0x118
[   66.351895]  kvm_arch_vcpu_ioctl_run+0x24c/0x5e0
[   66.356467]  kvm_vcpu_ioctl+0x28c/0x9e0
[   66.360262]  __arm64_sys_ioctl+0xc0/0xe8
[   66.364143]  invoke_syscall+0x50/0x128
[   66.367852]  el0_svc_common.constprop.0+0x48/0xf0
[   66.372509]  do_el0_svc+0x24/0x38
[   66.375787]  el0_svc+0x3c/0xe0
[   66.378805]  el0t_64_sync_handler+0xc0/0xc8
[   66.382945]  el0t_64_sync+0x1a4/0x1a8

Which says it all: we're reclaiming pages from a guest while faulting
on another, resulting in the wrong MMU setup being left behind. Damn!
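
Spelled out, the interleaving is roughly this (illustrative):

  vcpu_load() for a vCPU of VM A      VTTBR_EL2 <- VM A's stage-2
  stage-2 fault on VM A               user_mem_abort() needs memory and
                                      ends up in compaction/migration
  MMU notifier fires for VM B         __kvm_tlb_flush_vmid_ipa() calls
                                      __tlb_switch_to_guest(B's mmu),
                                      VTTBR_EL2 <- VM B's stage-2
  __tlb_switch_to_host()              restores HCR_EL2 only
  VM A's vCPU re-enters the guest     still on VM B's stage-2, as nothing
                                      reloads it on the entry path any more

Hence the tlb.c hunk below, which has __tlb_switch_to_host() put the
running vCPU's MMU back.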

Here's the sum of my hacks, which keeps the box alive.

Thanks,

	M.

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e2d38c7d6555..759adee42018 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1025,7 +1025,7 @@ int kvm_arm_pvtime_has_attr(struct kvm_vcpu *vcpu,
 extern unsigned int __ro_after_init kvm_arm_vmid_bits;
 int __init kvm_arm_vmid_alloc_init(void);
 void __init kvm_arm_vmid_alloc_free(void);
-void kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid);
+bool kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid);
 void kvm_arm_vmid_clear_active(void);
 
 static inline void kvm_arm_pvtime_vcpu_init(struct kvm_vcpu_arch *vcpu_arch)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 39c969c05990..584be562b1d4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -950,7 +950,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * making a thread's VMID inactive. So we need to call
 		 * kvm_arm_vmid_update() in non-premptible context.
 		 */
-		kvm_arm_vmid_update(&vcpu->arch.hw_mmu->vmid);
+		if (kvm_arm_vmid_update(&vcpu->arch.hw_mmu->vmid) &&
+		    has_vhe())
+			__load_stage2(vcpu->arch.hw_mmu,
+				      vcpu->arch.hw_mmu->arch);
 
 		kvm_pmu_flush_hwstate(vcpu);
 
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index f3f2e142e4f4..bc510c7e53f1 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -12,6 +12,7 @@
 
 struct tlb_inv_context {
 	unsigned long	flags;
+	struct kvm_s2_mmu *mmu;
 	u64		tcr;
 	u64		sctlr;
 };
@@ -19,10 +20,16 @@ struct tlb_inv_context {
 static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,
 				  struct tlb_inv_context *cxt)
 {
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 	u64 val;
 
 	local_irq_save(cxt->flags);
 
+	if (vcpu && mmu != vcpu->arch.hw_mmu)
+		cxt->mmu = vcpu->arch.hw_mmu;
+	else
+		cxt->mmu = NULL;
+
 	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
 		/*
 		 * For CPUs that are affected by ARM errata 1165522 or 1530923,
@@ -64,8 +71,10 @@ static void __tlb_switch_to_host(struct tlb_inv_context *cxt)
 {
 	/*
 	 * We're done with the TLB operation, let's restore the host's
-	 * view of HCR_EL2.
+	 * view of HCR_EL2 and current S2 MMU context.
 	 */
+	if (cxt->mmu)
+		__load_stage2(cxt->mmu, cxt->mmu->arch);
 	write_sysreg(HCR_HOST_VHE_FLAGS, hcr_el2);
 	isb();
 
diff --git a/arch/arm64/kvm/vmid.c b/arch/arm64/kvm/vmid.c
index 7fe8ba1a2851..806223b7022a 100644
--- a/arch/arm64/kvm/vmid.c
+++ b/arch/arm64/kvm/vmid.c
@@ -135,10 +135,11 @@ void kvm_arm_vmid_clear_active(void)
 	atomic64_set(this_cpu_ptr(&active_vmids), VMID_ACTIVE_INVALID);
 }
 
-void kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid)
+bool kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid)
 {
 	unsigned long flags;
 	u64 vmid, old_active_vmid;
+	bool updated = false;
 
 	vmid = atomic64_read(&kvm_vmid->id);
 
@@ -156,17 +157,21 @@ void kvm_arm_vmid_update(struct kvm_vmid *kvm_vmid)
 	if (old_active_vmid != 0 && vmid_gen_match(vmid) &&
 	    0 != atomic64_cmpxchg_relaxed(this_cpu_ptr(&active_vmids),
 					  old_active_vmid, vmid))
-		return;
+		return false;
 
 	raw_spin_lock_irqsave(&cpu_vmid_lock, flags);
 
 	/* Check that our VMID belongs to the current generation. */
 	vmid = atomic64_read(&kvm_vmid->id);
-	if (!vmid_gen_match(vmid))
+	if (!vmid_gen_match(vmid)) {
 		vmid = new_vmid(kvm_vmid);
+		updated = true;
+	}
 
 	atomic64_set(this_cpu_ptr(&active_vmids), vmid);
 	raw_spin_unlock_irqrestore(&cpu_vmid_lock, flags);
+
+	return updated;
 }
 
 /*
Oliver Upton Oct. 6, 2023, 6:11 p.m. UTC | #5
Hey Marc,

On Fri, Oct 06, 2023 at 04:03:32PM +0100, Marc Zyngier wrote:
> On Fri, 06 Oct 2023 14:33:04 +0100,
> Marc Zyngier <maz@kernel.org> wrote:
> 
> Still talking to myself! :D

Seems like you had a good bit of fun :)

> >
> > However, this isn't enough.
> > 
> > [   63.450113] Oh crap 400000435c001 vs 3000004430001
> > 
> > So there are situations where we end up with the wrong VTTBR, rather
> > than the wrong VMID, which is even worse. Haven't worked out the
> > scenario yet, but it apparently involves being preempted by a vcpu
> > from a different VM and not doing the right thing.
> 
> Actually, no. It is the MMU notifiers kicking in and performing TLB
> invalidation for a guest while we're in the context of another. The
> joy of running 4 large VMs on a box with 2GB of RAM, basically running
> from swap.

Whelp, looks like my self-rule of no patches on the list after midnight
is in force again. Clearly this was all quite gently tested, thanks for
being the guinea pig.

> Here's the sum of my hacks, which keeps the box alive.

Thanks! I'll roll it into v2 so we have something that actually works.