
[2/2] KVM: arm64: Ensure TLBI uses correct VMID after changing context

Message ID 20240814123429.20457-3-will@kernel.org (mailing list archive)
State New, archived
Series KVM: arm64: Fix a couple of pKVM/nVHE TLB invalidation bugs

Commit Message

Will Deacon Aug. 14, 2024, 12:34 p.m. UTC
When the target context passed to enter_vmid_context() matches the
current running context, the function returns early without manipulating
the registers of the stage-2 MMU. This can result in a stale VMID due to
the lack of an ISB instruction in exit_vmid_context() after writing the
VTTBR when ARM64_WORKAROUND_SPECULATIVE_AT is not enabled.

For example, with pKVM enabled:

	// Initially running in host context
	enter_vmid_context(guest);
		-> __load_stage2(guest); isb	// Writes VTCR & VTTBR
	exit_vmid_context(guest);
		-> __load_stage2(host);		// Restores VTCR & VTTBR

	enter_vmid_context(host);
		-> Returns early as we're already in host context
	tlbi vmalls12e1is	// !!! Can use the stale VMID as we
				// haven't performed context
				// synchronisation since restoring
				// VTTBR.VMID

Add an unconditional ISB instruction to exit_vmid_context() after
restoring the VTTBR. This already existed for the
ARM64_WORKAROUND_SPECULATIVE_AT path, so we can simply hoist that onto
the common path.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Fuad Tabba <tabba@google.com>
Fixes: 58f3b0fc3b87 ("KVM: arm64: Support TLB invalidation in guest context")
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kvm/hyp/nvhe/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Marc Zyngier Aug. 14, 2024, 1:30 p.m. UTC | #1
Hi Will,

On Wed, 14 Aug 2024 13:34:29 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> When the target context passed to enter_vmid_context() matches the
> current running context, the function returns early without manipulating
> the registers of the stage-2 MMU. This can result in a stale VMID due to
> the lack of an ISB instruction in exit_vmid_context() after writing the

This refers to the exit_vmid_context() *before* a subsequent
enter_vmid_context(), right?

> VTTBR when ARM64_WORKAROUND_SPECULATIVE_AT is not enabled.

I'm not sure I completely get the failure mode, so please bear with me.

> 
> For example, with pKVM enabled:
> 
> 	// Initially running in host context
> 	enter_vmid_context(guest);
> 		-> __load_stage2(guest); isb	// Writes VTCR & VTTBR
> 	exit_vmid_context(guest);
> 		-> __load_stage2(host);		// Restores VTCR & VTTBR
> 
> 	enter_vmid_context(host);
> 		-> Returns early as we're already in host context

Is this the code you are referring to?

	vcpu = host_ctxt->__hyp_running_vcpu;

	[...]

	if (vcpu) {
		/*
		 * We're in guest context. However, for this to work, this needs
		 * to be called from within __kvm_vcpu_run(), which ensures that
		 * __hyp_running_vcpu is set to the current guest vcpu.
		 */
		if (mmu == vcpu->arch.hw_mmu || WARN_ON(mmu != host_s2_mmu))
			return;

		cxt->mmu = vcpu->arch.hw_mmu;
	} else {
		/* We're in host context. */
		if (mmu == host_s2_mmu)
			return;

		cxt->mmu = host_s2_mmu;
	}

So I take it that no vcpu is running at this point, and that we have
done a vcpu_put().

Is there an actual path within pKVM that causes a guest TLBI to be
followed by a host __kvm_tlb_flush_vmid() *without* a CSE? I can't
convince myself that such a path exists in the current upstream code
base.

> 	tlbi vmalls12e1is	// !!! Can use the stale VMID as we
> 				// haven't performed context
> 				// synchronisation since restoring
> 				// VTTBR.VMID
> 
> Add an unconditional ISB instruction to exit_vmid_context() after
> restoring the VTTBR. This already existed for the
> ARM64_WORKAROUND_SPECULATIVE_AT path, so we can simply hoist that onto
> the common path.
> 
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Oliver Upton <oliver.upton@linux.dev>
> Cc: Fuad Tabba <tabba@google.com>
> Fixes: 58f3b0fc3b87 ("KVM: arm64: Support TLB invalidation in guest context")
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/kvm/hyp/nvhe/tlb.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
> index ca3c09df8d7c..48da9ca9763f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/tlb.c
> +++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
> @@ -132,10 +132,10 @@ static void exit_vmid_context(struct tlb_inv_context *cxt)
>  	else
>  		__load_host_stage2();
>  
> -	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
> -		/* Ensure write of the old VMID */
> -		isb();
> +	/* Ensure write of the old VMID */
> +	isb();
>  
> +	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
>  		if (!(cxt->sctlr & SCTLR_ELx_M)) {
>  			write_sysreg_el1(cxt->sctlr, SYS_SCTLR);
>  			isb();

The patch itself makes sense to me. I'd just like to understand how we
end up in that situation.

Thanks,

	M.
Will Deacon Aug. 15, 2024, 12:08 p.m. UTC | #2
Hey Marc,

Thanks for having a look.

On Wed, Aug 14, 2024 at 02:30:53PM +0100, Marc Zyngier wrote:
> On Wed, 14 Aug 2024 13:34:29 +0100,
> Will Deacon <will@kernel.org> wrote:
> > 
> > When the target context passed to enter_vmid_context() matches the
> > current running context, the function returns early without manipulating
> > the registers of the stage-2 MMU. This can result in a stale VMID due to
> > the lack of an ISB instruction in exit_vmid_context() after writing the
> 
> This refers to the exit_vmid_context() *before* a subsequent
> enter_vmid_context(), right?

Yes!

> > VTTBR when ARM64_WORKAROUND_SPECULATIVE_AT is not enabled.
> 
> I'm not sure I completely get the failure mode, so please bear with me.

Of course, it's confused me a tonne of times and writing the commit
message was a real pain.

> > For example, with pKVM enabled:
> > 
> > 	// Initially running in host context
> > 	enter_vmid_context(guest);
> > 		-> __load_stage2(guest); isb	// Writes VTCR & VTTBR
> > 	exit_vmid_context(guest);
> > 		-> __load_stage2(host);		// Restores VTCR & VTTBR
> > 
> > 	enter_vmid_context(host);
> > 		-> Returns early as we're already in host context
> 
> Is this the code you are referring to?
> 
> 	vcpu = host_ctxt->__hyp_running_vcpu;
> 
> 	[...]
> 
> 	if (vcpu) {
> 		/*
> 		 * We're in guest context. However, for this to work, this needs
> 		 * to be called from within __kvm_vcpu_run(), which ensures that
> 		 * __hyp_running_vcpu is set to the current guest vcpu.
> 		 */
> 		if (mmu == vcpu->arch.hw_mmu || WARN_ON(mmu != host_s2_mmu))
> 			return;
> 
> 		cxt->mmu = vcpu->arch.hw_mmu;
> 	} else {
> 		/* We're in host context. */
> 		if (mmu == host_s2_mmu)
> 			return;
> 
> 		cxt->mmu = host_s2_mmu;
> 	}

Yes, this is the "early return" where we don't bother touching the MMU
because we're already in the correct context.

> So I take it that no vcpu is running at this point, and that we have
> done a vcpu_put().
> 
> Is there an actual path within pKVM that causes a guest TLBI to be
> followed by a host __kvm_tlb_flush_vmid() *without* a CSE? I can't
> convince myself that such a path exists in the current upstream code
> base.

I think you're right that this can't happen upstream. We see the issue
in Android when reclaiming pages from a guest during teardown. That
amounts to unmapping pages from the guest, poisoning them and mapping
them back into the host. Mapping them into the host can then trigger
table -> block conversion and the associated TLB invalidation wasn't
effective because it was still using the guest VMID.

We can carry this patch in Android if you prefer, but given that
{enter,exit}_vmid_context() are upstream, it would be nice to land the
fix so that we don't run into this bug again in future (it took some
debugging!).

Cheers,

Will
Marc Zyngier Aug. 15, 2024, 12:31 p.m. UTC | #3
On Thu, 15 Aug 2024 13:08:03 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> > Is there an actual path within pKVM that causes a guest TLBI to be
> > followed by a host __kvm_tlb_flush_vmid() *without* a CSE? I can't
> > convince myself that such a path exists in the current upstream code
> > base.
> 
> I think you're right that this can't happen upstream. We see the issue
> in Android when reclaiming pages from a guest during teardown. That
> amounts to unmapping pages from the guest, poisoning them and mapping
> them back into the host. Mapping them into the host can then trigger
> table -> block conversion and the associated TLB invalidation wasn't
> effective because it was still using the guest VMID.
> 
> We can carry this patch in Android if you prefer, but given that
> {enter,exit}_vmid_context() are upstream, it would be nice to land the
> fix so that we don't run into this bug again in future (it took some
> debugging!).

I think it is definitely worth addressing, and given that this is nVHE
only, an extra CSE isn't going to show on the radar.

The question is more whether this is 6.11 or 6.12 material. If that's
not an immediate fix for upstream, I'm tempted to queue it for 6.12.

Does this work for you?

	M.
Will Deacon Aug. 15, 2024, 12:38 p.m. UTC | #4
On Thu, Aug 15, 2024 at 01:31:54PM +0100, Marc Zyngier wrote:
> On Thu, 15 Aug 2024 13:08:03 +0100,
> Will Deacon <will@kernel.org> wrote:
> > 
> > > Is there an actual path within pKVM that causes a guest TLBI to be
> > > followed by a host __kvm_tlb_flush_vmid() *without* a CSE? I can't
> > > convinced myself that such a path exist in the current upstream code
> > > base.
> > 
> > I think you're right that this can't happen upstream. We see the issue
> > in Android when reclaiming pages from a guest during teardown. That
> > amounts to unmapping pages from the guest, poisoning them and mapping
> > them back into the host. Mapping them into the host can then trigger
> > table -> block conversion and the associated TLB invalidation wasn't
> > effective because it was still using the guest VMID.
> > 
> > We can carry this patch in Android if you prefer, but given that
> > {enter,exit}_vmid_context() are upstream, it would be nice to land the
> > fix so that we don't run into this bug again in future (it took some
> > debugging!).
> 
> I think it is definitely worth addressing, and given that this is nVHE
> only, an extra CSE isn't going to show on the radar.
> 
> The question is more whether this is 6.11 or 6.12 material. If that's
> not an immediate fix for upstream, I'm tempted to queue it for 6.12.
> 
> Does this work for you?

6.12 is absolutely fine, thank you!

Will

Patch

diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index ca3c09df8d7c..48da9ca9763f 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -132,10 +132,10 @@  static void exit_vmid_context(struct tlb_inv_context *cxt)
 	else
 		__load_host_stage2();
 
-	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
-		/* Ensure write of the old VMID */
-		isb();
+	/* Ensure write of the old VMID */
+	isb();
 
+	if (cpus_have_final_cap(ARM64_WORKAROUND_SPECULATIVE_AT)) {
 		if (!(cxt->sctlr & SCTLR_ELx_M)) {
 			write_sysreg_el1(cxt->sctlr, SYS_SCTLR);
 			isb();