diff mbox series

[10/11] KVM: arm64: nv: Honor guest hypervisor's FP/SVE traps in CPTR_EL2

Message ID 20240531231358.1000039-11-oliver.upton@linux.dev (mailing list archive)
State New, archived
Headers show
Series KVM: arm64: nv: FPSIMD/SVE support | expand

Commit Message

Oliver Upton May 31, 2024, 11:13 p.m. UTC
Start folding the guest hypervisor's FP/SVE traps into the value
programmed in hardware. Note that as of writing this is dead code, since
KVM does a full put() / load() for every nested exception boundary which
saves + flushes the FP/SVE state.

However, this will become useful when we can keep the guest's FP/SVE
state alive across a nested exception boundary and the host no longer
needs to conservatively program traps.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/kvm/hyp/vhe/switch.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Marc Zyngier June 3, 2024, 12:36 p.m. UTC | #1
On Sat, 01 Jun 2024 00:13:57 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Start folding the guest hypervisor's FP/SVE traps into the value
> programmed in hardware. Note that as of writing this is dead code, since
> KVM does a full put() / load() for every nested exception boundary which
> saves + flushes the FP/SVE state.
> 
> However, this will become useful when we can keep the guest's FP/SVE
> state alive across a nested exception boundary and the host no longer
> needs to conservatively program traps.
> 
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> ---
>  arch/arm64/kvm/hyp/vhe/switch.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
> index 697253673d7b..d07b4f4be5e5 100644
> --- a/arch/arm64/kvm/hyp/vhe/switch.c
> +++ b/arch/arm64/kvm/hyp/vhe/switch.c
> @@ -85,6 +85,19 @@ static void __activate_cptr_traps(struct kvm_vcpu *vcpu)
>  		__activate_traps_fpsimd32(vcpu);
>  	}
>  
> +	/*
> +	 * Layer the guest hypervisor's trap configuration on top of our own if
> +	 * we're in a nested context.
> +	 */
> +	if (!vcpu_has_nv(vcpu) || is_hyp_ctxt(vcpu))
> +		goto write;
> +
> +	if (guest_hyp_fpsimd_traps_enabled(vcpu))
> +		val &= ~CPACR_ELx_FPEN;
> +	if (guest_hyp_sve_traps_enabled(vcpu))
> +		val &= ~CPACR_ELx_ZEN;

I'm afraid this isn't quite right. You are clearing both FPEN (resp
ZEN) bits based on any of the two bits being clear, while what we want
is to actually propagate the 0 bits (and only those).

What I have in my tree is something along the lines of:

	cptr = vcpu_sanitised_cptr_el2(vcpu);
	tmp = cptr & (CPACR_ELx_ZEN_MASK | CPACR_ELx_FPEN_MASK);
	val &= ~(tmp ^ (CPACR_ELx_ZEN_MASK | CPACR_ELx_FPEN_MASK));

which makes sure that we only clear the relevant bits.

Thanks,

	M.
Oliver Upton June 3, 2024, 5:28 p.m. UTC | #2
Hey,

On Mon, Jun 03, 2024 at 01:36:54PM +0100, Marc Zyngier wrote:

[...]

> > +	/*
> > +	 * Layer the guest hypervisor's trap configuration on top of our own if
> > +	 * we're in a nested context.
> > +	 */
> > +	if (!vcpu_has_nv(vcpu) || is_hyp_ctxt(vcpu))
> > +		goto write;
> > +
> > +	if (guest_hyp_fpsimd_traps_enabled(vcpu))
> > +		val &= ~CPACR_ELx_FPEN;
> > +	if (guest_hyp_sve_traps_enabled(vcpu))
> > +		val &= ~CPACR_ELx_ZEN;
> 
> I'm afraid this isn't quite right. You are clearing both FPEN (resp
> ZEN) bits based on any of the two bits being clear, while what we want
> is to actually propagate the 0 bits (and only those).

An earlier version of the series I had was effectively doing this,
applying the L0 trap configuration on top of L1's CPTR_EL2. Unless I'm
missing something terribly obvious, I think this is still correct, as:

 - If we're in a hyp context, vEL2's CPTR_EL2 is loaded into CPACR_EL1.
   The independent EL0/EL1 enable bits are handled by hardware. All this
   junk gets skipped and we go directly to writing CPTR_EL2.

 - If we are not in a hyp context, vEL2's CPTR_EL2 gets folded into the
   hardware value for CPTR_EL2. TGE must be 0 in this case, so there is
   no conditional trap based on what EL the vCPU is in. There's only two
   functional trap states at this point, hence the all-or-nothing
   approach.

> What I have in my tree is something along the lines of:
> 
> 	cptr = vcpu_sanitised_cptr_el2(vcpu);
> 	tmp = cptr & (CPACR_ELx_ZEN_MASK | CPACR_ELx_FPEN_MASK);
> 	val &= ~(tmp ^ (CPACR_ELx_ZEN_MASK | CPACR_ELx_FPEN_MASK));

My hesitation with this is it gives the impression that both trap bits
are significant, but in reality only the LSB is useful. Unless my
understanding is disastrously wrong, of course :)

Anyway, my _slight_ preference is towards keeping what I have if
possible, with a giant comment explaining the reasoning behind it. But I
can take your approach instead too.
Marc Zyngier June 4, 2024, 11:14 a.m. UTC | #3
On Mon, 03 Jun 2024 18:28:56 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Hey,
> 
> On Mon, Jun 03, 2024 at 01:36:54PM +0100, Marc Zyngier wrote:
> 
> [...]
> 
> > > +	/*
> > > +	 * Layer the guest hypervisor's trap configuration on top of our own if
> > > +	 * we're in a nested context.
> > > +	 */
> > > +	if (!vcpu_has_nv(vcpu) || is_hyp_ctxt(vcpu))
> > > +		goto write;
> > > +
> > > +	if (guest_hyp_fpsimd_traps_enabled(vcpu))
> > > +		val &= ~CPACR_ELx_FPEN;
> > > +	if (guest_hyp_sve_traps_enabled(vcpu))
> > > +		val &= ~CPACR_ELx_ZEN;
> > 
> > I'm afraid this isn't quite right. You are clearing both FPEN (resp
> > ZEN) bits based on any of the two bits being clear, while what we want
> > is to actually propagate the 0 bits (and only those).
> 
> An earlier version of the series I had was effectively doing this,
> applying the L0 trap configuration on top of L1's CPTR_EL2. Unless I'm
> missing something terribly obvious, I think this is still correct, as:
> 
>  - If we're in a hyp context, vEL2's CPTR_EL2 is loaded into CPACR_EL1.
>    The independent EL0/EL1 enable bits are handled by hardware. All this
>    junk gets skipped and we go directly to writing CPTR_EL2.

Yup.

> 
>  - If we are not in a hyp context, vEL2's CPTR_EL2 gets folded into the
>    hardware value for CPTR_EL2. TGE must be 0 in this case, so there is
>    no conditional trap based on what EL the vCPU is in. There's only two
>    functional trap states at this point, hence the all-or-nothing
>    approach.

Ah, I see it now. Only bit[0] of each 2-bit field matters in that
case. This thing is giving me a headache.

> 
> > What I have in my tree is something along the lines of:
> > 
> > 	cptr = vcpu_sanitised_cptr_el2(vcpu);
> > 	tmp = cptr & (CPACR_ELx_ZEN_MASK | CPACR_ELx_FPEN_MASK);
> > 	val &= ~(tmp ^ (CPACR_ELx_ZEN_MASK | CPACR_ELx_FPEN_MASK));
> 
> My hesitation with this is it gives the impression that both trap bits
> are significant, but in reality only the LSB is useful. Unless my
> understanding is disastrously wrong, of course :)

No, you are absolutely right. Although you *are* clearing both bits
anyway ;-).

> 
> Anyway, my _slight_ preference is towards keeping what I have if
> possible, with a giant comment explaining the reasoning behind it. But I
> can take your approach instead too.

I think the only arguments for my own solution are:

- slightly better codegen (no function call or inlining), and a
  smaller .text section in switch.o, because the helpers are not
  cheap:

  LLVM:

	0 .text         00003ef8 (guest_hyp_*_traps_enabled)
	0 .text         00003d48 (bit ops)

  GCC:
	0 .text         00002624 (guest_hyp_*_traps_enabled)
	0 .text         000024b4 (bit ops)


  Yes, LLVM is an absolute pig because of BTI...

- tracking the guest's bits more precisely may make it easier to debug

but these are pretty weak arguments, and I don't really care either
way at this precise moment.

Thanks,

	M.
Oliver Upton June 4, 2024, 5:44 p.m. UTC | #4
On Tue, Jun 04, 2024 at 12:14:42PM +0100, Marc Zyngier wrote:
> On Mon, 03 Jun 2024 18:28:56 +0100, Oliver Upton <oliver.upton@linux.dev> wrote:
> > Anyway, my _slight_ preference is towards keeping what I have if
> > possible, with a giant comment explaining the reasoning behind it. But I
> > can take your approach instead too.
> 
> I think the only arguments for my own solution are:
> 
> - slightly better codegen (no function call or inlining), and a
>   smaller .text section in switch.o, because the helpers are not
>   cheap:
> 
>   LLVM:
> 
> 	0 .text         00003ef8 (guest_hyp_*_traps_enabled)
> 	0 .text         00003d48 (bit ops)
> 
>   GCC:
> 	0 .text         00002624 (guest_hyp_*_traps_enabled)
> 	0 .text         000024b4 (bit ops)
> 

Oh, that's spectacular :-)

>   Yes, LLVM is an absolute pig because of BTI...
> 
> - tracking the guest's bits more precisely may make it easier to debug
> 
> but these are pretty weak arguments, and I don't really care either
> way at this precise moment.

Yeah, so I think the right direction here is to combine our approaches,
and do direct bit manipulation, but only on bit[0]. That way we still
have an opportunity to document the very intentional simplification of
trap state too.
diff mbox series

Patch

diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 697253673d7b..d07b4f4be5e5 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -85,6 +85,19 @@  static void __activate_cptr_traps(struct kvm_vcpu *vcpu)
 		__activate_traps_fpsimd32(vcpu);
 	}
 
+	/*
+	 * Layer the guest hypervisor's trap configuration on top of our own if
+	 * we're in a nested context.
+	 */
+	if (!vcpu_has_nv(vcpu) || is_hyp_ctxt(vcpu))
+		goto write;
+
+	if (guest_hyp_fpsimd_traps_enabled(vcpu))
+		val &= ~CPACR_ELx_FPEN;
+	if (guest_hyp_sve_traps_enabled(vcpu))
+		val &= ~CPACR_ELx_ZEN;
+
+write:
 	write_sysreg(val, cpacr_el1);
 }