diff mbox series

KVM: arm64: VHE: Initialize PMSCR_EL1

Message ID 20241106122654.38234-1-alexandru.elisei@arm.com (mailing list archive)
State New
Headers show
Series KVM: arm64: VHE: Initialize PMSCR_EL1 | expand

Commit Message

Alexandru Elisei Nov. 6, 2024, 12:26 p.m. UTC
According to the pseudocode for StatisticalProfilingEnabled() from Arm
DDI0487K.a, PMSCR_EL1 controls profiling at EL1 and EL0:

- PMSCR_EL1.E1SPE controls profiling at EL1.
- PMSCR_EL1.E0SPE controls profiling at EL0 if HCR_EL2.TGE=0. KVM always
  clears HCR_EL2.TGE when running a VM.

When profiling is enabled in the host, and the host is running in nVHE mode
(HCR_EL2.E2H=0), KVM clears PMSCR_EL1.{E1SPE,E0SPE} before jumping into the
guest.

When profiling is enabled in the host, and the host is running at EL2
(HCR_EL2.E2H=1), KVM will not touch PMSCR_EL1.{E1SPE,E0SPE} before jumping
into the guest. PMSCR_EL1.{E1SPE,E0SPE} reset to an architecturally UNKNOWN
value, which means it might be possible that KVM unintentionally profiles
the guest when is running in VHE mode.

Clear PMSCR_EL1.{E1SPE,E0SPE} when setting up VHE mode to keep the
behaviour consistent and predictable.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---

Tested on the model, by setting the PMSCR_EL1.E1SPE and E0SPE bits in
__init_el2_debug to simulate a system where they reset to 1. Without the
patch, when the host is running at EL2, and the user is profiling the
kvmtool process, I can see records taken at EL1:

# perf record -e arm_spe// -- ./lkvm-static run -c2 -m512 -k Image -d disk -p earlycon

With this patch, those records disappear; and the size of perf.data has
been more than halved.

 arch/arm64/kernel/hyp-stub.S | 2 ++
 1 file changed, 2 insertions(+)

Comments

Marc Zyngier Nov. 6, 2024, 1:51 p.m. UTC | #1
On Wed, 06 Nov 2024 12:26:54 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> According to the pseudocode for StatisticalProfilingEnabled() from Arm
> DDI0487K.a, PMSCR_EL1 controls profiling at EL1 and EL0:
> 
> - PMSCR_EL1.E1SPE controls profiling at EL1.
> - PMSCR_EL1.E0SPE controls profiling at EL0 if HCR_EL2.TGE=0. KVM always
>   clears HCR_EL2.TGE when running a VM.
> 
> When profiling is enabled in the host, and the host is running in nVHE mode
> (HCR_EL2.E2H=0), KVM clears PMSCR_EL1.{E1SPE,E0SPE} before jumping into the
> guest.
> 
> When profiling is enabled in the host, and the host is running at EL2
> (HCR_EL2.E2H=1), KVM will not touch PMSCR_EL1.{E1SPE,E0SPE} before jumping
> into the guest. PMSCR_EL1.{E1SPE,E0SPE} reset to an architecturally UNKNOWN
> value, which means it might be possible that KVM unintentionally profiles
> the guest when is running in VHE mode.
> 
> Clear PMSCR_EL1.{E1SPE,E0SPE} when setting up VHE mode to keep the
> behaviour consistent and predictable.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
> 
> Tested on the model, by setting the PMSCR_EL1.E1SPE and E0SPE bits in
> __init_el2_debug to simulate a system where they reset to 1. Without the
> patch, when the host is running at EL2, and the user is profiling the
> kvmtool process, I can see records taken at EL1:
> 
> # perf record -e arm_spe// -- ./lkvm-static run -c2 -m512 -k Image -d disk -p earlycon
> 
> With this patch, those records disappear; and the size of perf.data has
> been more than halved.
> 
>  arch/arm64/kernel/hyp-stub.S | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> index 65f76064c86b..df63f329d400 100644
> --- a/arch/arm64/kernel/hyp-stub.S
> +++ b/arch/arm64/kernel/hyp-stub.S
> @@ -117,6 +117,8 @@ SYM_CODE_START_LOCAL(__finalise_el2)
>  	bic	x0, x0, #(MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT)
>  	bic	x0, x0, #(MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT)
>  	msr	mdcr_el2, x0
> +	// Disable profiling when running a virtual machine
> +	msr_s	SYS_PMSCR_EL12, xzr

... resulting in an early crash on anything that doesn't have SPE.
That's indeed "consistent and predictable" :-).

>  
>  	// Transfer the MM state from EL1 to EL2
>  	mrs_s	x0, SYS_TCR_EL12

I find it pretty odd to hide something that is squarely guest state in
the hyp stubs, and I'd rather see something like this (untested):

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 48cafb65d6acf..806f25a8753ed 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2139,8 +2139,12 @@ static void cpu_hyp_init_features(void)
 	cpu_set_hyp_vector();
 	kvm_arm_init_debug();
 
-	if (is_kernel_in_hyp_mode())
+	if (is_kernel_in_hyp_mode()) {
+		if (SYS_FIELD_GET(ID_AA64DFR0_EL1, PMSVer,
+				  read_sysreg(id_aa64dfr0_el1)))
+			write_sysreg_el1(0, SYS_PMSCR);
 		kvm_timer_init_vhe();
+	}
 
 	if (vgic_present)
 		kvm_vgic_init_cpu_hardware();

Thanks,

	M.
Alexandru Elisei Nov. 7, 2024, 12:07 p.m. UTC | #2
Hi Marc,

On Wed, Nov 06, 2024 at 01:51:19PM +0000, Marc Zyngier wrote:
> On Wed, 06 Nov 2024 12:26:54 +0000,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > According to the pseudocode for StatisticalProfilingEnabled() from Arm
> > DDI0487K.a, PMSCR_EL1 controls profiling at EL1 and EL0:
> > 
> > - PMSCR_EL1.E1SPE controls profiling at EL1.
> > - PMSCR_EL1.E0SPE controls profiling at EL0 if HCR_EL2.TGE=0. KVM always
> >   clears HCR_EL2.TGE when running a VM.
> > 
> > When profiling is enabled in the host, and the host is running in nVHE mode
> > (HCR_EL2.E2H=0), KVM clears PMSCR_EL1.{E1SPE,E0SPE} before jumping into the
> > guest.
> > 
> > When profiling is enabled in the host, and the host is running at EL2
> > (HCR_EL2.E2H=1), KVM will not touch PMSCR_EL1.{E1SPE,E0SPE} before jumping
> > into the guest. PMSCR_EL1.{E1SPE,E0SPE} reset to an architecturally UNKNOWN
> > value, which means it might be possible that KVM unintentionally profiles
> > the guest when is running in VHE mode.
> > 
> > Clear PMSCR_EL1.{E1SPE,E0SPE} when setting up VHE mode to keep the
> > behaviour consistent and predictable.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> > 
> > Tested on the model, by setting the PMSCR_EL1.E1SPE and E0SPE bits in
> > __init_el2_debug to simulate a system where they reset to 1. Without the
> > patch, when the host is running at EL2, and the user is profiling the
> > kvmtool process, I can see records taken at EL1:
> > 
> > # perf record -e arm_spe// -- ./lkvm-static run -c2 -m512 -k Image -d disk -p earlycon
> > 
> > With this patch, those records disappear; and the size of perf.data has
> > been more than halved.
> > 
> >  arch/arm64/kernel/hyp-stub.S | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> > index 65f76064c86b..df63f329d400 100644
> > --- a/arch/arm64/kernel/hyp-stub.S
> > +++ b/arch/arm64/kernel/hyp-stub.S
> > @@ -117,6 +117,8 @@ SYM_CODE_START_LOCAL(__finalise_el2)
> >  	bic	x0, x0, #(MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT)
> >  	bic	x0, x0, #(MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT)
> >  	msr	mdcr_el2, x0
> > +	// Disable profiling when running a virtual machine
> > +	msr_s	SYS_PMSCR_EL12, xzr
> 
> ... resulting in an early crash on anything that doesn't have SPE.
> That's indeed "consistent and predictable" :-).

Yes, that's a double fail on my part: I just assumed that __finalise_el2
checks for FEAT_SPE before fiddling with MDCR_EL2.E2PB, like init_el2_state
does; and I didn't test with FEAT_SPE not present.

> 
> >  
> >  	// Transfer the MM state from EL1 to EL2
> >  	mrs_s	x0, SYS_TCR_EL12
> 
> I find it pretty odd to hide something that is squarely guest state in
> the hyp stubs, and I'd rather see something like this (untested):

Sure.

> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 48cafb65d6acf..806f25a8753ed 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -2139,8 +2139,12 @@ static void cpu_hyp_init_features(void)
>  	cpu_set_hyp_vector();
>  	kvm_arm_init_debug();
>  
> -	if (is_kernel_in_hyp_mode())
> +	if (is_kernel_in_hyp_mode()) {
> +		if (SYS_FIELD_GET(ID_AA64DFR0_EL1, PMSVer,
> +				  read_sysreg(id_aa64dfr0_el1)))
> +			write_sysreg_el1(0, SYS_PMSCR);
>  		kvm_timer_init_vhe();
> +	}

Do you think this is an improvement (looks like a pretty big diff, but it's
mostly refactoring, the actual change is in kvm_arm_init_debug()):

diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index ce8886122ed3..21b260b02216 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -65,12 +65,30 @@ static void restore_guest_debug_regs(struct kvm_vcpu *vcpu)
                *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
 }
 
+static bool cpu_has_spe(void)
+{
+       return cpuid_feature_extract_unsigned_field(read_sysreg(id_aa64dfr0_el1),
+                                                   ID_AA64DFR0_EL1_PMSVer);
+}
+
+static bool cpu_has_trbe(void)
+{
+       return cpuid_feature_extract_unsigned_field(read_sysreg(id_aa64dfr0_el1),
+                                                   ID_AA64DFR0_EL1_TraceBuffer);
+}
+
 /**
  * kvm_arm_init_debug - grab what we need for debug
  *
- * Currently the sole task of this function is to retrieve the initial
- * value of mdcr_el2 so we can preserve MDCR_EL2.HPMN which has
- * presumably been set-up by some knowledgeable bootcode.
+ * This function does two things:
+ *
+ * 1. Retrieve the initial value of mdcr_el2 so we can preserve
+ * MDCR_EL2.HPMN which has presumably been set-up by some knowledgeable
+ * bootcode.
+ *
+ * 2. Clear PMSCR_EL1.E1SPE and E0SPE when the host is running at EL2. The
+ * bits reset to an unknown value, and clearing them prevents the host from
+ * accidently profiling a virtual machine.
  *
  * It is called once per-cpu during CPU hyp initialisation.
  */
@@ -78,6 +96,9 @@ static void restore_guest_debug_regs(struct kvm_vcpu *vcpu)
 void kvm_arm_init_debug(void)
 {
        __this_cpu_write(mdcr_el2, kvm_call_hyp_ret(__kvm_get_mdcr_el2));
+
+       if (is_kernel_in_hyp_mode() && cpu_has_spe())
+               write_sysreg_el1(0, SYS_PMSCR);
 }
 
 /**
@@ -317,23 +338,20 @@ void kvm_arm_clear_debug(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
 {
-       u64 dfr0;
-
        /* For VHE, there is nothing to do */
        if (has_vhe())
                return;
 
-       dfr0 = read_sysreg(id_aa64dfr0_el1);
        /*
         * If SPE is present on this CPU and is available at current EL,
         * we may need to check if the host state needs to be saved.
         */
-       if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_PMSVer_SHIFT) &&
+       if (cpu_has_spe() &&
            !(read_sysreg_s(SYS_PMBIDR_EL1) & BIT(PMBIDR_EL1_P_SHIFT)))
                vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_SPE);
 
        /* Check if we have TRBE implemented and available at the host */
-       if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_TraceBuffer_SHIFT) &&
+       if (cpu_has_trbe() &&
            !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
                vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
 }


Two questions:

1. As far as I can tell, KVM uses at least two functions for extracting a
field from an ID register: the ones above, which take a _SHIFT argument for
the field position, and the SYS_FIELD_GET ones, which take a mask argument.
Are they equivalent, is one is preferred over the other, or they have
different use cases?

2. has_vhe() vs is_kernel_in_hyp_mode(). I couldn't find any documentation
when to use one over the other. Looks to me like has_vhe() is faster
because uses cpu caps.

And one interesting find: when booting v6.12-rc6 (no patches on top) with
kvm-arm.mode=protected, and when profiling the kvmtool process, I see
unexpected buffer faults:

[    0.762373] kvm [1]: Protected hVHE mode initialized successfully
..
[   84.716647] arm_spe_pmu: Unexpected buffer fault on CPU 3 [PMBSR=0x0000000094020007, PMBPTR=0xffff800088804738, PMBLIMITR=0xffff800088a03001]

Same messages with the patches applied.

I'll try to investigate further, but I don't have the time until the end of
next week.

Thanks,
Alex
Marc Zyngier Nov. 7, 2024, 5:59 p.m. UTC | #3
On Thu, 07 Nov 2024 12:07:40 +0000,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Do you think this is an improvement (looks like a pretty big diff, but it's
> mostly refactoring, the actual change is in kvm_arm_init_debug()):
> 
> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
> index ce8886122ed3..21b260b02216 100644
> --- a/arch/arm64/kvm/debug.c
> +++ b/arch/arm64/kvm/debug.c
> @@ -65,12 +65,30 @@ static void restore_guest_debug_regs(struct kvm_vcpu *vcpu)
>                 *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;
>  }
>  
> +static bool cpu_has_spe(void)
> +{
> +       return cpuid_feature_extract_unsigned_field(read_sysreg(id_aa64dfr0_el1),
> +                                                   ID_AA64DFR0_EL1_PMSVer);
> +}
> +
> +static bool cpu_has_trbe(void)
> +{
> +       return cpuid_feature_extract_unsigned_field(read_sysreg(id_aa64dfr0_el1),
> +                                                   ID_AA64DFR0_EL1_TraceBuffer);
> +}
> +
>  /**
>   * kvm_arm_init_debug - grab what we need for debug
>   *
> - * Currently the sole task of this function is to retrieve the initial
> - * value of mdcr_el2 so we can preserve MDCR_EL2.HPMN which has
> - * presumably been set-up by some knowledgeable bootcode.
> + * This function does two things:
> + *
> + * 1. Retrieve the initial value of mdcr_el2 so we can preserve
> + * MDCR_EL2.HPMN which has presumably been set-up by some knowledgeable
> + * bootcode.
> + *
> + * 2. Clear PMSCR_EL1.E1SPE and E0SPE when the host is running at EL2. The
> + * bits reset to an unknown value, and clearing them prevents the host from
> + * accidently profiling a virtual machine.
>   *
>   * It is called once per-cpu during CPU hyp initialisation.
>   */
> @@ -78,6 +96,9 @@ static void restore_guest_debug_regs(struct kvm_vcpu *vcpu)
>  void kvm_arm_init_debug(void)
>  {
>         __this_cpu_write(mdcr_el2, kvm_call_hyp_ret(__kvm_get_mdcr_el2));
> +
> +       if (is_kernel_in_hyp_mode() && cpu_has_spe())
> +               write_sysreg_el1(0, SYS_PMSCR);
>  }
>  
>  /**
> @@ -317,23 +338,20 @@ void kvm_arm_clear_debug(struct kvm_vcpu *vcpu)
>  
>  void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
>  {
> -       u64 dfr0;
> -
>         /* For VHE, there is nothing to do */
>         if (has_vhe())
>                 return;
>  
> -       dfr0 = read_sysreg(id_aa64dfr0_el1);
>         /*
>          * If SPE is present on this CPU and is available at current EL,
>          * we may need to check if the host state needs to be saved.
>          */
> -       if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_PMSVer_SHIFT) &&
> +       if (cpu_has_spe() &&
>             !(read_sysreg_s(SYS_PMBIDR_EL1) & BIT(PMBIDR_EL1_P_SHIFT)))
>                 vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_SPE);
>  
>         /* Check if we have TRBE implemented and available at the host */
> -       if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_TraceBuffer_SHIFT) &&
> +       if (cpu_has_trbe() &&
>             !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
>                 vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
>  }

Sure, that's a reasonable start. Oliver is busy putting a stick of
dynamite in this code, but I'm sure he could work with something like
this.

> Two questions:
> 
> 1. As far as I can tell, KVM uses at least two functions for extracting a
> field from an ID register: the ones above, which take a _SHIFT argument for
> the field position, and the SYS_FIELD_GET ones, which take a mask argument.
> Are they equivalent, is one is preferred over the other, or they have
> different use cases?

The least verbose, the better. As for their equivalence, you should
check that.

> 2. has_vhe() vs is_kernel_in_hyp_mode(). I couldn't find any documentation
> when to use one over the other. Looks to me like has_vhe() is faster
> because uses cpu caps.

They don't mean the same thing. One is a capability, the other tells
you where you run.

> And one interesting find: when booting v6.12-rc6 (no patches on top) with
> kvm-arm.mode=protected, and when profiling the kvmtool process, I see
> unexpected buffer faults:
> 
> [    0.762373] kvm [1]: Protected hVHE mode initialized successfully
> ..
> [   84.716647] arm_spe_pmu: Unexpected buffer fault on CPU 3 [PMBSR=0x0000000094020007, PMBPTR=0xffff800088804738, PMBLIMITR=0xffff800088a03001]
> 
> Same messages with the patches applied.

No idea. I don't think anyone ever tried SPE with pKVM, and I don't
have an SPE-capable box at hand.

	M.
diff mbox series

Patch

diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
index 65f76064c86b..df63f329d400 100644
--- a/arch/arm64/kernel/hyp-stub.S
+++ b/arch/arm64/kernel/hyp-stub.S
@@ -117,6 +117,8 @@  SYM_CODE_START_LOCAL(__finalise_el2)
 	bic	x0, x0, #(MDCR_EL2_E2PB_MASK << MDCR_EL2_E2PB_SHIFT)
 	bic	x0, x0, #(MDCR_EL2_E2TB_MASK << MDCR_EL2_E2TB_SHIFT)
 	msr	mdcr_el2, x0
+	// Disable profiling when running a virtual machine
+	msr_s	SYS_PMSCR_EL12, xzr
 
 	// Transfer the MM state from EL1 to EL2
 	mrs_s	x0, SYS_TCR_EL12