diff mbox series

arm64/fpsimd: Ensure we don't contend an SMCU from idling CPUs

Message ID 20240618-arm64-sme-no-cpuidle-v1-1-1de872e1691f@kernel.org (mailing list archive)
State New, archived
Series arm64/fpsimd: Ensure we don't contend an SMCU from idling CPUs

Commit Message

Mark Brown June 18, 2024, 1:57 p.m. UTC
When we enter the kernel we currently don't update any of the floating
point register state until either something else uses floating point or we
get a CPU_PM_ENTER notification during suspend or cpuidle. This means that
for a system which has been configured with both suspend and cpuidle
disabled we will leave whatever floating point state was last loaded in
the registers while a CPU is idling.

For SME this may cause an idling CPU to interfere with the operation of
other CPUs, since SME may be implemented as an SMCU shared between multiple
CPUs. Leaving ZA or especially streaming mode enabled may be taken by
the hardware as an indication that SME is in active use by the CPU and cause
resources to be allocated to it at the expense of other CPUs.

Add an arch_cpu_idle_enter() implementation which disables SME if it is
active when we idle the CPU, ensuring that we don't create any spurious
contention even if cpuidle is not enabled.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/include/asm/fpsimd.h |  2 ++
 arch/arm64/kernel/fpsimd.c      | 14 ++++++++++++++
 arch/arm64/kernel/process.c     |  5 +++++
 3 files changed, 21 insertions(+)


---
base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
change-id: 20240617-arm64-sme-no-cpuidle-8c68ef86e370

Best regards,

Comments

Catalin Marinas July 10, 2024, 8:10 p.m. UTC | #1
+ Mark R, he commented on the other patch.

On Tue, Jun 18, 2024 at 02:57:42PM +0100, Mark Brown wrote:
> When we enter the kernel we currently don't update any of the floating
> point register state until either something else uses floating point or we
> get a CPU_PM_ENTER notification during suspend or cpuidle. This means that
> for a system which has been configured with both suspend and cpuidle
> disabled we will leave whatever floating point state was last loaded in
> the registers while a CPU is idling.
> 
> For SME this may cause an idling CPU to interfere with the operation of
> other CPUs, since SME may be implemented as an SMCU shared between multiple
> CPUs. Leaving ZA or especially streaming mode enabled may be taken by
> the hardware as an indication that SME is in active use by the CPU and cause
> resources to be allocated to it at the expense of other CPUs.
> 
> Add an arch_cpu_idle_enter() implementation which disables SME if it is
> active when we idle the CPU, ensuring that we don't create any spurious
> contention even if cpuidle is not enabled.

I guess this approach is useful when the kernel does a light WFI rather
than going all the way to firmware for a deep sleep state. In general,
the shallower the sleep state is, the more CPU state is retained. From a
power perspective, I wonder whether we should leave the decision to drop
the SME state to a cpuidle driver.

If SME is implemented as a shared unit, doing an SMSTOP could be a
useful indication that other tasks can use it.

Which situations should we consider for such an idle scenario (we discussed
them offline briefly)? I think:

1. Thread migration: a thread using SME is moved to a different CPU.
   Here SMSTOP makes sense because a new thread scheduled eventually
   will need a different SME state.

2. Thread page fault followed by waiting for I/O etc. and the kernel may
   switch to idle. Here it's probably less beneficial to do an SMSTOP.

Any other cases? Blocking syscalls don't matter since we don't preserve
the state between calls.

The trade-off is for case (2) above and it depends on whether it happens
sufficiently often to be noticeable. I wouldn't think so.

> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/include/asm/fpsimd.h |  2 ++
>  arch/arm64/kernel/fpsimd.c      | 14 ++++++++++++++
>  arch/arm64/kernel/process.c     |  5 +++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
> index bc69ac368d73..e1453a1d261d 100644
> --- a/arch/arm64/include/asm/fpsimd.h
> +++ b/arch/arm64/include/asm/fpsimd.h
> @@ -82,6 +82,8 @@ extern void fpsimd_restore_current_state(void);
>  extern void fpsimd_update_current_state(struct user_fpsimd_state const *state);
>  extern void fpsimd_kvm_prepare(void);
>  
> +extern void fpsimd_idle_enter(void);
> +
>  struct cpu_fp_state {
>  	struct user_fpsimd_state *st;
>  	void *sve_state;
> diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
> index 82e8a6017382..400eaad374a2 100644
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -2126,6 +2126,20 @@ static void __init fpsimd_pm_init(void)
>  static inline void fpsimd_pm_init(void) { }
>  #endif /* CONFIG_CPU_PM */
>  
> +void fpsimd_idle_enter(void)
> +{
> +	/*
> +	 * Leaving SME enabled may leave this core contending with
> > +	 * other cores if we have an SMCU; disable whenever we enter
> +	 * idle to avoid this.  Only do this if they're actually
> +	 * enabled to avoid overhead in cases where we don't enter a
> > +	 * low enough power state to lose register state.
> +	 */
> +	if (system_supports_sme() &&
> +	    (read_sysreg_s(SYS_SVCR) & (SVCR_SM_MASK | SVCR_ZA_MASK)))
> +		fpsimd_save_and_flush_cpu_state();
> +}

Do we always enter here via the idle thread? If we already had a thread
switch we probably don't need to save the SME state again, only flush
the state.

>  #ifdef CONFIG_HOTPLUG_CPU
>  static int fpsimd_cpu_dead(unsigned int cpu)
>  {
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 4ae31b7af6c3..fd616882cd49 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -68,6 +68,11 @@ EXPORT_SYMBOL(__stack_chk_guard);
>  void (*pm_power_off)(void);
>  EXPORT_SYMBOL_GPL(pm_power_off);
>  
> +void arch_cpu_idle_enter(void)
> +{
> +       fpsimd_idle_enter();
> +}
> +
>  #ifdef CONFIG_HOTPLUG_CPU
>  void __noreturn arch_cpu_idle_dead(void)
>  {
Mark Brown July 10, 2024, 11:28 p.m. UTC | #2
On Wed, Jul 10, 2024 at 09:10:53PM +0100, Catalin Marinas wrote:
> On Tue, Jun 18, 2024 at 02:57:42PM +0100, Mark Brown wrote:

> > When we enter the kernel we currently don't update any of the floating
> > point register state until either something else uses floating point or we
> > get a CPU_PM_ENTER notification during suspend or cpuidle. This means that
> > for a system which has been configured with both suspend and cpuidle
> > disabled we will leave whatever floating point state was last loaded in
> > the registers while a CPU is idling.

> I guess this approach is useful when the kernel does a light WFI rather
> than going all the way to firmware for a deep sleep state. In general,
> the shallower the sleep state is, the more CPU state is retained. From a
> power perspective, I wonder whether we should leave the decision to drop
> the SME state to a cpuidle driver.

The concern here is if we don't have a cpuidle driver - we could also
make this conditional on !CPUIDLE.

> Which situations should we consider for such an idle scenario (we discussed
> them offline briefly)? I think:

> 1. Thread migration: a thread using SME is moved to a different CPU.
>    Here SMSTOP makes sense because a new thread scheduled eventually
>    will need a different SME state.

> 2. Thread page fault followed by waiting for I/O etc. and the kernel may
>    switch to idle. Here it's probably less beneficial to do an SMSTOP.

> Any other cases? Blocking syscalls don't matter since we don't preserve
> the state between calls.

For syscalls we explicitly disable streaming mode, but we do allow ZA to
be active so you might have a blocking syscall with ZA enabled.  Having
state in ZA is less of a concern than streaming mode, it will have a
power impact but it is much less likely that there will be a performance
impact on other cores.

> The trade-off is for case (2) above and it depends on whether it happens
> sufficiently often to be noticeable. I wouldn't think so.

Yes, to me it seems much more likely that we would decide to schedule a
task out while it was using SME rather than getting faults where the
overhead of reloading the state was noticeable.

> > +	/*
> > +	 * Leaving SME enabled may leave this core contending with
> > +	 * other cores if we have an SMCU; disable whenever we enter
> > +	 * idle to avoid this.  Only do this if they're actually
> > +	 * enabled to avoid overhead in cases where we don't enter a
> > +	 * low enough power state to lose register state.
> > +	 */
> > +	if (system_supports_sme() &&
> > +	    (read_sysreg_s(SYS_SVCR) & (SVCR_SM_MASK | SVCR_ZA_MASK)))
> > +		fpsimd_save_and_flush_cpu_state();
> > +}

> Do we always enter here via the idle thread? If we already had a thread
> switch we probably don't need to save the SME state again, only flush
> the state.

If we've actually switched the thread then TIF_FOREIGN_FPSTATE has been
set and we'll just check the flag and return for the save portion rather
than actually writing any register values out so the overhead should be
minimal.  It feels safer to check in case we get better at doing the
save lazily.

Otherwise arch_cpu_idle_enter() is called from psci_checker as well as
the idle thread, this code should not be relevant either way in that
context since it runs before userspace and AIUI it's trying to do the
same thing as the idle thread there anyway.
Catalin Marinas Sept. 5, 2024, 5:51 p.m. UTC | #3
(replying to the previous version as it looks like I haven't followed up
on the discussion)

On Thu, Jul 11, 2024 at 12:28:12AM +0100, Mark Brown wrote:
> On Wed, Jul 10, 2024 at 09:10:53PM +0100, Catalin Marinas wrote:
> > On Tue, Jun 18, 2024 at 02:57:42PM +0100, Mark Brown wrote:
> > > +	/*
> > > +	 * Leaving SME enabled may leave this core contending with
> > > +	 * other cores if we have an SMCU; disable whenever we enter
> > > +	 * idle to avoid this.  Only do this if they're actually
> > > +	 * enabled to avoid overhead in cases where we don't enter a
> > > +	 * low enough power state to lose register state.
> > > +	 */
> > > +	if (system_supports_sme() &&
> > > +	    (read_sysreg_s(SYS_SVCR) & (SVCR_SM_MASK | SVCR_ZA_MASK)))
> > > +		fpsimd_save_and_flush_cpu_state();
> > > +}
> 
> > Do we always enter here via the idle thread? If we already had a thread
> > switch we probably don't need to save the SME state again, only flush
> > the state.
> 
> If we've actually switched the thread then TIF_FOREIGN_FPSTATE has been
> set and we'll just check the flag and return for the save portion rather
> than actually writing any register values out so the overhead should be
> minimal.  It feels safer to check in case we get better at doing the
> save lazily.

OK, so likely the state is already saved, all we need to do here is
flush the state and SMSTOP. But why would switching to idle be any
different than switching to a thread that doesn't use SME? It feels
like we are just trying to optimise a special case only. Could we not
instead issue an SMSTOP in the context switch code?

Also this looks hypothetical until we have some hardware to test it on,
see how it would behave with a shared SME unit.
Mark Brown Sept. 5, 2024, 6:34 p.m. UTC | #4
On Thu, Sep 05, 2024 at 06:51:30PM +0100, Catalin Marinas wrote:

> OK, so likely the state is already saved, all we need to do here is
> flush the state and SMSTOP. But why would switching to idle be any
> different than switching to a thread that doesn't use SME? It feels
> like we are just trying to optimise a special case only. Could we not
> instead issue an SMSTOP in the context switch code?

On context switch the SMSTOP is issued as part of loading the state for
the task but we only do that when either returning to userspace or it's
a kernel thread with active FPSIMD usage.  The idle thread is a kernel
thread with no FPSIMD usage so we don't touch the state.  If we did the
SMSTOP unconditionally that'd mean that the optimisation where we don't
reload the FP state if we bounce through a kernel thread would be broken
while using SME which doesn't seem ideal, idling really does seem like a
meaningfully special case here.

> Also this looks hypothetical until we have some hardware to test it on,
> see how it would behave with a shared SME unit.

The specific performance impacts will depend on hardware (there'll
likely be some power impact even on things with a single FP unit per
PE) but given that keeping SM and ZA disabled when not in use is a
fairly strong recommendation in the programming model my inclination at
this point would be to program to the advertised model until we have
confirmation that the hardware actually behaves otherwise.
Catalin Marinas Sept. 6, 2024, 2:56 p.m. UTC | #5
On Thu, Sep 05, 2024 at 07:34:41PM +0100, Mark Brown wrote:
> On Thu, Sep 05, 2024 at 06:51:30PM +0100, Catalin Marinas wrote:
> 
> > OK, so likely the state is already saved, all we need to do here is
> > flush the state and SMSTOP. But why would switching to idle be any
> > different than switching to a thread that doesn't use SME? It feels
> > like we are just trying to optimise a special case only. Could we not
> > instead issue an SMSTOP in the context switch code?
> 
> On context switch the SMSTOP is issued as part of loading the state for
> the task but we only do that when either returning to userspace or it's
> a kernel thread with active FPSIMD usage.  The idle thread is a kernel
> thread with no FPSIMD usage so we don't touch the state.  If we did the
> SMSTOP unconditionally that'd mean that the optimisation where we don't
> reload the FP state if we bounce through a kernel thread would be broken
> while using SME which doesn't seem ideal, idling really does seem like a
> meaningfully special case here.

It depends on why the CPU is idling, and we don't have the full
information in this function. If it was a wait on a syscall, we already
discarded the state (but we only issue sme_smstop_sm() IIUC). With this
patch, we'd disable the ZA storage as well, can it cause any performance
issues by forcing the user to re-fault?

If it's some short-lived wait for I/O on page faults, we may not want to
disable streaming mode. I don't see this last case much different from
switching to a kernel thread that doesn't use SME.

So I think this leaves us with the case where a thread is migrated to a
different CPU and the current CPU goes into idle for longer. But, again,
we can't tell in the arch callback. The cpuidle driver calling into
firmware is slightly better informed since it knows it's been idle (or
going to be) for longer.

> > Also this looks hypothetical until we have some hardware to test it on,
> > see how it would behave with a shared SME unit.
> 
> The specific performance impacts will depend on hardware (there'll
> likely be some power impact even on things with a single FP unit per
> PE) but given that keeping SM and ZA disabled when not in use is a
> fairly strong recommendation in the programming model my inclination at
> this point would be to program to the advertised model until we have
> confirmation that the hardware actually behaves otherwise.

Does the programming model talk about shared units (I haven't read it,
not even sure where it is)? I hope one CPU cannot DoS another by not
issuing SMSTOPs and the hardware has some provisions for sharing that
guarantee forward progress on all CPUs. They may not be optimal but
it's highly dependent on the software usage and hardware behaviour.

I'm inclined not to do anything at this stage until we see the actual
hardware behaviour in practice.
Mark Brown Sept. 6, 2024, 5:16 p.m. UTC | #6
On Fri, Sep 06, 2024 at 03:56:52PM +0100, Catalin Marinas wrote:
> On Thu, Sep 05, 2024 at 07:34:41PM +0100, Mark Brown wrote:

> > On context switch the SMSTOP is issued as part of loading the state for
> > the task but we only do that when either returning to userspace or it's
> > a kernel thread with active FPSIMD usage.  The idle thread is a kernel
> > thread with no FPSIMD usage so we don't touch the state.  If we did the
> > SMSTOP unconditionally that'd mean that the optimisation where we don't
> > reload the FP state if we bounce through a kernel thread would be broken
> > while using SME which doesn't seem ideal, idling really does seem like a
> > meaningfully special case here.

> It depends on why the CPU is idling and we don't have the whole
> information in this function. If it was a wait on a syscall, we already
> discarded the state (but we only issue sme_smstop_sm() IIUC). With this
> patch, we'd disable the ZA storage as well, can it cause any performance
> issues by forcing the user to re-fault?

There will be some overhead from reloading the FP state, yes.

> If it's some short-lived wait for I/O on page faults, we may not want to
> disable streaming mode. I don't see this last case much different from
> switching to a kernel thread that doesn't use SME.

Yeah, that one is going to depend a lot on how performant the I/O is.

> So I think this leaves us with the case where a thread is migrated to a
> different CPU and the current CPU goes into idle for longer. But, again,
> we can't tell in the arch callback. The cpuidle driver calling into
> firmware is slightly better informed since it knows it's been idle (or
> going to be) for longer.

Yes, cpuidle is a whole different case - this is mainly targeted at the
case where that's been disabled in the kernel configuration (I was
considering making this conditional on !CPUIDLE, it was an oversight not
to do that in the first place).

> > > Also this looks hypothetical until we have some hardware to test it on,
> > > see how it would behave with a shared SME unit.

> > The specific performance impacts will depend on hardware (there'll
> > likely be some power impact even on things with a single FP unit per
> > PE) but given that keeping SM and ZA disabled when not in use is a
> > fairly strong recommendation in the programming model my inclination at
> > this point would be to program to the advertised model until we have
> > confirmation that the hardware actually behaves otherwise.

> Does the programming model talk about shared units (I haven't read it,
> not even sure where it is)? I hope one CPU cannot DoS another by not
> issuing SMSTOPs and the hardware has some provisions for sharing that
> guarantee forward progress on all CPUs. They may not be optimal but
> it's highly dependent on the software usage and hardware behaviour.

This is all getting totally into IMPDEF behaviour (and QoI issues) but
implementations are supposed to default to something which shares things
equally between all the users and guarantees forward progress.  Anything
that doesn't guarantee forward progress would obviously be quite
specialist, and you'd hope that if the PE isn't actually issuing FP
instructions it won't impact anything.  Even if sharing does work well
you'll still have the cost of keeping the unit on though.

> I'm inclined not to do anything at this stage until we see the actual
> hardware behaviour in practice.

Like I say my inclination is the opposite way round, though probably
with a check for !CONFIG_CPUIDLE.

Patch

diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index bc69ac368d73..e1453a1d261d 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -82,6 +82,8 @@  extern void fpsimd_restore_current_state(void);
 extern void fpsimd_update_current_state(struct user_fpsimd_state const *state);
 extern void fpsimd_kvm_prepare(void);
 
+extern void fpsimd_idle_enter(void);
+
 struct cpu_fp_state {
 	struct user_fpsimd_state *st;
 	void *sve_state;
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 82e8a6017382..400eaad374a2 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -2126,6 +2126,20 @@  static void __init fpsimd_pm_init(void)
 static inline void fpsimd_pm_init(void) { }
 #endif /* CONFIG_CPU_PM */
 
+void fpsimd_idle_enter(void)
+{
+	/*
+	 * Leaving SME enabled may leave this core contending with
+	 * other cores if we have an SMCU; disable whenever we enter
+	 * idle to avoid this.  Only do this if they're actually
+	 * enabled to avoid overhead in cases where we don't enter a
+	 * low enough power state to lose register state.
+	 */
+	if (system_supports_sme() &&
+	    (read_sysreg_s(SYS_SVCR) & (SVCR_SM_MASK | SVCR_ZA_MASK)))
+		fpsimd_save_and_flush_cpu_state();
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 static int fpsimd_cpu_dead(unsigned int cpu)
 {
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 4ae31b7af6c3..fd616882cd49 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -68,6 +68,11 @@  EXPORT_SYMBOL(__stack_chk_guard);
 void (*pm_power_off)(void);
 EXPORT_SYMBOL_GPL(pm_power_off);
 
+void arch_cpu_idle_enter(void)
+{
+       fpsimd_idle_enter();
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 void __noreturn arch_cpu_idle_dead(void)
 {