arm64/fpsimd: Ensure that offlined CPUs are not using SME

Message ID 20240618-arm64-fpsimd-sme-cpu-die-v1-1-9a90d1a34918@kernel.org (mailing list archive)
State New, archived
Series arm64/fpsimd: Ensure that offlined CPUs are not using SME

Commit Message

Mark Brown June 18, 2024, 2:03 p.m. UTC
When we use CPU hotplug to offline a CPU we may transition directly from
running a task which was using SME to the CPU being offlined. This means
that PSTATE.{SM,ZA} may still be set, indicating to the system that SME is
still in use. This could create contention with other still running CPUs if
the system uses shared SMCUs.

For most systems this shouldn't be an issue, we should have PSCI or some
other power management mechanism which will take care of this as part of
offlining the CPU. However we do still have support for spin tables, and it
is possible that system firmware may not be ideally implemented, so let's
explicitly disable SME during the process of offlining the CPU in order to
ensure there's no spurious contention.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/kernel/smp.c | 4 ++++
 1 file changed, 4 insertions(+)


---
base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
change-id: 20240617-arm64-fpsimd-sme-cpu-die-57205c7f220e

Best regards,
-- 
Mark Brown <broonie@kernel.org>

Comments

Mark Rutland June 18, 2024, 2:51 p.m. UTC | #1
On Tue, Jun 18, 2024 at 03:03:50PM +0100, Mark Brown wrote:
> When we use CPU hotplug to offline a CPU we may transition directly from
> running a task which was using SME to the CPU being offlined. This means
> that PSTATE.{SM,ZA} may still be set, indicating to the system that SME is
> still in use. This could create contention with other still running CPUs if
> the system uses shared SMCUs.

Does it actually cause contention if the CPU isn't issuing SME
instructions?

Is this theoretical or something you see in practice?

> For most systems this shouldn't be an issue, we should have PSCI or some
> other power management mechanism which will take care of this as part of
> offlining the CPU. However we do still have support for spin tables,

I don't think spin-table is relevant; there's no support whatsoever for
offlining CPUs with spin-table (and offlining will be rejected long
before cpu_die()).

> and it is possible that system firmware may not be ideally
> implemented, so let's explicitly disable SME during the process of
> offlining the CPU in order to ensure there's no spurious contention.

If this is an issue, surely it's the same with idle, or any other long
period spent in the kernel, or any long period where userspace leaves
the CPU in streaming mode?

It feels very odd that we should need to do something for cpu offlining
in particular.

Mark.

> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/kernel/smp.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 31c8b3094dd7..9e8fc6ac758a 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -383,6 +383,10 @@ void __noreturn cpu_die(void)
>  	/* Tell cpuhp_bp_sync_dead() that this CPU is now safe to dispose of */
>  	cpuhp_ap_report_dead();
>  
> +	/* Ensure we are not spuriously contending any SMCU */
> +	if (system_supports_sme())
> +		sme_smstop();
> +
>  	/*
>  	 * Actually shutdown the CPU. This must never fail. The specific hotplug
>  	 * mechanism must perform all required cache maintenance to ensure that
> 
> ---
> base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
> change-id: 20240617-arm64-fpsimd-sme-cpu-die-57205c7f220e
> 
> Best regards,
> -- 
> Mark Brown <broonie@kernel.org>
> 
>
Mark Brown June 18, 2024, 3:43 p.m. UTC | #2
On Tue, Jun 18, 2024 at 03:51:47PM +0100, Mark Rutland wrote:
> On Tue, Jun 18, 2024 at 03:03:50PM +0100, Mark Brown wrote:

> > When we use CPU hotplug to offline a CPU we may transition directly from
> > running a task which was using SME to the CPU being offlined. This means
> > that PSTATE.{SM,ZA} may still be set, indicating to the system that SME is
> > still in use. This could create contention with other still running CPUs if
> > the system uses shared SMCUs.

> Does it actually cause contention if the CPU isn't issuing SME
> instructions?

It was misbehaving, I didn't dig into the specifics of how.  There will
be a power impact too regardless of any instructions being issued.

> Is this theoretical or something you see in practice?

It was inspired by a report, the reporter was able to fix their firmware
to be more sensible and issue the SMSTOP itself but it seemed like
reasonable defensiveness/politeness for us to release the resource
anyway.

> I don't think spin-table is relevant; there's no support whatsoever for
> offlining CPUs with spin-table (and offlining will be rejected long
> before cpu_die()).

Ah, good - I didn't spend enough time to convince myself there were no
situations where we'd try to take down the CPU anyway.

> > and it is possible that system firmware may not be ideally
> > implemented, so let's explicitly disable SME during the process of
> > offlining the CPU in order to ensure there's no spurious contention.

> If this is an issue, surely it's the same with idle, or any other long
> period spent in the kernel, or any long period where userspace leaves
> the CPU in streaming mode?

> It feels very odd that we should need to do something for cpu offlining
> in particular.

Yes, it's an issue for idle too in the case where we're not using
cpuidle - I sent a separate patch for that.  cpuidle should already
cover this either itself or when it notifies us that register state
will be lost.  

A good chunk of the other users that spend noticeable time in kernel mode
will be using kernel mode floating point so will disable SME anyway due to
that, and for everything else there's a tricky tradeoff between how long
we're spending in the kernel, how much pressure is being applied and the
likelihood of returning to the same userspace process.  That feels like
we need some more real world experience to see what if anything is
needed.

Patch

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 31c8b3094dd7..9e8fc6ac758a 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -383,6 +383,10 @@  void __noreturn cpu_die(void)
 	/* Tell cpuhp_bp_sync_dead() that this CPU is now safe to dispose of */
 	cpuhp_ap_report_dead();
 
+	/* Ensure we are not spuriously contending any SMCU */
+	if (system_supports_sme())
+		sme_smstop();
+
 	/*
 	 * Actually shutdown the CPU. This must never fail. The specific hotplug
 	 * mechanism must perform all required cache maintenance to ensure that
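
To make the effect of the change concrete, here is a minimal userspace sketch in C. It is not kernel code. It models SVCR as a plain integer, with bit 0 as PSTATE.SM and bit 1 as PSTATE.ZA, and shows why a CPU offlined straight from a task that was using SME still appears to be using it unless SMSTOP is issued first. The names model_cpu, model_smstop(), model_cpu_die(), model_contends_smcu() and model_demo() are hypothetical and exist only for illustration; the real change is the sme_smstop() call in the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SVCR bit positions as defined by the SME architecture. */
#define SVCR_SM (UINT64_C(1) << 0) /* PSTATE.SM: streaming mode enabled */
#define SVCR_ZA (UINT64_C(1) << 1) /* PSTATE.ZA: ZA storage enabled */

/* Hypothetical model of one CPU; this is not a kernel structure. */
struct model_cpu {
	uint64_t svcr; /* stands in for the SVCR system register */
	bool has_sme;  /* stands in for system_supports_sme() */
	bool online;
};

/* Models SMSTOP with no operand: clears both PSTATE.SM and PSTATE.ZA. */
static void model_smstop(struct model_cpu *cpu)
{
	cpu->svcr &= ~(SVCR_SM | SVCR_ZA);
}

/* Models the patched offline path: release SME state, then go offline. */
static void model_cpu_die(struct model_cpu *cpu)
{
	if (cpu->has_sme)
		model_smstop(cpu);
	cpu->online = false;
}

/* True if the CPU still signals SME use to a (possibly shared) SMCU. */
static bool model_contends_smcu(const struct model_cpu *cpu)
{
	return (cpu->svcr & (SVCR_SM | SVCR_ZA)) != 0;
}

/* Self-check: a CPU taken offline directly from a task that was using SME. */
static int model_demo(void)
{
	struct model_cpu cpu = {
		.svcr = SVCR_SM | SVCR_ZA,
		.has_sme = true,
		.online = true,
	};

	assert(model_contends_smcu(&cpu)); /* SME still looks in use */
	model_cpu_die(&cpu);
	assert(!cpu.online);

	/* Returns 0 when the offlined CPU no longer signals SME use. */
	return model_contends_smcu(&cpu) ? 1 : 0;
}
```

Without the model_smstop() call in model_cpu_die(), the modelled SVCR bits would survive the offline transition. That is the situation the commit message describes for firmware that does not clear them itself.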