[RFC,2/4] arm64: use IPIs to pause/resume remote CPUs

Message ID 20240806022114.3320543-3-yuzhao@google.com (mailing list archive)
State New, archived
Series mm/arm64: re-enable HVO

Commit Message

Yu Zhao Aug. 6, 2024, 2:21 a.m. UTC
Use pseudo-NMI IPIs to pause remote CPUs for a short period of time,
and then reliably resume them when the local CPU exits critical
sections that preclude the execution of remote CPUs.

A typical example of such critical sections is BBM on kernel PTEs.
HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by commit
060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
due to the following reason:

  This is deemed UNPREDICTABLE by the Arm architecture without a
  break-before-make sequence (make the PTE invalid, TLBI, write the
  new valid PTE). However, such sequence is not possible since the
  vmemmap may be concurrently accessed by the kernel.

Supporting BBM on kernel PTEs is one of the approaches that can
potentially make arm64 support HVO.
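
For illustration (a sketch only, not part of this patch), a BBM
sequence on a kernel PTE bracketed by the new helpers might look
like this, using the generic PTE and TLB helpers:

  /*
   * Illustrative sketch: remap a kernel PTE via break-before-make
   * while all remote CPUs are paused, so no CPU can access memory
   * through the transiently invalid PTE.
   */
  static void bbm_remap_kernel_pte(unsigned long addr, pte_t *ptep, pte_t new_pte)
  {
  	/* pause_remote_cpus() asserts both of these. */
  	cpus_read_lock();
  	preempt_disable();

  	pause_remote_cpus();

  	pte_clear(&init_mm, addr, ptep);		/* 1. make the PTE invalid */
  	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);	/* 2. TLBI */
  	set_pte(ptep, new_pte);				/* 3. write the new valid PTE */

  	resume_remote_cpus();

  	preempt_enable();
  	cpus_read_unlock();
  }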

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/arm64/include/asm/smp.h |   3 +
 arch/arm64/kernel/smp.c      | 110 +++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

Comments

David Hildenbrand Aug. 6, 2024, 9:12 a.m. UTC | #1
[...]

> +
> +void resume_remote_cpus(void)
> +{
> +	cpumask_t cpus_to_resume;
> +
> +	lockdep_assert_cpus_held();
> +	lockdep_assert_preemption_disabled();
> +
> +	cpumask_copy(&cpus_to_resume, cpu_online_mask);
> +	cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume);
> +
> +	spin_lock(&cpu_pause_lock);
> +
> +	cpumask_setall(&resumed_cpus);
> +	/* A typical example for sleep and wake-up functions. */
> +	smp_mb();
> +	while (cpumask_intersects(&cpus_to_resume, &paused_cpus)) {
> +		sev();
> +		cpu_relax();
> +		barrier();
> +	}
>

I'm curious: is there a fundamental reason why we wait for paused CPUs 
to actually start running again, or is it simply easier to get the 
implementation race-free? In particular, when we have two 
pause_remote_cpus() calls shortly after each other, another remote 
CPU might still be on its way out of pause_local_cpu() from the first pause.
Doug Anderson Aug. 8, 2024, 4:09 p.m. UTC | #2
Hi,

On Mon, Aug 5, 2024 at 7:21 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Use pseudo-NMI IPIs to pause remote CPUs for a short period of time,
> and then reliably resume them when the local CPU exits critical
> sections that preclude the execution of remote CPUs.
>
> A typical example of such critical sections is BBM on kernel PTEs.
> HugeTLB Vmemmap Optimization (HVO) on arm64 was disabled by commit
> 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
> due to the following reason:
>
>   This is deemed UNPREDICTABLE by the Arm architecture without a
>   break-before-make sequence (make the PTE invalid, TLBI, write the
>   new valid PTE). However, such sequence is not possible since the
>   vmemmap may be concurrently accessed by the kernel.
>
> Supporting BBM on kernel PTEs is one of the approaches that can
> potentially make arm64 support HVO.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  arch/arm64/include/asm/smp.h |   3 +
>  arch/arm64/kernel/smp.c      | 110 +++++++++++++++++++++++++++++++++++
>  2 files changed, 113 insertions(+)

I'm a bit curious how your approach is reliable / performant in all
cases. As far as I understand it:

1. Patch #4 in your series unconditionally turns on
"ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP" for arm64.

2. In order for it to work reliably, you need the "pause all CPUs"
functionality introduced in this patch.

3. In order for the "pause all CPUs" functionality to be performant
you need NMI or, at least, pseudo-NMI to be used to pause all CPUs.

4. Even when you configure the kernel for pseudo-NMI it's not 100%
guaranteed that pseudo-NMI will be turned on. Specifically:

4a) There's an extra kernel command-line parameter you need to
actually enable pseudo-NMI. We can debate whether pseudo-NMI should be
something you can turn on without the command-line parameter, but at
the moment the parameter is there because pseudo-NMI has performance
implications, and apparently those implications are non-trivial on
some early arm64 CPUs.

4b) Even if we changed it so that the command-line parameter wasn't
needed, there are still some boards out there that are known not to be
able to enable pseudo-NMI. There are certainly some Mediatek
Chromebooks that have a BIOS bug making pseudo-NMI unreliable. See the
`mediatek,broken-save-restore-fw` device tree property. ...and even if
you ignore the Mediatek Chromebooks, there's at least one more system
I know of that's broken with pseudo-NMI. Since you're at Google, you
could look at b/308278090 for details but the quick summary is that
some devices running a TEE hang when pseudo-NMI is enabled.
...and, even if that's fixed, it feels somewhat likely that there are
other systems where pseudo-NMI won't be usable.


Unless I'm misunderstanding, it feels like anything you have that
relies on NMI/pseudo-NMI needs to fall back safely/reliably if
NMI/pseudo-NMI isn't there.
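
For example (just a sketch of what I mean, not code from this series;
the predicate is the same one ipi_should_be_nmi() already checks
first), any feature depending on the fast pause path could be gated
at runtime:

  /*
   * Hypothetical runtime gate: only advertise support for features
   * that need a fast "pause all CPUs" when the pause IPI would
   * actually be delivered as a (pseudo-)NMI.
   */
  static bool cpu_pause_is_performant(void)	/* made-up name */
  {
  	return system_uses_irq_prio_masking();
  }

...and callers without that guarantee would need a safe fallback (or
to leave HVO disabled, as today).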


> diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> index 2510eec026f7..cffb0cfed961 100644
> --- a/arch/arm64/include/asm/smp.h
> +++ b/arch/arm64/include/asm/smp.h
> @@ -133,6 +133,9 @@ bool cpus_are_stuck_in_kernel(void);
>  extern void crash_smp_send_stop(void);
>  extern bool smp_crash_stop_failed(void);
>
> +void pause_remote_cpus(void);
> +void resume_remote_cpus(void);
> +
>  #endif /* ifndef __ASSEMBLY__ */
>
>  #endif /* ifndef __ASM_SMP_H */
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 5e18fbcee9a2..aa80266e5c9d 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -68,16 +68,25 @@ enum ipi_msg_type {
>         IPI_RESCHEDULE,
>         IPI_CALL_FUNC,
>         IPI_CPU_STOP,
> +       IPI_CPU_PAUSE,
> +#ifdef CONFIG_KEXEC_CORE
>         IPI_CPU_CRASH_STOP,
> +#endif
> +#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
>         IPI_TIMER,
> +#endif
> +#ifdef CONFIG_IRQ_WORK
>         IPI_IRQ_WORK,
> +#endif

I assume all these "ifdefs" are there because this adds up to more
than 8 IPIs. That means that someone wouldn't be able to enable all of
these things, right? Feels like we'd want to solve this before landing
things. In the least it would be good if this built upon:

https://lore.kernel.org/r/20240625160718.v2.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid/

...and then maybe we could figure out if there are other ways to
consolidate NMIs. Previously, for instance, we had the "KGDB" and
"backtrace" IPIs combined into one but we split them upon review
feedback. If necessary they would probably be easy to re-combine.
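
For reference, re-combining them could look something like the sketch
below (IPI_CPU_DEBUG is a made-up name); it works because both
backends already ignore CPUs that weren't targeted:

  	case IPI_CPU_DEBUG:	/* hypothetical combined IPI */
  		/*
  		 * nmi_cpu_backtrace() returns true only when this CPU
  		 * is in the requested backtrace mask, and
  		 * kgdb_nmicallback() returns without entering the
  		 * debugger when no round-up is in progress.
  		 */
  		if (!nmi_cpu_backtrace(get_irq_regs()))
  			kgdb_nmicallback(cpu, get_irq_regs());
  		break;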

Patch

diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index 2510eec026f7..cffb0cfed961 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -133,6 +133,9 @@  bool cpus_are_stuck_in_kernel(void);
 extern void crash_smp_send_stop(void);
 extern bool smp_crash_stop_failed(void);
 
+void pause_remote_cpus(void);
+void resume_remote_cpus(void);
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 5e18fbcee9a2..aa80266e5c9d 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -68,16 +68,25 @@  enum ipi_msg_type {
 	IPI_RESCHEDULE,
 	IPI_CALL_FUNC,
 	IPI_CPU_STOP,
+	IPI_CPU_PAUSE,
+#ifdef CONFIG_KEXEC_CORE
 	IPI_CPU_CRASH_STOP,
+#endif
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 	IPI_TIMER,
+#endif
+#ifdef CONFIG_IRQ_WORK
 	IPI_IRQ_WORK,
+#endif
 	NR_IPI,
 	/*
 	 * Any enum >= NR_IPI and < MAX_IPI is special and not tracable
 	 * with trace_ipi_*
 	 */
 	IPI_CPU_BACKTRACE = NR_IPI,
+#ifdef CONFIG_KGDB
 	IPI_KGDB_ROUNDUP,
+#endif
 	MAX_IPI
 };
 
@@ -821,11 +830,20 @@  static const char *ipi_types[MAX_IPI] __tracepoint_string = {
 	[IPI_RESCHEDULE]	= "Rescheduling interrupts",
 	[IPI_CALL_FUNC]		= "Function call interrupts",
 	[IPI_CPU_STOP]		= "CPU stop interrupts",
+	[IPI_CPU_PAUSE]		= "CPU pause interrupts",
+#ifdef CONFIG_KEXEC_CORE
 	[IPI_CPU_CRASH_STOP]	= "CPU stop (for crash dump) interrupts",
+#endif
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 	[IPI_TIMER]		= "Timer broadcast interrupts",
+#endif
+#ifdef CONFIG_IRQ_WORK
 	[IPI_IRQ_WORK]		= "IRQ work interrupts",
+#endif
 	[IPI_CPU_BACKTRACE]	= "CPU backtrace interrupts",
+#ifdef CONFIG_KGDB
 	[IPI_KGDB_ROUNDUP]	= "KGDB roundup interrupts",
+#endif
 };
 
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr);
@@ -884,6 +902,85 @@  void __noreturn panic_smp_self_stop(void)
 	local_cpu_stop();
 }
 
+static DEFINE_SPINLOCK(cpu_pause_lock);
+static cpumask_t paused_cpus;
+static cpumask_t resumed_cpus;
+
+static void pause_local_cpu(void)
+{
+	int cpu = smp_processor_id();
+
+	cpumask_clear_cpu(cpu, &resumed_cpus);
+	/*
+	 * Paired with pause_remote_cpus() to confirm that this CPU not only
+	 * will be paused but also can be reliably resumed.
+	 */
+	smp_wmb();
+	cpumask_set_cpu(cpu, &paused_cpus);
+	/* A typical example for sleep and wake-up functions. */
+	smp_mb();
+	while (!cpumask_test_cpu(cpu, &resumed_cpus)) {
+		wfe();
+		barrier();
+	}
+	barrier();
+	cpumask_clear_cpu(cpu, &paused_cpus);
+}
+
+void pause_remote_cpus(void)
+{
+	cpumask_t cpus_to_pause;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_preemption_disabled();
+
+	cpumask_copy(&cpus_to_pause, cpu_online_mask);
+	cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+	spin_lock(&cpu_pause_lock);
+
+	WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
+
+	smp_cross_call(&cpus_to_pause, IPI_CPU_PAUSE);
+
+	while (!cpumask_equal(&cpus_to_pause, &paused_cpus)) {
+		cpu_relax();
+		barrier();
+	}
+	/*
+	 * Paired with pause_local_cpu() to confirm that all CPUs not only will
+	 * be paused but also can be reliably resumed.
+	 */
+	smp_rmb();
+	WARN_ON_ONCE(cpumask_intersects(&cpus_to_pause, &resumed_cpus));
+
+	spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+	cpumask_t cpus_to_resume;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_preemption_disabled();
+
+	cpumask_copy(&cpus_to_resume, cpu_online_mask);
+	cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume);
+
+	spin_lock(&cpu_pause_lock);
+
+	cpumask_setall(&resumed_cpus);
+	/* A typical example for sleep and wake-up functions. */
+	smp_mb();
+	while (cpumask_intersects(&cpus_to_resume, &paused_cpus)) {
+		sev();
+		cpu_relax();
+		barrier();
+	}
+
+	spin_unlock(&cpu_pause_lock);
+}
+
 #ifdef CONFIG_KEXEC_CORE
 static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0);
 #endif
@@ -963,6 +1060,11 @@  static void do_handle_IPI(int ipinr)
 		local_cpu_stop();
 		break;
 
+	case IPI_CPU_PAUSE:
+		pause_local_cpu();
+		break;
+
+#ifdef CONFIG_KEXEC_CORE
 	case IPI_CPU_CRASH_STOP:
 		if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
 			ipi_cpu_crash_stop(cpu, get_irq_regs());
@@ -970,6 +1072,7 @@  static void do_handle_IPI(int ipinr)
 			unreachable();
 		}
 		break;
+#endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 	case IPI_TIMER:
@@ -991,9 +1094,11 @@  static void do_handle_IPI(int ipinr)
 		nmi_cpu_backtrace(get_irq_regs());
 		break;
 
+#ifdef CONFIG_KGDB
 	case IPI_KGDB_ROUNDUP:
 		kgdb_nmicallback(cpu, get_irq_regs());
 		break;
+#endif
 
 	default:
 		pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr);
@@ -1023,9 +1128,14 @@  static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
 
 	switch (ipi) {
 	case IPI_CPU_STOP:
+	case IPI_CPU_PAUSE:
+#ifdef CONFIG_KEXEC_CORE
 	case IPI_CPU_CRASH_STOP:
+#endif
 	case IPI_CPU_BACKTRACE:
+#ifdef CONFIG_KGDB
 	case IPI_KGDB_ROUNDUP:
+#endif
 		return true;
 	default:
 		return false;