mbox series

[v9,0/7] arm64: Add debug IPI for backtraces / kgdb; try to use NMI for it

Message ID 20230601213440.2488667-1-dianders@chromium.org (mailing list archive)
Headers show
Series arm64: Add debug IPI for backtraces / kgdb; try to use NMI for it | expand

Message

Doug Anderson June 1, 2023, 9:31 p.m. UTC
This is an attempt to resurrect Sumit's old patch series [1] that
allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
also to round up CPUs in kdb/kgdb. The last post from Sumit that I
could find was v7, so I started my series at v8. I haven't copied all
of his old changelongs here, but you can find them from the link.

I'm really looking for a way to land this patch series. In response to
v8, Mark Rutland indicated [2] that he was worried about the soundness
of pseudo NMI. Those definitely need to get fixed, but IMO this patch
series could still land in the meantime. That would at least let
people test with it.

Request for anyone reading this: please help indicate your support of
this patch series landing by replying, even if you don't have the
background for a full review. My suspicion is that there are a lot of
people who agree that this would be super useful to get landed.

Since v8, I have cleaned up this patch series by integrating the 10th
patch from v8 [3] into the whole series. As part of this, I renamed
the "NMI IPI" to the "debug IPI" since it could now be backed by a
regular IPI in the case that pseudo NMIs weren't available. With the
fallback, this allowed me to drop some extra patches from the
series. This feels (to me) to be pretty clean and hopefully others
agree. Any patch I touched significantly I removed Masayoshi and
Chen-Yu's tags from.

...also in v8, I reorderd the patches a bit in a way that seemed a
little cleaner to me.

Since v7, I have:
* Addressed the small amount of feedback that was there for v7.
* Rebased.
* Added a new patch that prevents us from spamming the logs with idle
  tasks.
* Added an extra patch to gracefully fall back to regular IPIs if
  pseudo-NMIs aren't there.

It can be noted that this patch series works very well with the recent
"hardlockup" patches that have landed through Andrew Morton's tree and
are currently in linuxnext. It works especially well with the "buddy"
lockup detector.

[1] https://lore.kernel.org/linux-arm-kernel/1604317487-14543-1-git-send-email-sumit.garg@linaro.org/
[2] https://lore.kernel.org/lkml/ZFvGqD%2F%2Fpm%2FlZb+p@FVFF77S0Q05N.cambridge.arm.com/
[3] https://lore.kernel.org/r/20230419155341.v8.10.Ic3659997d6243139d0522fc3afcdfd88d7a5f030@changeid/

Changes in v9:
- Add a warning if we don't have enough IPIs for the NMI IPI
- Added comments that we might not be using NMI always.
- Added missing "inline"
- Added to commit message that this doesn't catch all cases.
- Fold in v8 patch #10 ("Fallback to a regular IPI if NMI isn't enabled")
- Moved header file out of "include" since it didn't need to be there.
- Remove arm64_supports_nmi()
- Remove fallback for when debug IPI isn't available.
- Renamed "NMI IPI" to "debug IPI" since it might not be backed by NMI.
- Update commit description
- arch_trigger_cpumask_backtrace() no longer returns bool

Changes in v8:
- "Provide a stub kgdb_nmicallback() if !CONFIG_KGDB" new for v8
- "Tag the arm64 idle functions as __cpuidle" new for v8
- Removed "#ifdef CONFIG_SMP" since arm64 is always SMP
- debug_ipi_setup() and debug_ipi_teardown() no longer take cpu param

Douglas Anderson (2):
  arm64: idle: Tag the arm64 idle functions as __cpuidle
  kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB

Sumit Garg (5):
  irqchip/gic-v3: Enable support for SGIs to act as NMIs
  arm64: Add framework for a debug IPI
  arm64: smp: Assign and setup the debug IPI
  arm64: ipi_debug: Add support for backtrace using the debug IPI
  arm64: kgdb: Roundup cpus using the debug IPI

 arch/arm64/include/asm/irq.h  |   3 +
 arch/arm64/kernel/Makefile    |   2 +-
 arch/arm64/kernel/idle.c      |   4 +-
 arch/arm64/kernel/ipi_debug.c | 102 ++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/ipi_debug.h |  13 +++++
 arch/arm64/kernel/kgdb.c      |  14 +++++
 arch/arm64/kernel/smp.c       |  11 ++++
 drivers/irqchip/irq-gic-v3.c  |  29 +++++++---
 include/linux/kgdb.h          |   1 +
 9 files changed, 168 insertions(+), 11 deletions(-)
 create mode 100644 arch/arm64/kernel/ipi_debug.c
 create mode 100644 arch/arm64/kernel/ipi_debug.h

Comments

Sumit Garg June 2, 2023, 5:19 a.m. UTC | #1
Hi Doug,

On Fri, 2 Jun 2023 at 03:07, Douglas Anderson <dianders@chromium.org> wrote:
>
> This is an attempt to resurrect Sumit's old patch series [1] that
> allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
> also to round up CPUs in kdb/kgdb. The last post from Sumit that I
> could find was v7, so I started my series at v8. I haven't copied all
> of his old changelongs here, but you can find them from the link.
>
> I'm really looking for a way to land this patch series. In response to
> v8, Mark Rutland indicated [2] that he was worried about the soundness
> of pseudo NMI. Those definitely need to get fixed, but IMO this patch
> series could still land in the meantime. That would at least let
> people test with it.

+1

>
> Request for anyone reading this: please help indicate your support of
> this patch series landing by replying, even if you don't have the
> background for a full review. My suspicion is that there are a lot of
> people who agree that this would be super useful to get landed.
>

You can ofcourse count me in here. This will certainly bring NMI
debugging capabilities to arm64 and I have been advocating for that
since the advent of pseudo NMIs on arm64.

> Since v8, I have cleaned up this patch series by integrating the 10th
> patch from v8 [3] into the whole series. As part of this, I renamed
> the "NMI IPI" to the "debug IPI" since it could now be backed by a
> regular IPI in the case that pseudo NMIs weren't available. With the
> fallback, this allowed me to drop some extra patches from the
> series. This feels (to me) to be pretty clean and hopefully others
> agree. Any patch I touched significantly I removed Masayoshi and
> Chen-Yu's tags from.
>
> ...also in v8, I reorderd the patches a bit in a way that seemed a
> little cleaner to me.

This cleanup looks good to me.

-Sumit

>
> Since v7, I have:
> * Addressed the small amount of feedback that was there for v7.
> * Rebased.
> * Added a new patch that prevents us from spamming the logs with idle
>   tasks.
> * Added an extra patch to gracefully fall back to regular IPIs if
>   pseudo-NMIs aren't there.
>
> It can be noted that this patch series works very well with the recent
> "hardlockup" patches that have landed through Andrew Morton's tree and
> are currently in linuxnext. It works especially well with the "buddy"
> lockup detector.
>
> [1] https://lore.kernel.org/linux-arm-kernel/1604317487-14543-1-git-send-email-sumit.garg@linaro.org/
> [2] https://lore.kernel.org/lkml/ZFvGqD%2F%2Fpm%2FlZb+p@FVFF77S0Q05N.cambridge.arm.com/
> [3] https://lore.kernel.org/r/20230419155341.v8.10.Ic3659997d6243139d0522fc3afcdfd88d7a5f030@changeid/
>
> Changes in v9:
> - Add a warning if we don't have enough IPIs for the NMI IPI
> - Added comments that we might not be using NMI always.
> - Added missing "inline"
> - Added to commit message that this doesn't catch all cases.
> - Fold in v8 patch #10 ("Fallback to a regular IPI if NMI isn't enabled")
> - Moved header file out of "include" since it didn't need to be there.
> - Remove arm64_supports_nmi()
> - Remove fallback for when debug IPI isn't available.
> - Renamed "NMI IPI" to "debug IPI" since it might not be backed by NMI.
> - Update commit description
> - arch_trigger_cpumask_backtrace() no longer returns bool
>
> Changes in v8:
> - "Provide a stub kgdb_nmicallback() if !CONFIG_KGDB" new for v8
> - "Tag the arm64 idle functions as __cpuidle" new for v8
> - Removed "#ifdef CONFIG_SMP" since arm64 is always SMP
> - debug_ipi_setup() and debug_ipi_teardown() no longer take cpu param
>
> Douglas Anderson (2):
>   arm64: idle: Tag the arm64 idle functions as __cpuidle
>   kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB
>
> Sumit Garg (5):
>   irqchip/gic-v3: Enable support for SGIs to act as NMIs
>   arm64: Add framework for a debug IPI
>   arm64: smp: Assign and setup the debug IPI
>   arm64: ipi_debug: Add support for backtrace using the debug IPI
>   arm64: kgdb: Roundup cpus using the debug IPI
>
>  arch/arm64/include/asm/irq.h  |   3 +
>  arch/arm64/kernel/Makefile    |   2 +-
>  arch/arm64/kernel/idle.c      |   4 +-
>  arch/arm64/kernel/ipi_debug.c | 102 ++++++++++++++++++++++++++++++++++
>  arch/arm64/kernel/ipi_debug.h |  13 +++++
>  arch/arm64/kernel/kgdb.c      |  14 +++++
>  arch/arm64/kernel/smp.c       |  11 ++++
>  drivers/irqchip/irq-gic-v3.c  |  29 +++++++---
>  include/linux/kgdb.h          |   1 +
>  9 files changed, 168 insertions(+), 11 deletions(-)
>  create mode 100644 arch/arm64/kernel/ipi_debug.c
>  create mode 100644 arch/arm64/kernel/ipi_debug.h
>
> --
> 2.41.0.rc2.161.g9c6817b8e7-goog
>
Mark Rutland Aug. 7, 2023, 10:41 a.m. UTC | #2
Hi Doug,

Apologies for the delay.

On Mon, Jul 24, 2023 at 08:55:44AM -0700, Doug Anderson wrote:
> On Thu, Jun 1, 2023 at 2:37 PM Douglas Anderson <dianders@chromium.org> wrote:
> I'm looking for some ideas on what to do to move this patch series
> forward. Thanks to Daniel, the kgdb patch is now in Linus's tree which
> hopefully makes this simpler to land. I guess there is still the
> irqchip dependency that will need to be sorted out, though...
> 
> Even if folks aren't in agreement about whether this is ready to be
> enabled in production, I don't think anything here is super
> objectionable or controversial, is it? Can we land it? If you feel
> like it needs extra review, would it help if I tried to drum up some
> extra people to provide review feedback?

Ignoring the soundness issues I mentioned before (which I'm slowly chipping
away at, and you're likely lucky enough to avoid in practice)...

Having looked over the series, I think the GICv3 bit isn't quite right, but is
easy enough to fix. I've commented on the patch with what I think we should
have there.

The only major thing otherwise from my PoV is the structure of the debug IPI
framework. I'm not keen on that being a separate body of code and I think it
should live in smp.c along with the other IPIs. I'd also strongly prefer if we
could have separate IPI_CPU_BACKTRACE and IPI_CPU_KGDB IPIs, and I think we can
do that either by unifying IPI_CPU_STOP && IPI_CPU_CRASH_STOP or by reclaiming
IPI_WAKEUP by reusing a different IPI for the parking protocol (e.g.
IPI_RESCHEDULE).

I think it'd be nice if the series could enable NMIs for backtrace and the
CPU_{,CRASH_}STOP cases, with KGDB being the bonus atop. That way it'd be
clearly beneficial for anyone trying to debug lockups even if they're not a
KGDB user.

> Also: in case it's interesting to anyone, I've been doing benchmarks
> on sc7180-trogdor devices in preparation for enabling this. On that
> platform, I did manage to see about 4% reduction in a set of hackbench
> numbers when fully enabling pseudo-NMI. However, when I instead ran
> Speedometer 2.1 I saw no difference. See:
> 
> https://issuetracker.google.com/issues/197061987

Thanks for the pointer!

I know that there are a couple of things that we could do to slightly improve
local_irq_*() when using pNMIs, though I suspect that the bulk of the cost
there will come from the necessary synchronization.

Thanks,
Mark.
Sumit Garg Aug. 7, 2023, 12:46 p.m. UTC | #3
On Mon, 7 Aug 2023 at 16:11, Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Doug,
>
> Apologies for the delay.
>
> On Mon, Jul 24, 2023 at 08:55:44AM -0700, Doug Anderson wrote:
> > On Thu, Jun 1, 2023 at 2:37 PM Douglas Anderson <dianders@chromium.org> wrote:
> > I'm looking for some ideas on what to do to move this patch series
> > forward. Thanks to Daniel, the kgdb patch is now in Linus's tree which
> > hopefully makes this simpler to land. I guess there is still the
> > irqchip dependency that will need to be sorted out, though...
> >
> > Even if folks aren't in agreement about whether this is ready to be
> > enabled in production, I don't think anything here is super
> > objectionable or controversial, is it? Can we land it? If you feel
> > like it needs extra review, would it help if I tried to drum up some
> > extra people to provide review feedback?
>
> Ignoring the soundness issues I mentioned before (which I'm slowly chipping
> away at, and you're likely lucky enough to avoid in practice)...
>
> Having looked over the series, I think the GICv3 bit isn't quite right, but is
> easy enough to fix. I've commented on the patch with what I think we should
> have there.

Thanks for catching this and I agree with your proposed fix.

>
> The only major thing otherwise from my PoV is the structure of the debug IPI
> framework. I'm not keen on that being a separate body of code and I think it
> should live in smp.c along with the other IPIs.

That's a fair point.

> I'd also strongly prefer if we
> could have separate IPI_CPU_BACKTRACE and IPI_CPU_KGDB IPIs,

With current logic of single debug IPI, it is not required for a user
to enable KGDB in order to use that IPI for backtrace. The original
motivation for this logic was that the IPIs are a scarce resource on
arm64 as per comments from Marc. So I am fine either way to keep them
separate or unified.

> and I think we can
> do that either by unifying IPI_CPU_STOP && IPI_CPU_CRASH_STOP or by reclaiming
> IPI_WAKEUP by reusing a different IPI for the parking protocol (e.g.
> IPI_RESCHEDULE).

That sounds like a good cleanup.

>
> I think it'd be nice if the series could enable NMIs for backtrace and the
> CPU_{,CRASH_}STOP cases, with KGDB being the bonus atop. That way it'd be
> clearly beneficial for anyone trying to debug lockups even if they're not a
> KGDB user.
>

It's good to see other use-cases of IPIs turned into NMIs.

-Sumit

> > Also: in case it's interesting to anyone, I've been doing benchmarks
> > on sc7180-trogdor devices in preparation for enabling this. On that
> > platform, I did manage to see about 4% reduction in a set of hackbench
> > numbers when fully enabling pseudo-NMI. However, when I instead ran
> > Speedometer 2.1 I saw no difference. See:
> >
> > https://issuetracker.google.com/issues/197061987
>
> Thanks for the pointer!
>
> I know that there are a couple of things that we could do to slightly improve
> local_irq_*() when using pNMIs, though I suspect that the bulk of the cost
> there will come from the necessary synchronization.
>
> Thanks,
> Mark.
Mark Rutland Aug. 7, 2023, 2:43 p.m. UTC | #4
On Mon, Aug 07, 2023 at 06:16:24PM +0530, Sumit Garg wrote:
> On Mon, 7 Aug 2023 at 16:11, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Mon, Jul 24, 2023 at 08:55:44AM -0700, Doug Anderson wrote:
> > > On Thu, Jun 1, 2023 at 2:37 PM Douglas Anderson <dianders@chromium.org> wrote:
> > I'd also strongly prefer if we
> > could have separate IPI_CPU_BACKTRACE and IPI_CPU_KGDB IPIs,
> 
> With current logic of single debug IPI, it is not required for a user
> to enable KGDB in order to use that IPI for backtrace. The original
> motivation for this logic was that the IPIs are a scarce resource on
> arm64 as per comments from Marc. So I am fine either way to keep them
> separate or unified.
> 
> > and I think we can
> > do that either by unifying IPI_CPU_STOP && IPI_CPU_CRASH_STOP or by reclaiming
> > IPI_WAKEUP by reusing a different IPI for the parking protocol (e.g.
> > IPI_RESCHEDULE).
> 
> That sounds like a good cleanup.

I took a quick stab at removing IPI_WAKEUP, and it seems simple enough; patch
below.

Mark.

----<8----
From 267401824576c1dd3bb6d380142c92f710098b67 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>
Date: Mon, 7 Aug 2023 15:05:47 +0100
Subject: [PATCH] arm64: smp: Remove dedicated wakeup IPI

To enable NMI backtrace and KGDB's NMI cpu roundup, we need to free up
at least one dedicated IPI.

On arm64 the IPI_WAKEUP IPI is only used for the ACPI parking protocol,
which itself is only used on some very early ARMv8 systems which
couldn't implement PSCI.

Remove the IPI_WAKEUP IPI, and rely on the IPI_RESCHEDULE IPI to wake
CPUs from the parked state. This will cause a tiny amonut of redundant
work to check the thread flags, but this is miniscule in relation to the
cost of taking and handling the IPI in the first place. We can safely
handle redundant IPI_RESCHEDULE IPIs, so there should be no functional
impact as a result of this change.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Sumit Garg <sumit.garg@linaro.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/smp.h              |  4 ++--
 arch/arm64/kernel/acpi_parking_protocol.c |  2 +-
 arch/arm64/kernel/smp.c                   | 28 +++++++++--------------
 3 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index 9b31e6d0da174..798fd76a883a0 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -89,9 +89,9 @@ extern void arch_send_call_function_single_ipi(int cpu);
 extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
 
 #ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
-extern void arch_send_wakeup_ipi_mask(const struct cpumask *mask);
+extern void arch_send_wakeup_ipi(int cpu);
 #else
-static inline void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
+static inline void arch_send_wakeup_ipi(int cpu)
 {
 	BUILD_BUG();
 }
diff --git a/arch/arm64/kernel/acpi_parking_protocol.c b/arch/arm64/kernel/acpi_parking_protocol.c
index b1990e38aed0a..e1be29e608b71 100644
--- a/arch/arm64/kernel/acpi_parking_protocol.c
+++ b/arch/arm64/kernel/acpi_parking_protocol.c
@@ -103,7 +103,7 @@ static int acpi_parking_protocol_cpu_boot(unsigned int cpu)
 		       &mailbox->entry_point);
 	writel_relaxed(cpu_entry->gic_cpu_id, &mailbox->cpu_id);
 
-	arch_send_wakeup_ipi_mask(cpumask_of(cpu));
+	arch_send_wakeup_ipi(cpu);
 
 	return 0;
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index edd63894d61e8..be00c8c5c1711 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -72,7 +72,6 @@ enum ipi_msg_type {
 	IPI_CPU_CRASH_STOP,
 	IPI_TIMER,
 	IPI_IRQ_WORK,
-	IPI_WAKEUP,
 	NR_IPI
 };
 
@@ -764,7 +763,6 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
 	[IPI_CPU_CRASH_STOP]	= "CPU stop (for crash dump) interrupts",
 	[IPI_TIMER]		= "Timer broadcast interrupts",
 	[IPI_IRQ_WORK]		= "IRQ work interrupts",
-	[IPI_WAKEUP]		= "CPU wake-up interrupts",
 };
 
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr);
@@ -797,13 +795,6 @@ void arch_send_call_function_single_ipi(int cpu)
 	smp_cross_call(cpumask_of(cpu), IPI_CALL_FUNC);
 }
 
-#ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
-void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
-{
-	smp_cross_call(mask, IPI_WAKEUP);
-}
-#endif
-
 #ifdef CONFIG_IRQ_WORK
 void arch_irq_work_raise(void)
 {
@@ -897,14 +888,6 @@ static void do_handle_IPI(int ipinr)
 		break;
 #endif
 
-#ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
-	case IPI_WAKEUP:
-		WARN_ONCE(!acpi_parking_protocol_valid(cpu),
-			  "CPU%u: Wake-up IPI outside the ACPI parking protocol\n",
-			  cpu);
-		break;
-#endif
-
 	default:
 		pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr);
 		break;
@@ -979,6 +962,17 @@ void arch_smp_send_reschedule(int cpu)
 	smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
 }
 
+#ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
+void arch_send_wakeup_ipi(int cpu)
+{
+	/*
+	 * We use a scheduler IPI to wake the CPU as this avoids the need for a
+	 * dedicated IPI and we can safely handle spurious scheduler IPIs.
+	 */
+	arch_smp_send_reschedule(cpu);
+}
+#endif
+
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void tick_broadcast(const struct cpumask *mask)
 {