diff mbox series

ftrace: Have architectures opt-in for mcount build time sorting

Message ID 20220127114249.03b1b52b@gandalf.local.home (mailing list archive)
State New, archived
Headers show
Series ftrace: Have architectures opt-in for mcount build time sorting | expand

Commit Message

Steven Rostedt Jan. 27, 2022, 4:42 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

First S390 complained that the sorting of the mcount sections at build
time caused the kernel to crash on their architecture. Now PowerPC is
complaining about it too. And also ARM64 appears to be having issues.

It may be necessary to also update the relocation table for the values
in the mcount table. Not only do we have to sort the table, but also
update the relocations that may be applied to the items in the table.

If the system is not relocatable, then it is fine to sort, but if it is,
some architectures may have issues (although x86 does not as it shifts all
addresses the same).

Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
safe to do the sorting at build time.

Also update the config to compile in build time sorting in the sorttable
code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.

Link: https://lore.kernel.org/all/944D10DA-8200-4BA9-8D0A-3BED9AA99F82@linux.ibm.com/

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Yinan Liu <yinan@linux.alibaba.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linuxppc-dev@lists.ozlabs.org
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in ftrace_init")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/arm/Kconfig     | 1 +
 arch/x86/Kconfig     | 1 +
 kernel/trace/Kconfig | 8 +++++++-
 scripts/Makefile     | 2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)

Comments

Mark Rutland Jan. 27, 2022, 6:23 p.m. UTC | #1
On Thu, Jan 27, 2022 at 11:42:49AM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> First S390 complained that the sorting of the mcount sections at build
> time caused the kernel to crash on their architecture. Now PowerPC is
> complaining about it too. And also ARM64 appears to be having issues.
> 
> It may be necessary to also update the relocation table for the values
> in the mcount table. Not only do we have to sort the table, but also
> update the relocations that may be applied to the items in the table.
> 
> If the system is not relocatable, then it is fine to sort, but if it is,
> some architectures may have issues (although x86 does not as it shifts all
> addresses the same).
> 
> Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
> safe to do the sorting at build time.
> 
> Also update the config to compile in build time sorting in the sorttable
> code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.
> 
> Link: https://lore.kernel.org/all/944D10DA-8200-4BA9-8D0A-3BED9AA99F82@linux.ibm.com/
> 
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Yinan Liu <yinan@linux.alibaba.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Reported-by: Sachin Sant <sachinp@linux.ibm.com>
> Tested-by: Sachin Sant <sachinp@linux.ibm.com>
> Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in ftrace_init")
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Thanks for this; the rationale and patch makes sense to me. As previously, I
understand that:

* For arch/arm, the build-time sort should be safe as the 32-bit kernel isn't
  virtually relocatable, and so the sort affects the final values, and will not
  be clobbered later.

* For arch/x86, the build time sort should be safe as the mcount_loc will be
  initialazed with values consistent with the default load address, and the
  boot-time relocation will add the same offset to all values, so there's no
  need to sort the relocs.

So enabling this for arch/arm and arch/x86 makes sense to me.

Similarly, I understand that for arm64 the build-time sort isn't sound due to
not adjusting the relocations, and so it needs to be disabled there (for now at
least).

I gave this patch a spin on arm64 atop v5.17-rc1, using GCC 11.1.0 from the
kernel.org crosstool page:

  https://mirrors.edge.kernel.org/pub/tools/crosstool/

... with this applied, I'm no longer seeing a number of ftrace selftest
failures and ftrace internal bugs I previously reported at:

  https://lore.kernel.org/all/YfKGKWW5UfZ15kCW@FVFF77S0Q05N/

It looks like there's a trivial bit of whitespace damage in the patch (noted
below), but regardless:

  Reviewed-by: Mark Rutland <mark.rutland@arm.com>
  Tested-by: Mark Rutland <mark.rutlad@arm.com> [arm64]



As a heads-up, with this issue out of the way I'm hitting some unrelated issues
when running the ftrace selftests on arm64, which I'll look into. Quick summary
on those below, but I'll start new threads once I've got more detail.

* The duplicate events test seems to fail consistently:

  [15] Generic dynamic event - check if duplicate events are caught       [FAIL]

* I intermittently see a hang when running the tests. I previously hit that
  when originally trying to bisect this issue (and IIRC that bisected down to
  some RCU changes, but I need to re-run that). When the tests hang I
  magic-srsrq + L tells me:

  [  271.938438] sysrq: Show Blocked State
  [  271.939245] task:ftracetest      state:D stack:    0 pid: 5687 ppid:  5627 flags:0x00000200
  [  271.940961] Call trace:
  [  271.941472]  __switch_to+0x104/0x160
  [  271.942213]  __schedule+0x2b0/0x6e0
  [  271.942933]  schedule+0x5c/0xf0
  [  271.943586]  schedule_timeout+0x184/0x1c4
  [  271.944410]  wait_for_completion+0x8c/0x12c
  [  271.945274]  __wait_rcu_gp+0x184/0x190
  [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
  [  271.947007]  update_ftrace_function+0xa4/0xec
  [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
  [  271.948898]  unregister_ftrace_function+0x34/0x70
  [  271.949857]  wakeup_tracer_reset+0x4c/0x100
  [  271.950713]  tracing_set_tracer+0xd0/0x2b0
  [  271.951552]  tracing_set_trace_write+0xe8/0x150
  [  271.952477]  vfs_write+0xfc/0x284
  [  271.953171]  ksys_write+0x7c/0x110
  [  271.953874]  __arm64_sys_write+0x2c/0x40
  [  271.954678]  invoke_syscall+0x5c/0x130
  [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
  [  271.956435]  do_el0_svc+0x74/0x90
  [  271.957124]  el0_svc+0x2c/0x90
  [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
  [  271.958629]  el0t_64_sync+0x1a0/0x1a4

> ---
>  arch/arm/Kconfig     | 1 +
>  arch/x86/Kconfig     | 1 +
>  kernel/trace/Kconfig | 8 +++++++-
>  scripts/Makefile     | 2 +-
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index c2724d986fa0..5256ebe57451 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -82,6 +82,7 @@ config ARM
>  	select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
>  	select HAVE_CONTEXT_TRACKING
>  	select HAVE_C_RECORDMCOUNT
> +	select HAVE_BUILDTIME_MCOUNT_SORT
>  	select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
>  	select HAVE_DMA_CONTIGUOUS if MMU
>  	select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7399327d1eff..46080dea5dba 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -186,6 +186,7 @@ config X86
>  	select HAVE_CONTEXT_TRACKING_OFFSTACK	if HAVE_CONTEXT_TRACKING
>  	select HAVE_C_RECORDMCOUNT
>  	select HAVE_OBJTOOL_MCOUNT		if STACK_VALIDATION
> +	select HAVE_BUILDTIME_MCOUNT_SORT
>  	select HAVE_DEBUG_KMEMLEAK
>  	select HAVE_DMA_CONTIGUOUS
>  	select HAVE_DYNAMIC_FTRACE
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 752ed89a293b..7e5b92090faa 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
>  	help
>  	  C version of recordmcount available?
>  
> +config HAVE_BUILDTIME_MCOUNT_SORT
> +       bool
> +       help
> +         An architecture selects this if it sorts the mcount_loc section
> +	 at build time.

Whitespace damage here? The second line has a tab, the first doesn't.

Thanks,
Mark.

> +
>  config BUILDTIME_MCOUNT_SORT
>         bool
>         default y
> -       depends on BUILDTIME_TABLE_SORT && !S390
> +       depends on HAVE_BUILDTIME_MCOUNT_SORT
>         help
>           Sort the mcount_loc section at build time.
>  
> diff --git a/scripts/Makefile b/scripts/Makefile
> index b082d2f93357..cedc1f0e21d8 100644
> --- a/scripts/Makefile
> +++ b/scripts/Makefile
> @@ -32,7 +32,7 @@ HOSTCFLAGS_sorttable.o += -I$(srctree)/tools/arch/x86/include
>  HOSTCFLAGS_sorttable.o += -DUNWINDER_ORC_ENABLED
>  endif
>  
> -ifdef CONFIG_DYNAMIC_FTRACE
> +ifdef CONFIG_BUILDTIME_MCOUNT_SORT
>  HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
>  endif
>  
> -- 
> 2.33.0
>
Sven Schnelle Jan. 27, 2022, 6:42 p.m. UTC | #2
Hi Mark,

Mark Rutland <mark.rutland@arm.com> writes:

> * I intermittently see a hang when running the tests. I previously hit that
>   when originally trying to bisect this issue (and IIRC that bisected down to
>   some RCU changes, but I need to re-run that). When the tests hang I
>   magic-srsrq + L tells me:
>
>   [  271.938438] sysrq: Show Blocked State
>   [  271.939245] task:ftracetest      state:D stack:    0 pid: 5687 ppid:  5627 flags:0x00000200
>   [  271.940961] Call trace:
>   [  271.941472]  __switch_to+0x104/0x160
>   [  271.942213]  __schedule+0x2b0/0x6e0
>   [  271.942933]  schedule+0x5c/0xf0
>   [  271.943586]  schedule_timeout+0x184/0x1c4
>   [  271.944410]  wait_for_completion+0x8c/0x12c
>   [  271.945274]  __wait_rcu_gp+0x184/0x190
>   [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
>   [  271.947007]  update_ftrace_function+0xa4/0xec
>   [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
>   [  271.948898]  unregister_ftrace_function+0x34/0x70
>   [  271.949857]  wakeup_tracer_reset+0x4c/0x100
>   [  271.950713]  tracing_set_tracer+0xd0/0x2b0
>   [  271.951552]  tracing_set_trace_write+0xe8/0x150
>   [  271.952477]  vfs_write+0xfc/0x284
>   [  271.953171]  ksys_write+0x7c/0x110
>   [  271.953874]  __arm64_sys_write+0x2c/0x40
>   [  271.954678]  invoke_syscall+0x5c/0x130
>   [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
>   [  271.956435]  do_el0_svc+0x74/0x90
>   [  271.957124]  el0_svc+0x2c/0x90
>   [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
>   [  271.958629]  el0t_64_sync+0x1a0/0x1a4

that's interesting. On s390 i'm seeing the same problem in CI, but with
the startup ftrace tests. So that's likely not arm64 spacific.

On s390, the last messages from ftrace are [    5.663568] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    5.667099] futex hash table entries: 65536 (order: 12, 16777216 bytes, vmalloc)
[    5.739549] Running postponed tracer tests:
[    5.740662] Testing tracer function: PASSED
[    6.194635] Testing dynamic ftrace: PASSED
[    6.471213] Testing dynamic ftrace ops #1: 
[    6.558445] (1 0 1 0 0) 
[    6.558458] (1 1 2 0 0) 
[    6.699135] (2 1 3 0 764347) 
[    6.699252] (2 2 4 0 766466) 
[    6.759857] (3 2 4 0 1159604)
[..] hangs here

The backtrace looks like this, which is very similar to the one above:

crash> bt 1
PID: 1      TASK: 80e68100          CPU: 133  COMMAND: "swapper/0"
 #0 [380004df808] __schedule at cda39f0e
 #1 [380004df880] schedule at cda3a488
 #2 [380004df8b0] schedule_timeout at cda41ef6
 #3 [380004df978] wait_for_completion at cda3bd0a
 #4 [380004df9d8] __wait_rcu_gp at ccdddd92
 #5 [380004dfa30] synchronize_rcu_tasks_generic at ccdde0aa
 #6 [380004dfad8] ftrace_shutdown at cce7b050
 #7 [380004dfb18] unregister_ftrace_function at cce7b192
 #8 [380004dfb50] trace_selftest_ops at cda1e0fa
 #9 [380004dfba0] run_tracer_selftest at cda1e4f2
#10 [380004dfc00] trace_selftest_startup_function at ce74355c
#11 [380004dfc58] run_tracer_selftest at cda1e2fc
#12 [380004dfc98] init_trace_selftests at ce742d30
#13 [380004dfcd0] do_one_initcall at cccdca16
#14 [380004dfd68] do_initcalls at ce72e776
#15 [380004dfde0] kernel_init_freeable at ce72ea60
#16 [380004dfe50] kernel_init at cda333fe
#17 [380004dfe68] __ret_from_fork at cccdf920
#18 [380004dfe98] ret_from_fork at cda444ca

I didn't had success reproducing it so far, but it is good to know that
this also happens when running the ftrace testsuite.

I have several crashdumps, so i could try to pull out some information
if someone tells me what to look for.

Thanks,
Sven
Mark Rutland Jan. 28, 2022, 3:42 p.m. UTC | #3
Hi Sven,

On Thu, Jan 27, 2022 at 07:42:35PM +0100, Sven Schnelle wrote:
> Mark Rutland <mark.rutland@arm.com> writes:
> 
> > * I intermittently see a hang when running the tests. I previously hit that
> >   when originally trying to bisect this issue (and IIRC that bisected down to
> >   some RCU changes, but I need to re-run that). When the tests hang I
> >   magic-srsrq + L tells me:
> >
> >   [  271.938438] sysrq: Show Blocked State
> >   [  271.939245] task:ftracetest      state:D stack:    0 pid: 5687 ppid:  5627 flags:0x00000200
> >   [  271.940961] Call trace:
> >   [  271.941472]  __switch_to+0x104/0x160
> >   [  271.942213]  __schedule+0x2b0/0x6e0
> >   [  271.942933]  schedule+0x5c/0xf0
> >   [  271.943586]  schedule_timeout+0x184/0x1c4
> >   [  271.944410]  wait_for_completion+0x8c/0x12c
> >   [  271.945274]  __wait_rcu_gp+0x184/0x190
> >   [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
> >   [  271.947007]  update_ftrace_function+0xa4/0xec
> >   [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
> >   [  271.948898]  unregister_ftrace_function+0x34/0x70
> >   [  271.949857]  wakeup_tracer_reset+0x4c/0x100
> >   [  271.950713]  tracing_set_tracer+0xd0/0x2b0
> >   [  271.951552]  tracing_set_trace_write+0xe8/0x150
> >   [  271.952477]  vfs_write+0xfc/0x284
> >   [  271.953171]  ksys_write+0x7c/0x110
> >   [  271.953874]  __arm64_sys_write+0x2c/0x40
> >   [  271.954678]  invoke_syscall+0x5c/0x130
> >   [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
> >   [  271.956435]  do_el0_svc+0x74/0x90
> >   [  271.957124]  el0_svc+0x2c/0x90
> >   [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
> >   [  271.958629]  el0t_64_sync+0x1a0/0x1a4

On arm64 I bisected this down to:

  7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")

Which was going wrong because ilog2() rounds down, and so the shift was wrong
for any nr_cpus that was not a power-of-two. Paul had already fixed that in
rcu-next, and just sent a pull request to Linus:

  https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/

With that applied, I no longer see these hangs.

Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
the issue for you?

On arm64 the startup tests didn't seem to trigger the hang, but I was able to
trigger the hang fairly reliably with the ftrace selftests, e.g.

  $ for N in $(seq 1 10); do ./ftracetest test.d/00basic/basic2.tc; done

... which prior to the fix, would hang between runs 2 to 5.

Thanks,
Mark.

> that's interesting. On s390 i'm seeing the same problem in CI, but with
> the startup ftrace tests. So that's likely not arm64 spacific.
> 
> On s390, the last messages from ftrace are [    5.663568] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
> [    5.667099] futex hash table entries: 65536 (order: 12, 16777216 bytes, vmalloc)
> [    5.739549] Running postponed tracer tests:
> [    5.740662] Testing tracer function: PASSED
> [    6.194635] Testing dynamic ftrace: PASSED
> [    6.471213] Testing dynamic ftrace ops #1: 
> [    6.558445] (1 0 1 0 0) 
> [    6.558458] (1 1 2 0 0) 
> [    6.699135] (2 1 3 0 764347) 
> [    6.699252] (2 2 4 0 766466) 
> [    6.759857] (3 2 4 0 1159604)
> [..] hangs here
> 
> The backtrace looks like this, which is very similar to the one above:
> 
> crash> bt 1
> PID: 1      TASK: 80e68100          CPU: 133  COMMAND: "swapper/0"
>  #0 [380004df808] __schedule at cda39f0e
>  #1 [380004df880] schedule at cda3a488
>  #2 [380004df8b0] schedule_timeout at cda41ef6
>  #3 [380004df978] wait_for_completion at cda3bd0a
>  #4 [380004df9d8] __wait_rcu_gp at ccdddd92
>  #5 [380004dfa30] synchronize_rcu_tasks_generic at ccdde0aa
>  #6 [380004dfad8] ftrace_shutdown at cce7b050
>  #7 [380004dfb18] unregister_ftrace_function at cce7b192
>  #8 [380004dfb50] trace_selftest_ops at cda1e0fa
>  #9 [380004dfba0] run_tracer_selftest at cda1e4f2
> #10 [380004dfc00] trace_selftest_startup_function at ce74355c
> #11 [380004dfc58] run_tracer_selftest at cda1e2fc
> #12 [380004dfc98] init_trace_selftests at ce742d30
> #13 [380004dfcd0] do_one_initcall at cccdca16
> #14 [380004dfd68] do_initcalls at ce72e776
> #15 [380004dfde0] kernel_init_freeable at ce72ea60
> #16 [380004dfe50] kernel_init at cda333fe
> #17 [380004dfe68] __ret_from_fork at cccdf920
> #18 [380004dfe98] ret_from_fork at cda444ca
> 
> I didn't had success reproducing it so far, but it is good to know that
> this also happens when running the ftrace testsuite.
> 
> I have several crashdumps, so i could try to pull out some information
> if someone tells me what to look for.
> 
> Thanks,
> Sven
Sven Schnelle Jan. 28, 2022, 4:08 p.m. UTC | #4
Hi Mark,

Mark Rutland <mark.rutland@arm.com> writes:

> On arm64 I bisected this down to:
>
>   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
>
> Which was going wrong because ilog2() rounds down, and so the shift was wrong
> for any nr_cpus that was not a power-of-two. Paul had already fixed that in
> rcu-next, and just sent a pull request to Linus:
>
>   https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
>
> With that applied, I no longer see these hangs.
>
> Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
> the issue for you?

We noticed the PR from Paul and are currently testing the fix. So far
it's looking good. The configuration where we have seen the hang is a
bit unusual:

- 16 physical CPUs on the kvm host
- 248 logical CPUs inside kvm
- debug kernel both on the host and kvm guest

So things are likely a bit slow in the kvm guest. Interesting is that
the number of CPUs is even. But maybe RCU sees an odd number of CPUs
and gets confused before all cpus are brought up. Have to read code/test
to see whether that could be possible.

Thanks for investigating!
Sven
Mark Rutland Jan. 28, 2022, 4:11 p.m. UTC | #5
On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> Hi Mark,
> 
> Mark Rutland <mark.rutland@arm.com> writes:
> 
> > On arm64 I bisected this down to:
> >
> >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
> >
> > Which was going wrong because ilog2() rounds down, and so the shift was wrong
> > for any nr_cpus that was not a power-of-two. Paul had already fixed that in
> > rcu-next, and just sent a pull request to Linus:
> >
> >   https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> >
> > With that applied, I no longer see these hangs.
> >
> > Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
> > the issue for you?
> 
> We noticed the PR from Paul and are currently testing the fix. So far
> it's looking good. The configuration where we have seen the hang is a
> bit unusual:
> 
> - 16 physical CPUs on the kvm host
> - 248 logical CPUs inside kvm

Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
wrong (ilog2() would give 7, when we need a shift of 8).

So I suspect you're hitting the same issue as I was.

Thanks,
Mark.

> - debug kernel both on the host and kvm guest
> 
> So things are likely a bit slow in the kvm guest. Interesting is that
> the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> and gets confused before all cpus are brought up. Have to read code/test
> to see whether that could be possible.
> 
> Thanks for investigating!
> Sven
Paul E. McKenney Jan. 28, 2022, 4:15 p.m. UTC | #6
On Fri, Jan 28, 2022 at 04:11:57PM +0000, Mark Rutland wrote:
> On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> > Hi Mark,
> > 
> > Mark Rutland <mark.rutland@arm.com> writes:
> > 
> > > On arm64 I bisected this down to:
> > >
> > >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
> > >
> > > Which was going wrong because ilog2() rounds down, and so the shift was wrong
> > > for any nr_cpus that was not a power-of-two. Paul had already fixed that in
> > > rcu-next, and just sent a pull request to Linus:
> > >
> > >   https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> > >
> > > With that applied, I no longer see these hangs.
> > >
> > > Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
> > > the issue for you?
> > 
> > We noticed the PR from Paul and are currently testing the fix. So far
> > it's looking good. The configuration where we have seen the hang is a
> > bit unusual:
> > 
> > - 16 physical CPUs on the kvm host
> > - 248 logical CPUs inside kvm
> 
> Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
> wrong (ilog2() would give 7, when we need a shift of 8).
> 
> So I suspect you're hitting the same issue as I was.

And apparently no one runs -next on systems having a non-power-of-two
number of CPUs.  ;-)

							Thanx, Paul

> Thanks,
> Mark.
> 
> > - debug kernel both on the host and kvm guest
> > 
> > So things are likely a bit slow in the kvm guest. Interesting is that
> > the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> > and gets confused before all cpus are brought up. Have to read code/test
> > to see whether that could be possible.
> > 
> > Thanks for investigating!
> > Sven
Sven Schnelle Jan. 28, 2022, 4:17 p.m. UTC | #7
Hi Mark,

Mark Rutland <mark.rutland@arm.com> writes:

> On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
>> We noticed the PR from Paul and are currently testing the fix. So far
>> it's looking good. The configuration where we have seen the hang is a
>> bit unusual:
>> 
>> - 16 physical CPUs on the kvm host
>> - 248 logical CPUs inside kvm
>
> Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
> wrong (ilog2() would give 7, when we need a shift of 8).
>
> So I suspect you're hitting the same issue as I was.

Argh, indeed! I somehow changed 'power of two' to 'odd number' in my
head. I guess it's time for the weekend. :-)

Thanks!
Paul E. McKenney Jan. 28, 2022, 5:47 p.m. UTC | #8
On Fri, Jan 28, 2022 at 08:15:47AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 28, 2022 at 04:11:57PM +0000, Mark Rutland wrote:
> > On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> > > Hi Mark,
> > > 
> > > Mark Rutland <mark.rutland@arm.com> writes:
> > > 
> > > > On arm64 I bisected this down to:
> > > >
> > > >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
> > > >
> > > > Which was going wrong because ilog2() rounds down, and so the shift was wrong
> > > > for any nr_cpus that was not a power-of-two. Paul had already fixed that in
> > > > rcu-next, and just sent a pull request to Linus:
> > > >
> > > >   https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> > > >
> > > > With that applied, I no longer see these hangs.
> > > >
> > > > Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
> > > > the issue for you?
> > > 
> > > We noticed the PR from Paul and are currently testing the fix. So far
> > > it's looking good. The configuration where we have seen the hang is a
> > > bit unusual:
> > > 
> > > - 16 physical CPUs on the kvm host
> > > - 248 logical CPUs inside kvm
> > 
> > Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
> > wrong (ilog2() would give 7, when we need a shift of 8).
> > 
> > So I suspect you're hitting the same issue as I was.
> 
> And apparently no one runs -next on systems having a non-power-of-two
> number of CPUs.  ;-)

And the fix is now in mainline.

							Thanx, Paul

> > Thanks,
> > Mark.
> > 
> > > - debug kernel both on the host and kvm guest
> > > 
> > > So things are likely a bit slow in the kvm guest. Interesting is that
> > > the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> > > and gets confused before all cpus are brought up. Have to read code/test
> > > to see whether that could be possible.
> > > 
> > > Thanks for investigating!
> > > Sven
Joe Lawrence Jan. 28, 2022, 9:11 p.m. UTC | #9
On Thu, Jan 27, 2022 at 11:42:49AM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> First S390 complained that the sorting of the mcount sections at build
> time caused the kernel to crash on their architecture. Now PowerPC is
> complaining about it too. And also ARM64 appears to be having issues.
>
> It may be necessary to also update the relocation table for the values
> in the mcount table. Not only do we have to sort the table, but also
> update the relocations that may be applied to the items in the table.
>
> If the system is not relocatable, then it is fine to sort, but if it is,
> some architectures may have issues (although x86 does not as it shifts all
> addresses the same).
>
> Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
> safe to do the sorting at build time.
>
> Also update the config to compile in build time sorting in the sorttable
> code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.
>
> Link: https://lore.kernel.org/all/944D10DA-8200-4BA9-8D0A-3BED9AA99F82@linux.ibm.com/
>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Yinan Liu <yinan@linux.alibaba.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Reported-by: Sachin Sant <sachinp@linux.ibm.com>
> Tested-by: Sachin Sant <sachinp@linux.ibm.com>
> Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in ftrace_init")
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  arch/arm/Kconfig     | 1 +
>  arch/x86/Kconfig     | 1 +
>  kernel/trace/Kconfig | 8 +++++++-
>  scripts/Makefile     | 2 +-
>  4 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index c2724d986fa0..5256ebe57451 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -82,6 +82,7 @@ config ARM
>  	select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
>  	select HAVE_CONTEXT_TRACKING
>  	select HAVE_C_RECORDMCOUNT
> +	select HAVE_BUILDTIME_MCOUNT_SORT
>  	select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
>  	select HAVE_DMA_CONTIGUOUS if MMU
>  	select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7399327d1eff..46080dea5dba 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -186,6 +186,7 @@ config X86
>  	select HAVE_CONTEXT_TRACKING_OFFSTACK	if HAVE_CONTEXT_TRACKING
>  	select HAVE_C_RECORDMCOUNT
>  	select HAVE_OBJTOOL_MCOUNT		if STACK_VALIDATION
> +	select HAVE_BUILDTIME_MCOUNT_SORT
>  	select HAVE_DEBUG_KMEMLEAK
>  	select HAVE_DMA_CONTIGUOUS
>  	select HAVE_DYNAMIC_FTRACE
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 752ed89a293b..7e5b92090faa 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
>  	help
>  	  C version of recordmcount available?
>
> +config HAVE_BUILDTIME_MCOUNT_SORT
> +       bool
> +       help
> +         An architecture selects this if it sorts the mcount_loc section
> +	 at build time.
> +
>  config BUILDTIME_MCOUNT_SORT
>         bool
>         default y
> -       depends on BUILDTIME_TABLE_SORT && !S390
> +       depends on HAVE_BUILDTIME_MCOUNT_SORT
>         help
>           Sort the mcount_loc section at build time.
>
> diff --git a/scripts/Makefile b/scripts/Makefile
> index b082d2f93357..cedc1f0e21d8 100644
> --- a/scripts/Makefile
> +++ b/scripts/Makefile
> @@ -32,7 +32,7 @@ HOSTCFLAGS_sorttable.o += -I$(srctree)/tools/arch/x86/include
>  HOSTCFLAGS_sorttable.o += -DUNWINDER_ORC_ENABLED
>  endif
>
> -ifdef CONFIG_DYNAMIC_FTRACE
> +ifdef CONFIG_BUILDTIME_MCOUNT_SORT
>  HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
>  endif
>
> --
> 2.33.0
>

Hi Steve,

I just finished bisecting what is probably the same problem... when
running the livepatch selftests for 5.17-rc1, x86_64 passes, but I kept
getting errors like this on ppc64le:

  kernel: livepatch: enabling patch 'test_klp_livepatch'
  kernel: livepatch: failed to find location for function 'cmdline_proc_show'
  kernel: livepatch: failed to patch object 'vmlinux'
  kernel: livepatch: failed to enable patch 'test_klp_livepatch'
  kernel: livepatch: 'test_klp_livepatch': unpatching complete

which means klp_get_ftrace_location() / ftrace_location_range() hit a
problem with that function.

The bisect finally landed on:

  72b3942a173c387b27860ba1069636726e208777 is the first bad commit
  commit 72b3942a173c387b27860ba1069636726e208777
  Author: Yinan Liu <yinan@linux.alibaba.com>
  Date:   Sun Dec 12 19:33:58 2021 +0800

      scripts: ftrace - move the sort-processing in ftrace_init

and I can confirm that your updates today in "[for-linus][PATCH 00/10]
tracing: Fixes for 5.17-rc1" fix or avoid the issue.  I just wanted to
add my report in case this adds any future complications for mcount
build time sorting.  Let me know if any additional tests would be
helpful.

Regards,

-- Joe
Steven Rostedt Jan. 28, 2022, 9:39 p.m. UTC | #10
On Fri, 28 Jan 2022 16:11:39 -0500
Joe Lawrence <joe.lawrence@redhat.com> wrote:

> The bisect finally landed on:
> 
>   72b3942a173c387b27860ba1069636726e208777 is the first bad commit
>   commit 72b3942a173c387b27860ba1069636726e208777
>   Author: Yinan Liu <yinan@linux.alibaba.com>
>   Date:   Sun Dec 12 19:33:58 2021 +0800
> 
>       scripts: ftrace - move the sort-processing in ftrace_init
> 
> and I can confirm that your updates today in "[for-linus][PATCH 00/10]
> tracing: Fixes for 5.17-rc1" fix or avoid the issue.  I just wanted to
> add my report in case this adds any future complications for mcount
> build time sorting.  Let me know if any additional tests would be
> helpful.

Thanks for letting me know. That patch set has already landed in Linus's
tree.


-- Steve
diff mbox series

Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index c2724d986fa0..5256ebe57451 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -82,6 +82,7 @@  config ARM
 	select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
+	select HAVE_BUILDTIME_MCOUNT_SORT
 	select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
 	select HAVE_DMA_CONTIGUOUS if MMU
 	select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7399327d1eff..46080dea5dba 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -186,6 +186,7 @@  config X86
 	select HAVE_CONTEXT_TRACKING_OFFSTACK	if HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
 	select HAVE_OBJTOOL_MCOUNT		if STACK_VALIDATION
+	select HAVE_BUILDTIME_MCOUNT_SORT
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 752ed89a293b..7e5b92090faa 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -70,10 +70,16 @@  config HAVE_C_RECORDMCOUNT
 	help
 	  C version of recordmcount available?
 
+config HAVE_BUILDTIME_MCOUNT_SORT
+       bool
+       help
+         An architecture selects this if it sorts the mcount_loc section
+	 at build time.
+
 config BUILDTIME_MCOUNT_SORT
        bool
        default y
-       depends on BUILDTIME_TABLE_SORT && !S390
+       depends on HAVE_BUILDTIME_MCOUNT_SORT
        help
          Sort the mcount_loc section at build time.
 
diff --git a/scripts/Makefile b/scripts/Makefile
index b082d2f93357..cedc1f0e21d8 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -32,7 +32,7 @@  HOSTCFLAGS_sorttable.o += -I$(srctree)/tools/arch/x86/include
 HOSTCFLAGS_sorttable.o += -DUNWINDER_ORC_ENABLED
 endif
 
-ifdef CONFIG_DYNAMIC_FTRACE
+ifdef CONFIG_BUILDTIME_MCOUNT_SORT
 HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
 endif