Message ID: 20221019200137.70343645@gandalf.local.home (mailing list archive)
State:      Accepted
Series:     tracing: Add trace_trigger kernel command line option
On Wed, 19 Oct 2022 20:01:37 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> @@ -90,6 +90,10 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
>  #ifdef CONFIG_TRACEPOINTS
>  static inline void tracepoint_synchronize_unregister(void)
>  {
> +	/* Early updates do not need synchronization */
> +	if (early_boot_irqs_disabled)
> +		return;
> +
>  	synchronize_srcu(&tracepoint_srcu);
>  	synchronize_rcu();

I wonder if this check should be just moved to the RCU synchronization
code? That is, if early_boot_irqs_disabled is set, do nothing, as there
should be nothing to synchronize against.

It took me a bit to figure out why I was constantly getting:

[    1.072013] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    1.078888] ------------[ cut here ]------------
[    1.083422] Interrupts were enabled early
[    1.087399] WARNING: CPU: 0 PID: 0 at init/main.c:1061 start_kernel+0x36f/0x4e8
[    1.094656] Modules linked in:
[    1.097683] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-rc1-test+ #461
[    1.104509] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
[    1.113411] RIP: 0010:start_kernel+0x36f/0x4e8
[    1.117817] Code: 48 27 5f e8 fd 7b 03 00 e8 55 72 e5 fe e8 6b 05 03 00 ff 15 2a f9 89 ff 0f ba e0 09 73 0e 48 c7 c7 98 26 31 a0 e8 01 5b df fe <0f> 0b c6 05 c7 2c af ff 00 e8 da 36 47 fe ff 15 14 f9 89 ff e8 56
[    1.136485] RSP: 0000:ffffffffa0603f30 EFLAGS: 00010286
[    1.141670] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    1.148754] RDX: 0000000000000002 RSI: 00000000ffffdfff RDI: 0000000000000001
[    1.155840] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffa0603dd8
[    1.162926] R10: 0000000000000003 R11: ffffffffa0734168 R12: 000000000000180a
[    1.170011] R13: 00000000010fecf0 R14: 0000000000000000 R15: 00000000d44e3018
[    1.177097] FS:  0000000000000000(0000) GS:ffff91909aa00000(0000) knlGS:0000000000000000
[    1.185131] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.190835] CR2: ffff91909edff000 CR3: 000000007760a001 CR4: 00000000000606f0
[    1.197924] Call Trace:
[    1.200345]  <TASK>

It was a bit of whack-a-mole in finding all the places that were calling
synchronize RCU.

-- Steve
Hi Steven,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.1-rc1 next-20221020]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Add-trace_trigger-kernel-command-line-option/20221020-102230
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221019200137.70343645%40gandalf.local.home
patch subject: [PATCH] tracing: Add trace_trigger kernel command line option
config: i386-randconfig-a001
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/476ab107174fa2ac06ef7171b093235382008c77
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Steven-Rostedt/tracing-Add-trace_trigger-kernel-command-line-option/20221020-102230
        git checkout 476ab107174fa2ac06ef7171b093235382008c77
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash kernel/trace/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel/trace/trace_events.c: In function '__trace_early_add_new_event':
>> kernel/trace/trace_events.c:2864:13: warning: unused variable 'i' [-Wunused-variable]
    2864 |         int i;
         |             ^

vim +/i +2864 kernel/trace/trace_events.c

  2852	
  2853	/*
  2854	 * Just create a descriptor for early init. A descriptor is required
  2855	 * for enabling events at boot. We want to enable events before
  2856	 * the filesystem is initialized.
  2857	 */
  2858	static int
  2859	__trace_early_add_new_event(struct trace_event_call *call,
  2860	                            struct trace_array *tr)
  2861	{
  2862	        struct trace_event_file *file;
  2863	        int ret;
> 2864	        int i;
  2865	
  2866	        file = trace_create_new_event(call, tr);
  2867	        if (!file)
  2868	                return -ENOMEM;
  2869	
  2870	        ret = event_define_fields(call);
  2871	        if (ret)
  2872	                return ret;
  2873	
Hi Steven,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.1-rc1 next-20221020]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Steven-Rostedt/tracing-Add-trace_trigger-kernel-command-line-option/20221020-102230
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221019200137.70343645%40gandalf.local.home
patch subject: [PATCH] tracing: Add trace_trigger kernel command line option
config: s390-allmodconfig
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/476ab107174fa2ac06ef7171b093235382008c77
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Steven-Rostedt/tracing-Add-trace_trigger-kernel-command-line-option/20221020-102230
        git checkout 476ab107174fa2ac06ef7171b093235382008c77
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "early_boot_irqs_disabled" [mm/kasan/kasan_test.ko] undefined!
>> ERROR: modpost: "early_boot_irqs_disabled" [mm/kfence/kfence_test.ko] undefined!
   ERROR: modpost: "devm_ioremap_resource" [drivers/dma/qcom/hdma.ko] undefined!
   ERROR: modpost: "devm_ioremap_resource" [drivers/dma/fsl-edma.ko] undefined!
   ERROR: modpost: "devm_ioremap_resource" [drivers/dma/idma64.ko] undefined!
   ERROR: modpost: "iounmap" [drivers/tty/ipwireless/ipwireless.ko] undefined!
   ERROR: modpost: "ioremap" [drivers/tty/ipwireless/ipwireless.ko] undefined!
   ERROR: modpost: "devm_platform_ioremap_resource" [drivers/char/xillybus/xillybus_of.ko] undefined!
   ERROR: modpost: "devm_memremap" [drivers/misc/open-dice.ko] undefined!
   ERROR: modpost: "devm_memunmap" [drivers/misc/open-dice.ko] undefined!
   WARNING: modpost: suppressed 20 unresolved symbol warnings because there were too many
On Wed, 19 Oct 2022 20:01:37 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> Allow triggers to be enabled at kernel boot up. For example:
>
>   trace_trigger="sched_switch.stacktrace if prev_state == 2"
>
> The above will enable the stacktrace trigger on top of the sched_switch
> event and only trigger if its prev_state is 2 (TASK_UNINTERRUPTIBLE). Then
> at boot up, a stacktrace will trigger and be recorded in the tracing ring
> buffer every time the sched_switch happens where the previous state is
> TASK_INTERRUPTIBLE.
>
> As this calls into tracepoint logic during very early boot (before
> interrupts are enabled), a check has to be done to see if early boot
> interrupts are still disabled, and if so, avoid any call to RCU
> synchronization, as that will enable interrupts and cause boot up issues.

Just out of curiosity, can you do it by boot-time tracer?
(Is it too late for your issue?)

$ cat >> stacktrace.bconf
ftrace.event.sched.sched_switch.actions = "stacktrace if prev_state == 2"
^D
$ bootconfig -a stacktrace.bconf initrd.img

And boot kernel with "bootconfig". Then,

----------
           init-1       [000] d..3.     0.546668: <stack trace>
 => trace_event_raw_event_sched_switch
 => __traceiter_sched_switch
 => __schedule
 => schedule
 => schedule_timeout
 => wait_for_completion_killable
 => __kthread_create_on_node
 => kthread_create_on_node
 => audit_init
 => do_one_initcall
 => kernel_init_freeable
 => kernel_init
 => ret_from_fork
        kauditd-57      [007] d..3.     0.546677: <stack trace>
 => trace_event_raw_event_sched_switch
 => __traceiter_sched_switch
 => __schedule
 => schedule
 => schedule_preempt_disabled
 => kthread
 => ret_from_fork
----------

Thank you,

> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  .../admin-guide/kernel-parameters.txt | 19 ++++++
>  include/linux/tracepoint.h            |  4 ++
>  kernel/trace/trace.c                  |  3 +-
>  kernel/trace/trace_events.c           | 63 ++++++++++++++++++-
>  kernel/tracepoint.c                   |  6 ++
>  5 files changed, 92 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a465d5242774..ccf91a4bf113 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6257,6 +6257,25 @@
>  			See also Documentation/trace/ftrace.rst "trace options"
>  			section.
>  
> +	trace_trigger=[trigger-list]
> +			[FTRACE] Add a event trigger on specific events.
> +			Set a trigger on top of a specific event, with an optional
> +			filter.
> +
> +			The format is is "trace_trigger=<event>.<trigger>[ if <filter>],..."
> +			Where more than one trigger may be specified that are comma deliminated.
> +
> +			For example:
> +
> +			  trace_trigger="sched_switch.stacktrace if prev_state == 2"
> +
> +			The above will enable the "stacktrace" trigger on the "sched_switch"
> +			event but only trigger it if the "prev_state" of the "sched_switch"
> +			event is "2" (TASK_UNINTERUPTIBLE).
> +
> +			See also "Event triggers" in Documentation/trace/events.rst
> +
> +
>  	traceoff_on_warning
>  			[FTRACE] enable this option to disable tracing when a
>  			warning is hit. This turns off "tracing_on". Tracing can
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 4b33b95eb8be..a5c6b5772897 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -90,6 +90,10 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
>  #ifdef CONFIG_TRACEPOINTS
>  static inline void tracepoint_synchronize_unregister(void)
>  {
> +	/* Early updates do not need synchronization */
> +	if (early_boot_irqs_disabled)
> +		return;
> +
>  	synchronize_srcu(&tracepoint_srcu);
>  	synchronize_rcu();
>  }
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 47a44b055a1d..c03fd7037add 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2749,7 +2749,8 @@ void trace_buffered_event_disable(void)
>  	preempt_enable();
>  
>  	/* Wait for all current users to finish */
> -	synchronize_rcu();
> +	if (!early_boot_irqs_disabled)
> +		synchronize_rcu();
>  
>  	for_each_tracing_cpu(cpu) {
>  		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 0356cae0cf74..06554939252c 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -2796,6 +2796,44 @@ trace_create_new_event(struct trace_event_call *call,
>  	return file;
>  }
>  
> +#ifdef CONFIG_HIST_TRIGGERS
> +#define MAX_BOOT_TRIGGERS 32
> +
> +static struct boot_triggers {
> +	const char	*event;
> +	char		*trigger;
> +} bootup_triggers[MAX_BOOT_TRIGGERS];
> +
> +static char bootup_trigger_buf[COMMAND_LINE_SIZE];
> +static int nr_boot_triggers;
> +
> +static __init int setup_trace_triggers(char *str)
> +{
> +	char *trigger;
> +	char *buf;
> +	int i;
> +
> +	strlcpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
> +	ring_buffer_expanded = true;
> +	disable_tracing_selftest("running event triggers");
> +
> +	buf = bootup_trigger_buf;
> +	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
> +		trigger = strsep(&buf, ",");
> +		if (!trigger)
> +			break;
> +		bootup_triggers[i].event = strsep(&trigger, ".");
> +		bootup_triggers[i].trigger = strsep(&trigger, ".");
> +		if (!bootup_triggers[i].trigger)
> +			break;
> +	}
> +
> +	nr_boot_triggers = i;
> +	return 1;
> +}
> +__setup("trace_trigger=", setup_trace_triggers);
> +#endif
> +
>  /* Add an event to a trace directory */
>  static int
>  __trace_add_new_event(struct trace_event_call *call, struct trace_array *tr)
> @@ -2822,12 +2860,32 @@ __trace_early_add_new_event(struct trace_event_call *call,
>  			    struct trace_array *tr)
>  {
>  	struct trace_event_file *file;
> +	int ret;
> +	int i;
>  
>  	file = trace_create_new_event(call, tr);
>  	if (!file)
>  		return -ENOMEM;
>  
> -	return event_define_fields(call);
> +	ret = event_define_fields(call);
> +	if (ret)
> +		return ret;
> +
> +#ifdef CONFIG_HIST_TRIGGERS
> +	for (i = 0; i < nr_boot_triggers; i++) {
> +		if (strcmp(trace_event_name(call), bootup_triggers[i].event))
> +			continue;
> +		mutex_lock(&event_mutex);
> +		ret = trigger_process_regex(file, bootup_triggers[i].trigger);
> +		mutex_unlock(&event_mutex);
> +		if (ret)
> +			pr_err("Failed to register trigger '%s' on event %s\n",
> +			       bootup_triggers[i].trigger,
> +			       bootup_triggers[i].event);
> +	}
> +#endif
> +
> +	return 0;
>  }
>  
>  struct ftrace_module_file_ops;
> @@ -3726,6 +3784,8 @@ static __init int event_trace_enable(void)
>  		list_add(&call->list, &ftrace_events);
>  	}
>  
> +	register_trigger_cmds();
> +
>  	/*
>  	 * We need the top trace array to have a working set of trace
>  	 * points at early init, before the debug files and directories
> @@ -3740,7 +3800,6 @@ static __init int event_trace_enable(void)
>  
>  	register_event_cmds();
>  
> -	register_trigger_cmds();
>  
>  	return 0;
>  }
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index f23144af5743..f6e4ee1e40b3 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -48,6 +48,9 @@ static void tp_rcu_get_state(enum tp_transition_sync sync)
>  {
>  	struct tp_transition_snapshot *snapshot = &tp_transition_snapshot[sync];
>  
> +	if (early_boot_irqs_disabled)
> +		return;
> +
>  	/* Keep the latest get_state snapshot. */
>  	snapshot->rcu = get_state_synchronize_rcu();
>  	snapshot->srcu = start_poll_synchronize_srcu(&tracepoint_srcu);
> @@ -58,6 +61,9 @@ static void tp_rcu_cond_sync(enum tp_transition_sync sync)
>  {
>  	struct tp_transition_snapshot *snapshot = &tp_transition_snapshot[sync];
>  
> +	if (early_boot_irqs_disabled)
> +		return;
> +
>  	if (!snapshot->ongoing)
>  		return;
>  	cond_synchronize_rcu(snapshot->rcu);
> -- 
> 2.35.1
> 
Hi Steven,

On Wed, Oct 19, 2022 at 8:01 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> Allow triggers to be enabled at kernel boot up. For example:
>
>   trace_trigger="sched_switch.stacktrace if prev_state == 2"
>
> The above will enable the stacktrace trigger on top of the sched_switch
> event and only trigger if its prev_state is 2 (TASK_UNINTERRUPTIBLE). Then
> at boot up, a stacktrace will trigger and be recorded in the tracing ring
> buffer every time the sched_switch happens where the previous state is
> TASK_INTERRUPTIBLE.
>
> As this calls into tracepoint logic during very early boot (before
> interrupts are enabled), a check has to be done to see if early boot
> interrupts are still disabled, and if so, avoid any call to RCU
> synchronization, as that will enable interrupts and cause boot up issues.
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---

rest_init() -> rcu_scheduler_starting() sets RCU_SCHEDULER_INIT

start_kernel() disables interrupts and does the dance you mentioned,
while setting that early_ flag. start_kernel() is called from arch
code.

From your crash, the trace trigger is happening before the early_ boot
flag is set to true.

rcu_blocking_is_gp() should return true at the point your trigger fires
because RCU_SCHEDULER_INACTIVE is still set. So I would expect your
synchronize_rcu() to not do anything. There's comments in
rcu_blocking_is_gp() like so:

 * During early boot, any blocking grace-period wait automatically
 * implies a grace period.

So it should be returning straight from here. I wonder where/why it's
re-enabling interrupts for you. What does rcu_blocking_is_gp() return
if you can print that, at the time of the crash?

Cheers,

- J
On Thu, Oct 20, 2022 at 10:43 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Steven,
>
> On Wed, Oct 19, 2022 at 8:01 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> >
> > Allow triggers to be enabled at kernel boot up. For example:
> >
> >   trace_trigger="sched_switch.stacktrace if prev_state == 2"
> >
> > The above will enable the stacktrace trigger on top of the sched_switch
> > event and only trigger if its prev_state is 2 (TASK_UNINTERRUPTIBLE). Then
> > at boot up, a stacktrace will trigger and be recorded in the tracing ring
> > buffer every time the sched_switch happens where the previous state is
> > TASK_INTERRUPTIBLE.
> >
> > As this calls into tracepoint logic during very early boot (before
> > interrupts are enabled), a check has to be done to see if early boot
> > interrupts are still disabled, and if so, avoid any call to RCU
> > synchronization, as that will enable interrupts and cause boot up issues.
> >
> > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > ---
>
> rest_init() -> rcu_scheduler_starting() sets RCU_SCHEDULER_INIT
>
> start_kernel() disables interrupts and does the dance you mentioned,
> while setting that early_ flag. start_kernel() is called from arch
> code.
>
> From your crash, the trace trigger is happening before the early_ boot
> flag is set to true.

My apologies. I meant, before early_boot_irqs_disabled is set to false.
On Thu, 20 Oct 2022 23:33:18 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Wed, 19 Oct 2022 20:01:37 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> >
> > Allow triggers to be enabled at kernel boot up. For example:
> >
> >   trace_trigger="sched_switch.stacktrace if prev_state == 2"
> >
> > The above will enable the stacktrace trigger on top of the sched_switch
> > event and only trigger if its prev_state is 2 (TASK_UNINTERRUPTIBLE). Then
> > at boot up, a stacktrace will trigger and be recorded in the tracing ring
> > buffer every time the sched_switch happens where the previous state is
> > TASK_INTERRUPTIBLE.
> >
> > As this calls into tracepoint logic during very early boot (before
> > interrupts are enabled), a check has to be done to see if early boot
> > interrupts are still disabled, and if so, avoid any call to RCU
> > synchronization, as that will enable interrupts and cause boot up issues.
>
> Just out of curiosity, can you do it by boot-time tracer?
> (Is it too late for your issue?)

Yeah, I'm looking at adding triggers very early. I could even add a
"traceoff" trigger here too. I'm guessing that bootconfig is done later
in boot up? Especially since I needed to fix RCU calls to get this
working.

> $ cat >> stacktrace.bconf
> ftrace.event.sched.sched_switch.actions = "stacktrace if prev_state == 2"
> ^D
> $ bootconfig -a stacktrace.bconf initrd.img

Not to mention, I'm doing this for Chromebooks, where it's easy to
update the command line (on dev devices), but not as easy to modify the
kernel.

-- Steve
On Wed, Oct 19, 2022 at 08:07:45PM -0400, Steven Rostedt wrote:
> On Wed, 19 Oct 2022 20:01:37 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > @@ -90,6 +90,10 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
> >  #ifdef CONFIG_TRACEPOINTS
> >  static inline void tracepoint_synchronize_unregister(void)
> >  {
> > +	/* Early updates do not need synchronization */
> > +	if (early_boot_irqs_disabled)
> > +		return;
> > +
> >  	synchronize_srcu(&tracepoint_srcu);
> >  	synchronize_rcu();
>
> I wonder if this check should be just moved to the RCU synchronization
> code? That is, if early_boot_irqs_disabled is set, do nothing, as there
> should be nothing to synchronize against.

There already is a similar check, but it follows the lockdep check.

Does the following patch help?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index ca4b5dcec675b..356ef70d5442c 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1267,14 +1267,15 @@ static void __synchronize_srcu(struct srcu_struct *ssp, bool do_norm)
 {
 	struct rcu_synchronize rcu;
 
+	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
+		return;
+
 	RCU_LOCKDEP_WARN(lockdep_is_held(ssp) ||
 			 lock_is_held(&rcu_bh_lock_map) ||
 			 lock_is_held(&rcu_lock_map) ||
 			 lock_is_held(&rcu_sched_lock_map),
 			 "Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");
 
-	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
-		return;
 	might_sleep();
 	check_init_srcu_struct(ssp);
 	init_completion(&rcu.completion);
On Thu, 20 Oct 2022 08:59:00 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> On Wed, Oct 19, 2022 at 08:07:45PM -0400, Steven Rostedt wrote:
> > On Wed, 19 Oct 2022 20:01:37 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > > @@ -90,6 +90,10 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
> > >  #ifdef CONFIG_TRACEPOINTS
> > >  static inline void tracepoint_synchronize_unregister(void)
> > >  {
> > > +	/* Early updates do not need synchronization */
> > > +	if (early_boot_irqs_disabled)
> > > +		return;
> > > +
> > >  	synchronize_srcu(&tracepoint_srcu);
> > >  	synchronize_rcu();
> >
> > I wonder if this check should be just moved to the RCU synchronization
> > code? That is, if early_boot_irqs_disabled is set, do nothing, as there
> > should be nothing to synchronize against.
>
> There already is a similar check, but it follows the lockdep check.
>
> Does the following patch help?

Not sure if it would.

I added this:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c03fd7037add..79ac31a6a87b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2748,9 +2748,12 @@ void trace_buffered_event_disable(void)
 			     disable_trace_buffered_event, NULL, 1);
 	preempt_enable();
 
+	if (!irqs_disabled())
+		printk("IRQS DISABLED!! before %s:%d\n", __func__, __LINE__);
 	/* Wait for all current users to finish */
-	if (!early_boot_irqs_disabled)
-		synchronize_rcu();
+	synchronize_rcu();
+	if (!irqs_disabled())
+		printk("IRQS DISABLED!! after %s:%d\n", __func__, __LINE__);
 
 	for_each_tracing_cpu(cpu) {
 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));

Which produced this:

[    0.972867] ftrace: allocating 47021 entries in 184 pages
[    0.978285] ftrace section at ffffffffacef74c0 sorted properly
[    0.991153] ftrace: allocated 184 pages with 4 groups
[    0.996163] Dynamic Preempt: full
[    0.999354] rcu: Preemptible hierarchical RCU implementation.
[    1.005020] rcu: 	RCU restricting CPUs from NR_CPUS=128 to nr_cpu_ids=8.
[    1.011680] 	Tasks-RCU CPU stall warnings timeout set to 120000 (rcu_task_stall_timeout).
[    1.019813] 	Trampoline variant of Tasks RCU enabled.
[    1.024831] 	Rude variant of Tasks RCU enabled.
[    1.029330] 	Tracing variant of Tasks RCU enabled.
[    1.034089] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[    1.041699] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
[    1.052607] IRQS DISABLED!! after trace_buffered_event_disable:2756
[    1.058712] NR_IRQS: 8448, nr_irqs: 488, preallocated irqs: 16
[    1.064670] NO_HZ: Full dynticks CPUs: 3.
[    1.068501] rcu: 	Offload RCU callbacks from CPUs: 3.
[    1.073519] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    1.080400] ------------[ cut here ]------------
[    1.084938] Interrupts were enabled early
[    1.088919] WARNING: CPU: 0 PID: 0 at init/main.c:1061 start_kernel+0x36f/0x4e8
[    1.096184] Modules linked in:
[    1.099213] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-rc1-test+ #468
[    1.106047] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
[    1.114960] RIP: 0010:start_kernel+0x36f/0x4e8
[    1.119373] Code: 48 27 53 e8 fd 7b 03 00 e8 55 72 e5 fe e8 6b 05 03 00 ff 15 2a f9 89 ff 0f ba e0 09 73 0e 48 c7 c7 98 25 31 ac e8 01 5b df fe <0f> 0b c6 05 c7 2c af ff 00 e8 ba 36 47 fe ff 15 14 f9 89 ff e8 56
[    1.138057] RSP: 0000:ffffffffac603f30 EFLAGS: 00010286
[    1.143248] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    1.150343] RDX: 0000000000000002 RSI: 00000000ffffdfff RDI: 0000000000000001
[    1.157438] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffac603dd8
[    1.164537] R10: 0000000000000003 R11: ffffffffac734168 R12: 000000000000180a
[    1.171633] R13: 00000000010fecf0 R14: 0000000000000000 R15: 00000000d44e3018
[    1.178730] FS:  0000000000000000(0000) GS:ffff93ee1aa00000(0000) knlGS:0000000000000000
[    1.186778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.192489] CR2: ffff93ee1edff000 CR3: 000000002e60a001 CR4: 00000000000606f0
[    1.199584] Call Trace:
[    1.202009]  <TASK>

So I'm not sure how just updating srcu will do anything to prevent this.

-- Steve
> On Oct 20, 2022, at 12:33 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Thu, 20 Oct 2022 08:59:00 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
>>> On Wed, Oct 19, 2022 at 08:07:45PM -0400, Steven Rostedt wrote:
>>> On Wed, 19 Oct 2022 20:01:37 -0400
>>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>> 
>>>> @@ -90,6 +90,10 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
>>>>  #ifdef CONFIG_TRACEPOINTS
>>>>  static inline void tracepoint_synchronize_unregister(void)
>>>>  {
>>>> +	/* Early updates do not need synchronization */
>>>> +	if (early_boot_irqs_disabled)
>>>> +		return;
>>>> +
>>>>  	synchronize_srcu(&tracepoint_srcu);
>>>>  	synchronize_rcu();
>>> 
>>> I wonder if this check should be just moved to the RCU synchronization
>>> code? That is, if early_boot_irqs_disabled is set, do nothing, as there
>>> should be nothing to synchronize against.
>> 
>> There already is a similar check, but it follows the lockdep check.
>> 
>> Does the following patch help?
> 
> Not sure if it would.
> 
> I added this:

I'm not working today, so I can't look very closely. But could you remove
the might_sleep() from rcu_gp_is_blocking() and see if it fixes your issue?
I think that function call might be incompatible with early boot. Just a
hunch.

- Joel

> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index c03fd7037add..79ac31a6a87b 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2748,9 +2748,12 @@ void trace_buffered_event_disable(void)
>  				disable_trace_buffered_event, NULL, 1);
>  	preempt_enable();
>  
> +	if (!irqs_disabled())
> +		printk("IRQS DISABLED!! before %s:%d\n", __func__, __LINE__);
>  	/* Wait for all current users to finish */
> -	if (!early_boot_irqs_disabled)
> -		synchronize_rcu();
> +	synchronize_rcu();
> +	if (!irqs_disabled())
> +		printk("IRQS DISABLED!! after %s:%d\n", __func__, __LINE__);
>  
>  	for_each_tracing_cpu(cpu) {
>  		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 
> Which produced this:
> 
> [    0.972867] ftrace: allocating 47021 entries in 184 pages
> [    0.978285] ftrace section at ffffffffacef74c0 sorted properly
> [    0.991153] ftrace: allocated 184 pages with 4 groups
> [    0.996163] Dynamic Preempt: full
> [    0.999354] rcu: Preemptible hierarchical RCU implementation.
> [    1.005020] rcu: RCU restricting CPUs from NR_CPUS=128 to nr_cpu_ids=8.
> [    1.011680] Tasks-RCU CPU stall warnings timeout set to 120000 (rcu_task_stall_timeout).
> [    1.019813] Trampoline variant of Tasks RCU enabled.
> [    1.024831] Rude variant of Tasks RCU enabled.
> [    1.029330] Tracing variant of Tasks RCU enabled.
> [    1.034089] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
> [    1.041699] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
> [    1.052607] IRQS DISABLED!! after trace_buffered_event_disable:2756
> [    1.058712] NR_IRQS: 8448, nr_irqs: 488, preallocated irqs: 16
> [    1.064670] NO_HZ: Full dynticks CPUs: 3.
> [    1.068501] rcu: Offload RCU callbacks from CPUs: 3.
> [    1.073519] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [    1.080400] ------------[ cut here ]------------
> [    1.084938] Interrupts were enabled early
> [    1.088919] WARNING: CPU: 0 PID: 0 at init/main.c:1061 start_kernel+0x36f/0x4e8
> [    1.096184] Modules linked in:
> [    1.099213] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-rc1-test+ #468
> [    1.106047] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> [    1.114960] RIP: 0010:start_kernel+0x36f/0x4e8
> [    1.119373] Code: 48 27 53 e8 fd 7b 03 00 e8 55 72 e5 fe e8 6b 05 03 00 ff 15 2a f9 89 ff 0f ba e0 09 73 0e 48 c7 c7 98 25 31 ac e8 01 5b df fe <0f> 0b c6 05 c7 2c af ff 00 e8 ba 36 47 fe ff 15 14 f9 89 ff e8 56
> [    1.138057] RSP: 0000:ffffffffac603f30 EFLAGS: 00010286
> [    1.143248] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [    1.150343] RDX: 0000000000000002 RSI: 00000000ffffdfff RDI: 0000000000000001
> [    1.157438] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffac603dd8
> [    1.164537] R10: 0000000000000003 R11: ffffffffac734168 R12: 000000000000180a
> [    1.171633] R13: 00000000010fecf0 R14: 0000000000000000 R15: 00000000d44e3018
> [    1.178730] FS:  0000000000000000(0000) GS:ffff93ee1aa00000(0000) knlGS:0000000000000000
> [    1.186778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.192489] CR2: ffff93ee1edff000 CR3: 000000002e60a001 CR4: 00000000000606f0
> [    1.199584] Call Trace:
> [    1.202009]  <TASK>
> 
> 
> So I'm not sure how just updating srcu will do anything to prevent this.
> 
> -- Steve
> 
>> 
>> 							Thanx, Paul
>> 
>> ------------------------------------------------------------------------
>> 
>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>> index ca4b5dcec675b..356ef70d5442c 100644
>> --- a/kernel/rcu/srcutree.c
>> +++ b/kernel/rcu/srcutree.c
>> @@ -1267,14 +1267,15 @@ static void __synchronize_srcu(struct srcu_struct *ssp, bool do_norm)
>>  {
>>  	struct rcu_synchronize rcu;
>>  
>> +	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
>> +		return;
>> +
>>  	RCU_LOCKDEP_WARN(lockdep_is_held(ssp) ||
>>  			 lock_is_held(&rcu_bh_lock_map) ||
>>  			 lock_is_held(&rcu_lock_map) ||
>>  			 lock_is_held(&rcu_sched_lock_map),
>>  			 "Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");
>>  
>> -	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
>> -		return;
>>  	might_sleep();
>>  	check_init_srcu_struct(ssp);
>>  	init_completion(&rcu.completion);
> 
On Thu, 20 Oct 2022 12:33:57 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index c03fd7037add..79ac31a6a87b 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2748,9 +2748,12 @@ void trace_buffered_event_disable(void)
>  				disable_trace_buffered_event, NULL, 1);
>  	preempt_enable();
>  
> +	if (!irqs_disabled())
> +		printk("IRQS DISABLED!! before %s:%d\n", __func__, __LINE__);
>  	/* Wait for all current users to finish */
> -	if (!early_boot_irqs_disabled)
> -		synchronize_rcu();
> +	synchronize_rcu();
> +	if (!irqs_disabled())
> +		printk("IRQS DISABLED!! after %s:%d\n", __func__, __LINE__);
>  
>  	for_each_tracing_cpu(cpu) {
>  		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));

Continuing the above printk()s I found the culprit.

	synchronize_rcu() {
		rcu_poll_gp_seq_start_unlocked() {
			struct rcu_node *rnp = rcu_get_root();

			if (rcu_init_invoked()) {
				lockdep_assert_irqs_enabled();
				raw_spin_lock_irq_rcu_node(rnp);
			}
			rcu_poll_gp_seq_start(snap);
			if (rcu_init_invoked())
				raw_spin_unlock_irq_rcu_node(rnp);  <<-- Interrupts enabled here
		}

-- Steve
On Thu, Oct 20, 2022 at 12:33:57PM -0400, Steven Rostedt wrote:
> On Thu, 20 Oct 2022 08:59:00 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > On Wed, Oct 19, 2022 at 08:07:45PM -0400, Steven Rostedt wrote:
> > > On Wed, 19 Oct 2022 20:01:37 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > > 
> > > > @@ -90,6 +90,10 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
> > > >  #ifdef CONFIG_TRACEPOINTS
> > > >  static inline void tracepoint_synchronize_unregister(void)
> > > >  {
> > > > +	/* Early updates do not need synchronization */
> > > > +	if (early_boot_irqs_disabled)
> > > > +		return;
> > > > +
> > > >  	synchronize_srcu(&tracepoint_srcu);
> > > >  	synchronize_rcu();
> > > 
> > > I wonder if this check should be just moved to the RCU synchronization
> > > code? That is, if early_boot_irqs_disabled is set, do nothing, as there
> > > should be nothing to synchronize against.
> > 
> > There already is a similar check, but it follows the lockdep check.
> > 
> > Does the following patch help?
> 
> Not sure if it would.
> 
> I added this:
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index c03fd7037add..79ac31a6a87b 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2748,9 +2748,12 @@ void trace_buffered_event_disable(void)
>  				disable_trace_buffered_event, NULL, 1);
>  	preempt_enable();
>  
> +	if (!irqs_disabled())
> +		printk("IRQS DISABLED!! before %s:%d\n", __func__, __LINE__);
>  	/* Wait for all current users to finish */
> -	if (!early_boot_irqs_disabled)
> -		synchronize_rcu();
> +	synchronize_rcu();
> +	if (!irqs_disabled())
> +		printk("IRQS DISABLED!! after %s:%d\n", __func__, __LINE__);
>  
>  	for_each_tracing_cpu(cpu) {
>  		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 
> Which produced this:
> 
> [    0.972867] ftrace: allocating 47021 entries in 184 pages
> [    0.978285] ftrace section at ffffffffacef74c0 sorted properly
> [    0.991153] ftrace: allocated 184 pages with 4 groups
> [    0.996163] Dynamic Preempt: full
> [    0.999354] rcu: Preemptible hierarchical RCU implementation.
> [    1.005020] rcu: RCU restricting CPUs from NR_CPUS=128 to nr_cpu_ids=8.
> [    1.011680] Tasks-RCU CPU stall warnings timeout set to 120000 (rcu_task_stall_timeout).
> [    1.019813] Trampoline variant of Tasks RCU enabled.
> [    1.024831] Rude variant of Tasks RCU enabled.
> [    1.029330] Tracing variant of Tasks RCU enabled.
> [    1.034089] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
> [    1.041699] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
> [    1.052607] IRQS DISABLED!! after trace_buffered_event_disable:2756
> [    1.058712] NR_IRQS: 8448, nr_irqs: 488, preallocated irqs: 16
> [    1.064670] NO_HZ: Full dynticks CPUs: 3.
> [    1.068501] rcu: Offload RCU callbacks from CPUs: 3.
> [    1.073519] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [    1.080400] ------------[ cut here ]------------
> [    1.084938] Interrupts were enabled early
> [    1.088919] WARNING: CPU: 0 PID: 0 at init/main.c:1061 start_kernel+0x36f/0x4e8
> [    1.096184] Modules linked in:
> [    1.099213] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-rc1-test+ #468
> [    1.106047] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> [    1.114960] RIP: 0010:start_kernel+0x36f/0x4e8
> [    1.119373] Code: 48 27 53 e8 fd 7b 03 00 e8 55 72 e5 fe e8 6b 05 03 00 ff 15 2a f9 89 ff 0f ba e0 09 73 0e 48 c7 c7 98 25 31 ac e8 01 5b df fe <0f> 0b c6 05 c7 2c af ff 00 e8 ba 36 47 fe ff 15 14 f9 89 ff e8 56
> [    1.138057] RSP: 0000:ffffffffac603f30 EFLAGS: 00010286
> [    1.143248] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [    1.150343] RDX: 0000000000000002 RSI: 00000000ffffdfff RDI: 0000000000000001
> [    1.157438] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffac603dd8
> [    1.164537] R10: 0000000000000003 R11: ffffffffac734168 R12: 000000000000180a
> [    1.171633] R13: 00000000010fecf0 R14: 0000000000000000 R15: 00000000d44e3018
> [    1.178730] FS:  0000000000000000(0000) GS:ffff93ee1aa00000(0000) knlGS:0000000000000000
> [    1.186778] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.192489] CR2: ffff93ee1edff000 CR3: 000000002e60a001 CR4: 00000000000606f0
> [    1.199584] Call Trace:
> [    1.202009]  <TASK>
> 
> 
> So I'm not sure how just updating srcu will do anything to prevent this.

Apologies, SRCU on the brain due to a different email thread...  :-/

							Thanx, Paul

> -- Steve
> 
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index ca4b5dcec675b..356ef70d5442c 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1267,14 +1267,15 @@ static void __synchronize_srcu(struct srcu_struct *ssp, bool do_norm)
> >  {
> >  	struct rcu_synchronize rcu;
> >  
> > +	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > +		return;
> > +
> >  	RCU_LOCKDEP_WARN(lockdep_is_held(ssp) ||
> >  			 lock_is_held(&rcu_bh_lock_map) ||
> >  			 lock_is_held(&rcu_lock_map) ||
> >  			 lock_is_held(&rcu_sched_lock_map),
> >  			 "Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");
> >  
> > -	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > -		return;
> >  	might_sleep();
> >  	check_init_srcu_struct(ssp);
> >  	init_completion(&rcu.completion);
> 
On Thu, Oct 20, 2022 at 01:05:28PM -0400, Steven Rostedt wrote:
> On Thu, 20 Oct 2022 12:33:57 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index c03fd7037add..79ac31a6a87b 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -2748,9 +2748,12 @@ void trace_buffered_event_disable(void)
> >  				disable_trace_buffered_event, NULL, 1);
> >  	preempt_enable();
> >  
> > +	if (!irqs_disabled())
> > +		printk("IRQS DISABLED!! before %s:%d\n", __func__, __LINE__);
> >  	/* Wait for all current users to finish */
> > -	if (!early_boot_irqs_disabled)
> > -		synchronize_rcu();
> > +	synchronize_rcu();
> > +	if (!irqs_disabled())
> > +		printk("IRQS DISABLED!! after %s:%d\n", __func__, __LINE__);
> >  
> >  	for_each_tracing_cpu(cpu) {
> >  		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 
> Continuing the above printk()s I found the culprit.
> 
> 	synchronize_rcu() {
> 		rcu_poll_gp_seq_start_unlocked() {
> 			struct rcu_node *rnp = rcu_get_root();
> 
> 			if (rcu_init_invoked()) {
> 				lockdep_assert_irqs_enabled();
> 				raw_spin_lock_irq_rcu_node(rnp);
> 			}
> 			rcu_poll_gp_seq_start(snap);
> 			if (rcu_init_invoked())
> 				raw_spin_unlock_irq_rcu_node(rnp);  <<-- Interrupts enabled here
> 	}

Thank you for digging into this!

Does the following help?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index abc615808b6e8..45ad9ef00d647 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1425,30 +1425,32 @@ static void rcu_poll_gp_seq_end(unsigned long *snap)
 // where caller does not hold the root rcu_node structure's lock.
 static void rcu_poll_gp_seq_start_unlocked(unsigned long *snap)
 {
+	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root();
 
 	if (rcu_init_invoked()) {
 		lockdep_assert_irqs_enabled();
-		raw_spin_lock_irq_rcu_node(rnp);
+		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
 	rcu_poll_gp_seq_start(snap);
 	if (rcu_init_invoked())
-		raw_spin_unlock_irq_rcu_node(rnp);
+		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
 // Make the polled API aware of the end of a grace period, but where
 // caller does not hold the root rcu_node structure's lock.
 static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 {
+	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root();
 
 	if (rcu_init_invoked()) {
 		lockdep_assert_irqs_enabled();
-		raw_spin_lock_irq_rcu_node(rnp);
+		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
 	rcu_poll_gp_seq_end(snap);
 	if (rcu_init_invoked())
-		raw_spin_unlock_irq_rcu_node(rnp);
+		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
 /*
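The bug and the fix in Paul's tree.c patch above can be seen with a toy userspace model of the two lock flavors (a sketch, not the kernel's spinlock implementation): the _irq variant unconditionally re-enables interrupts on unlock, while _irqsave/_irqrestore puts back whatever state the caller had, which is what an early-boot caller running with irqs disabled needs.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: one flag stands in for the CPU irq state. */
static bool irqs_enabled;

static void lock_irq(void)	{ irqs_enabled = false; }
static void unlock_irq(void)	{ irqs_enabled = true; }	/* always enables! */

static void lock_irqsave(bool *flags)
{
	*flags = irqs_enabled;		/* remember the caller's state */
	irqs_enabled = false;
}

static void unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;		/* put it back, whatever it was */
}

/* Early boot caller: irqs are off before start_kernel() enables them. */
static bool irq_state_after_irq_variant(void)
{
	irqs_enabled = false;
	lock_irq();
	unlock_irq();
	return irqs_enabled;		/* enabled behind the caller's back */
}

static bool irq_state_after_irqsave_variant(void)
{
	bool flags;

	irqs_enabled = false;
	lock_irqsave(&flags);
	unlock_irqrestore(flags);
	return irqs_enabled;		/* caller's disabled state preserved */
}
```

This is exactly why the "Interrupts were enabled early" warning fired: synchronize_rcu() took and released the root rcu_node lock with the _irq variant while start_kernel() was still running with irqs disabled.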
On Thu, 20 Oct 2022 10:55:12 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > synchronize_rcu() {
> > 	rcu_poll_gp_seq_start_unlocked() {
> > 		struct rcu_node *rnp = rcu_get_root();
> > 
> > 		if (rcu_init_invoked()) {
> > 			lockdep_assert_irqs_enabled();
> > 			raw_spin_lock_irq_rcu_node(rnp);
> > 		}
> > 		rcu_poll_gp_seq_start(snap);
> > 		if (rcu_init_invoked())
> > 			raw_spin_unlock_irq_rcu_node(rnp);  <<-- Interrupts enabled here
> > }
> 
> Thank you for digging into this!
> 
> Does the following help?

It does indeed!

I can remove the early_boot_irqs_disabled checks if you can add this to
this rc cycle and mark it stable.

-- Steve
On Thu, Oct 20, 2022 at 02:53:40PM -0400, Steven Rostedt wrote:
> On Thu, 20 Oct 2022 10:55:12 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > synchronize_rcu() {
> > > 	rcu_poll_gp_seq_start_unlocked() {
> > > 		struct rcu_node *rnp = rcu_get_root();
> > > 
> > > 		if (rcu_init_invoked()) {
> > > 			lockdep_assert_irqs_enabled();
> > > 			raw_spin_lock_irq_rcu_node(rnp);
> > > 		}
> > > 		rcu_poll_gp_seq_start(snap);
> > > 		if (rcu_init_invoked())
> > > 			raw_spin_unlock_irq_rcu_node(rnp);  <<-- Interrupts enabled here
> > > }
> > 
> > Thank you for digging into this!
> > 
> > Does the following help?
> 
> It does indeed!
> 
> I can remove the early_boot_irqs_disabled checks if you can add this to
> this rc cycle and mark it stable.

My plan is to push it in after a bit of testing.  The Fixes tag should
make the right -stable stuff happen.

							Thanx, Paul

------------------------------------------------------------------------

commit 952643575669332565c77066fb42d67987680f12
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu Oct 20 10:58:14 2022 -0700

    rcu: Keep synchronize_rcu() from enabling irqs in early boot
    
    Making polled RCU grace periods account for expedited grace periods
    required acquiring the leaf rcu_node structure's lock during early boot,
    but after rcu_init() was called.  This lock is irq-disabled, but the
    code incorrectly assumes that irqs are always disabled when invoking
    synchronize_rcu().  The exception is early boot before the scheduler
    has started, which means that upon return from synchronize_rcu(), irqs
    will be incorrectly enabled.
    
    This commit fixes this bug by using irqsave/irqrestore locking
    primitives.
    
    Fixes: bf95b2bc3e42 ("rcu: Switch polled grace-period APIs to ->gp_seq_polled")
    Reported-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index abc615808b6e8..45ad9ef00d647 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1425,30 +1425,32 @@ static void rcu_poll_gp_seq_end(unsigned long *snap)
 // where caller does not hold the root rcu_node structure's lock.
 static void rcu_poll_gp_seq_start_unlocked(unsigned long *snap)
 {
+	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root();
 
 	if (rcu_init_invoked()) {
 		lockdep_assert_irqs_enabled();
-		raw_spin_lock_irq_rcu_node(rnp);
+		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
 	rcu_poll_gp_seq_start(snap);
 	if (rcu_init_invoked())
-		raw_spin_unlock_irq_rcu_node(rnp);
+		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
 // Make the polled API aware of the end of a grace period, but where
 // caller does not hold the root rcu_node structure's lock.
 static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 {
+	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root();
 
 	if (rcu_init_invoked()) {
 		lockdep_assert_irqs_enabled();
-		raw_spin_lock_irq_rcu_node(rnp);
+		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
 	rcu_poll_gp_seq_end(snap);
 	if (rcu_init_invoked())
-		raw_spin_unlock_irq_rcu_node(rnp);
+		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
 /*
On Thu, 20 Oct 2022 15:21:08 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> My plan is to push it in after a bit of testing.  The Fixes tag should
> make the right -stable stuff happen.

Great, thanks!

I'll resend my patch without the synchronize_rcu() avoidance.

-- Steve
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a465d5242774..ccf91a4bf113 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6257,6 +6257,25 @@
 			See also Documentation/trace/ftrace.rst "trace options"
 			section.
 
+	trace_trigger=[trigger-list]
+			[FTRACE] Add an event trigger on specific events.
+			Set a trigger on top of a specific event, with an
+			optional filter.
+
+			The format is "trace_trigger=<event>.<trigger>[ if <filter>],..."
+			where more than one trigger may be specified,
+			comma-delimited.
+
+			For example:
+
+			  trace_trigger="sched_switch.stacktrace if prev_state == 2"
+
+			The above will enable the "stacktrace" trigger on the
+			"sched_switch" event, but only trigger it if the
+			"prev_state" of the "sched_switch" event is "2"
+			(TASK_UNINTERRUPTIBLE).
+
+			See also "Event triggers" in Documentation/trace/events.rst
+
 	traceoff_on_warning
 			[FTRACE] enable this option to disable tracing when a
 			warning is hit. This turns off "tracing_on". Tracing can
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 4b33b95eb8be..a5c6b5772897 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -90,6 +90,10 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
 #ifdef CONFIG_TRACEPOINTS
 static inline void tracepoint_synchronize_unregister(void)
 {
+	/* Early updates do not need synchronization */
+	if (early_boot_irqs_disabled)
+		return;
+
 	synchronize_srcu(&tracepoint_srcu);
 	synchronize_rcu();
 }
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 47a44b055a1d..c03fd7037add 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2749,7 +2749,8 @@ void trace_buffered_event_disable(void)
 	preempt_enable();
 
 	/* Wait for all current users to finish */
-	synchronize_rcu();
+	if (!early_boot_irqs_disabled)
+		synchronize_rcu();
 
 	for_each_tracing_cpu(cpu) {
 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 0356cae0cf74..06554939252c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2796,6 +2796,44 @@ trace_create_new_event(struct trace_event_call *call,
 	return file;
 }
 
+#ifdef CONFIG_HIST_TRIGGERS
+#define MAX_BOOT_TRIGGERS 32
+
+static struct boot_triggers {
+	const char	*event;
+	char		*trigger;
+} bootup_triggers[MAX_BOOT_TRIGGERS];
+
+static char bootup_trigger_buf[COMMAND_LINE_SIZE];
+static int nr_boot_triggers;
+
+static __init int setup_trace_triggers(char *str)
+{
+	char *trigger;
+	char *buf;
+	int i;
+
+	strlcpy(bootup_trigger_buf, str, COMMAND_LINE_SIZE);
+	ring_buffer_expanded = true;
+	disable_tracing_selftest("running event triggers");
+
+	buf = bootup_trigger_buf;
+	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
+		trigger = strsep(&buf, ",");
+		if (!trigger)
+			break;
+		bootup_triggers[i].event = strsep(&trigger, ".");
+		bootup_triggers[i].trigger = strsep(&trigger, ".");
+		if (!bootup_triggers[i].trigger)
+			break;
+	}
+
+	nr_boot_triggers = i;
+	return 1;
+}
+__setup("trace_trigger=", setup_trace_triggers);
+#endif
+
 /* Add an event to a trace directory */
 static int
 __trace_add_new_event(struct trace_event_call *call, struct trace_array *tr)
@@ -2822,12 +2860,32 @@ __trace_early_add_new_event(struct trace_event_call *call,
 			    struct trace_array *tr)
 {
 	struct trace_event_file *file;
+	int ret;
+	int i;
 
 	file = trace_create_new_event(call, tr);
 	if (!file)
 		return -ENOMEM;
 
-	return event_define_fields(call);
+	ret = event_define_fields(call);
+	if (ret)
+		return ret;
+
+#ifdef CONFIG_HIST_TRIGGERS
+	for (i = 0; i < nr_boot_triggers; i++) {
+		if (strcmp(trace_event_name(call), bootup_triggers[i].event))
+			continue;
+		mutex_lock(&event_mutex);
+		ret = trigger_process_regex(file, bootup_triggers[i].trigger);
+		mutex_unlock(&event_mutex);
+		if (ret)
+			pr_err("Failed to register trigger '%s' on event %s\n",
+			       bootup_triggers[i].trigger,
+			       bootup_triggers[i].event);
+	}
+#endif
+
+	return 0;
 }
 
 struct ftrace_module_file_ops;
@@ -3726,6 +3784,8 @@ static __init int event_trace_enable(void)
 			list_add(&call->list, &ftrace_events);
 	}
 
+	register_trigger_cmds();
+
 	/*
 	 * We need the top trace array to have a working set of trace
 	 * points at early init, before the debug files and directories
@@ -3740,7 +3800,6 @@ static __init int event_trace_enable(void)
 
 	register_event_cmds();
 
-	register_trigger_cmds();
 	return 0;
 }
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index f23144af5743..f6e4ee1e40b3 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -48,6 +48,9 @@ static void tp_rcu_get_state(enum tp_transition_sync sync)
 {
 	struct tp_transition_snapshot *snapshot = &tp_transition_snapshot[sync];
 
+	if (early_boot_irqs_disabled)
+		return;
+
 	/* Keep the latest get_state snapshot. */
 	snapshot->rcu = get_state_synchronize_rcu();
 	snapshot->srcu = start_poll_synchronize_srcu(&tracepoint_srcu);
@@ -58,6 +61,9 @@ static void tp_rcu_cond_sync(enum tp_transition_sync sync)
 {
 	struct tp_transition_snapshot *snapshot = &tp_transition_snapshot[sync];
 
+	if (early_boot_irqs_disabled)
+		return;
+
 	if (!snapshot->ongoing)
 		return;
 	cond_synchronize_rcu(snapshot->rcu);
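The setup_trace_triggers() parsing in the trace_events.c hunk above can be exercised as a standalone sketch; the helper below mirrors its strsep() logic in userspace (parse_trace_triggers() and struct boot_trigger are hypothetical stand-ins, not the kernel code):

```c
#define _DEFAULT_SOURCE		/* for strsep() in glibc */
#include <assert.h>
#include <string.h>

#define MAX_BOOT_TRIGGERS 32

struct boot_trigger {
	const char *event;	/* e.g. "sched_switch" */
	char *trigger;		/* e.g. "stacktrace if prev_state == 2" */
};

/*
 * Split "event.trigger[ if filter],event.trigger,..." the same way the
 * patch does: first on ',' to get each spec, then on '.' to peel off
 * the event name.  The optional " if <filter>" stays attached to the
 * trigger string (it would be handled later by trigger_process_regex()).
 * Modifies 'buf' in place; returns the number of parsed triggers.
 */
static int parse_trace_triggers(char *buf, struct boot_trigger *out)
{
	char *trigger;
	int i;

	for (i = 0; i < MAX_BOOT_TRIGGERS; i++) {
		trigger = strsep(&buf, ",");
		if (!trigger)
			break;
		out[i].event = strsep(&trigger, ".");
		out[i].trigger = strsep(&trigger, ".");
		if (!out[i].trigger)
			break;
	}
	return i;
}
```

One consequence of the second strsep() on '.' worth noting: a '.' appearing inside the trigger or filter text itself would end the token there, so trigger specs are effectively limited to dot-free bodies.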