diff mbox series

[v2] exit: move and extend sched_process_exit() tracepoint

Message ID 20250402180925.90914-1-andrii@kernel.org (mailing list archive)
State New
Headers show
Series [v2] exit: move and extend sched_process_exit() tracepoint | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Andrii Nakryiko April 2, 2025, 6:09 p.m. UTC
It is useful to be able to access current->mm at task exit to, say,
record a bunch of VMA information right before the task exits (e.g., for
stack symbolization reasons when dealing with short-lived processes that
exit in the middle of profiling session). Currently,
trace_sched_process_exit() is triggered after exit_mm() which resets
current->mm to NULL making this tracepoint unsuitable for inspecting
and recording task's mm_struct-related data when tracing process
lifetimes.

There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm() and other
exit_*() resource teardowns. taskstats performs a similar kind of
accounting that some applications do with BPF, and so co-locating them
seems like a good fit. So that's where trace_sched_process_exit() is
moved with this patch.

Also, existing trace_sched_process_exit() tracepoint is notoriously
missing `group_dead` flag that is certainly useful in practice and some
of our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.

Note that we can't use sched_process_template anymore, and so we use
TRACE_EVENT()-based tracepoint definition. But all the field names and
order, as well as assign and output logic remain intact. We just add one
extra field at the end in backwards-compatible way.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
 kernel/exit.c                |  2 +-
 2 files changed, 26 insertions(+), 4 deletions(-)

Comments

Steven Rostedt April 2, 2025, 6:36 p.m. UTC | #1
On Wed,  2 Apr 2025 11:09:25 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> It is useful to be able to access current->mm at task exit to, say,
> record a bunch of VMA information right before the task exits (e.g., for
> stack symbolization reasons when dealing with short-lived processes that
> exit in the middle of profiling session). Currently,
> trace_sched_process_exit() is triggered after exit_mm() which resets
> current->mm to NULL making this tracepoint unsuitable for inspecting
> and recording task's mm_struct-related data when tracing process
> lifetimes.
> 
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm() and other
> exit_*() resource teardowns. taskstats performs a similar kind of
> accounting that some applications do with BPF, and so co-locating them
> seems like a good fit. So that's where trace_sched_process_exit() is
> moved with this patch.
> 
> Also, existing trace_sched_process_exit() tracepoint is notoriously
> missing `group_dead` flag that is certainly useful in practice and some
> of our production applications have to work around this. So plumb
> `group_dead` through while at it, to have a richer and more complete
> tracepoint.
> 
> Note that we can't use sched_process_template anymore, and so we use
> TRACE_EVENT()-based tracepoint definition. But all the field names and
> order, as well as assign and output logic remain intact. We just add one
> extra field at the end in backwards-compatible way.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
>  kernel/exit.c                |  2 +-
>  2 files changed, 26 insertions(+), 4 deletions(-)

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve
Oleg Nesterov April 2, 2025, 7:29 p.m. UTC | #2
On 04/02, Andrii Nakryiko wrote:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -937,12 +937,12 @@ void __noreturn do_exit(long code)
>  
>  	tsk->exit_code = code;
>  	taskstats_exit(tsk, group_dead);
> +	trace_sched_process_exit(tsk, group_dead);
>  
>  	exit_mm();
>  
>  	if (group_dead)
>  		acct_process();
> -	trace_sched_process_exit(tsk);

I see nothing wrong in this change.

(and I think that the current placement of trace_sched_process_exit() was
 more or less "random").

Acked-by: Oleg Nesterov <oleg@redhat.com>
diff mbox series

Patch

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..05a14f2b35c3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -328,9 +328,31 @@  DEFINE_EVENT(sched_process_template, sched_process_free,
 /*
  * Tracepoint for a task exiting:
  */
-DEFINE_EVENT(sched_process_template, sched_process_exit,
-	     TP_PROTO(struct task_struct *p),
-	     TP_ARGS(p));
+TRACE_EVENT(sched_process_exit,
+
+	TP_PROTO(struct task_struct *p, bool group_dead),
+
+	TP_ARGS(p, group_dead),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	int,	prio			)
+		__field(	bool,	group_dead		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
+		__entry->group_dead	= group_dead;
+	),
+
+	TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
+		  __entry->comm, __entry->pid, __entry->prio,
+		  __entry->group_dead ? "true" : "false"
+	)
+);
 
 /*
  * Tracepoint for waiting on task to unschedule:
diff --git a/kernel/exit.c b/kernel/exit.c
index c2e6c7b7779f..4abd307b1586 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -937,12 +937,12 @@  void __noreturn do_exit(long code)
 
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
+	trace_sched_process_exit(tsk, group_dead);
 
 	exit_mm();
 
 	if (group_dead)
 		acct_process();
-	trace_sched_process_exit(tsk);
 
 	exit_sem(tsk);
 	exit_shm(tsk);