Message ID | 20250402180925.90914-1-andrii@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] exit: move and extend sched_process_exit() tracepoint | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Wed, 2 Apr 2025 11:09:25 -0700 Andrii Nakryiko <andrii@kernel.org> wrote: > It is useful to be able to access current->mm at task exit to, say, > record a bunch of VMA information right before the task exits (e.g., for > stack symbolization reasons when dealing with short-lived processes that > exit in the middle of profiling session). Currently, > trace_sched_process_exit() is triggered after exit_mm() which resets > current->mm to NULL making this tracepoint unsuitable for inspecting > and recording task's mm_struct-related data when tracing process > lifetimes. > > There is a particularly suitable place, though, right after > taskstats_exit() is called, but before we do exit_mm() and other > exit_*() resource teardowns. taskstats performs a similar kind of > accounting that some applications do with BPF, and so co-locating them > seems like a good fit. So that's where trace_sched_process_exit() is > moved with this patch. > > Also, existing trace_sched_process_exit() tracepoint is notoriously > missing `group_dead` flag that is certainly useful in practice and some > of our production applications have to work around this. So plumb > `group_dead` through while at it, to have a richer and more complete > tracepoint. > > Note that we can't use sched_process_template anymore, and so we use > TRACE_EVENT()-based tracepoint definition. But all the field names and > order, as well as assign and output logic remain intact. We just add one > extra field at the end in backwards-compatible way. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > include/trace/events/sched.h | 28 +++++++++++++++++++++++++--- > kernel/exit.c | 2 +- > 2 files changed, 26 insertions(+), 4 deletions(-) Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> -- Steve
On 04/02, Andrii Nakryiko wrote: > > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -937,12 +937,12 @@ void __noreturn do_exit(long code) > > tsk->exit_code = code; > taskstats_exit(tsk, group_dead); > + trace_sched_process_exit(tsk, group_dead); > > exit_mm(); > > if (group_dead) > acct_process(); > - trace_sched_process_exit(tsk); I see nothing wrong in this change. (and I think that the current placement of trace_sched_process_exit() was more or less "random"). Acked-by: Oleg Nesterov <oleg@redhat.com>
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 8994e97d86c1..05a14f2b35c3 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -328,9 +328,31 @@ DEFINE_EVENT(sched_process_template, sched_process_free, /* * Tracepoint for a task exiting: */ -DEFINE_EVENT(sched_process_template, sched_process_exit, - TP_PROTO(struct task_struct *p), - TP_ARGS(p)); +TRACE_EVENT(sched_process_exit, + + TP_PROTO(struct task_struct *p, bool group_dead), + + TP_ARGS(p, group_dead), + + TP_STRUCT__entry( + __array( char, comm, TASK_COMM_LEN ) + __field( pid_t, pid ) + __field( int, prio ) + __field( bool, group_dead ) + ), + + TP_fast_assign( + memcpy(__entry->comm, p->comm, TASK_COMM_LEN); + __entry->pid = p->pid; + __entry->prio = p->prio; /* XXX SCHED_DEADLINE */ + __entry->group_dead = group_dead; + ), + + TP_printk("comm=%s pid=%d prio=%d group_dead=%s", + __entry->comm, __entry->pid, __entry->prio, + __entry->group_dead ? "true" : "false" + ) +); /* * Tracepoint for waiting on task to unschedule: diff --git a/kernel/exit.c b/kernel/exit.c index c2e6c7b7779f..4abd307b1586 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -937,12 +937,12 @@ void __noreturn do_exit(long code) tsk->exit_code = code; taskstats_exit(tsk, group_dead); + trace_sched_process_exit(tsk, group_dead); exit_mm(); if (group_dead) acct_process(); - trace_sched_process_exit(tsk); exit_sem(tsk); exit_shm(tsk);
It is useful to be able to access current->mm at task exit to, say, record a bunch of VMA information right before the task exits (e.g., for stack symbolization reasons when dealing with short-lived processes that exit in the middle of profiling session). Currently, trace_sched_process_exit() is triggered after exit_mm() which resets current->mm to NULL making this tracepoint unsuitable for inspecting and recording task's mm_struct-related data when tracing process lifetimes. There is a particularly suitable place, though, right after taskstats_exit() is called, but before we do exit_mm() and other exit_*() resource teardowns. taskstats performs a similar kind of accounting that some applications do with BPF, and so co-locating them seems like a good fit. So that's where trace_sched_process_exit() is moved with this patch. Also, existing trace_sched_process_exit() tracepoint is notoriously missing `group_dead` flag that is certainly useful in practice and some of our production applications have to work around this. So plumb `group_dead` through while at it, to have a richer and more complete tracepoint. Note that we can't use sched_process_template anymore, and so we use TRACE_EVENT()-based tracepoint definition. But all the field names and order, as well as assign and output logic remain intact. We just add one extra field at the end in backwards-compatible way. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- include/trace/events/sched.h | 28 +++++++++++++++++++++++++--- kernel/exit.c | 2 +- 2 files changed, 26 insertions(+), 4 deletions(-)