Message ID | 20250401184021.2591443-1-andrii@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | exit: add trace_task_exit() tracepoint before current->mm is reset | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Tue, 1 Apr 2025 11:40:21 -0700 Andrii Nakryiko <andrii@kernel.org> wrote: Hi Andrii, > It is useful to be able to access current->mm to, say, record a bunch of > VMA information right before the task exits (e.g., for stack > symbolization reasons when dealing with short-lived processes that exit > in the middle of profiling session). We currently do have > trace_sched_process_exit() in the exit path, but it is called a bit too > late, after exit_mm() resets current->mm to NULL, which makes it > unsuitable for inspecting and recording task's mm_struct-related data > when tracing process lifetimes. My fear of adding another task exit trace event is that it will get a bit confusing as that we now have trace_sched_process_exit() and also trace_task_exit() with slightly different semantics. How about adding a trace_exit_mm()? Add that to the exit_mm() code? static void exit_mm(void) { struct mm_struct *mm = current->mm; exit_mm_release(current, mm); trace_exit_mm(mm); ?? > > There is a particularly suitable place, though, right after > taskstats_exit() is called, but before we do exit_mm(). taskstats > performs a similar kind of accounting that some applications do with > BPF, and so co-locating them seems like a good fit. > > Moving trace_sched_process_exit() a bit earlier would solve this problem > as well, and I'm open to that. But this might potentially change its > semantics a little, and so instead of risking that, I went for adding > a new trace_task_exit() tracepoint instead. Tracepoints have zero > overhead at runtime, unless actively traced, so this seems acceptable. > > Also, existing trace_sched_process_exit() tracepoint is notoriously > missing `group_dead` flag that is certainly useful in practice and some > of our production applications have to work around this. So plumb > `group_dead` through while at it, to have a richer and more complete > tracepoint. There should be no problem adding group_dead to the trace_sched_process_exit() trace event. Adding fields should never cause any user API breakage. -- Steve
On Tue, 1 Apr 2025 17:32:49 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > static void exit_mm(void) > { > struct mm_struct *mm = current->mm; > > exit_mm_release(current, mm); > trace_exit_mm(mm); > > ?? That should have been: static void exit_mm(void) { struct mm_struct *mm = current->mm; trace_exit_mm(mm); exit_mm_release(current, mm); -- Steve
On Tue, Apr 1, 2025 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Tue, 1 Apr 2025 11:40:21 -0700 > Andrii Nakryiko <andrii@kernel.org> wrote: > > Hi Andrii, > > > It is useful to be able to access current->mm to, say, record a bunch of > > VMA information right before the task exits (e.g., for stack > > symbolization reasons when dealing with short-lived processes that exit > > in the middle of profiling session). We currently do have > > trace_sched_process_exit() in the exit path, but it is called a bit too > > late, after exit_mm() resets current->mm to NULL, which makes it > > unsuitable for inspecting and recording task's mm_struct-related data > > when tracing process lifetimes. > > My fear of adding another task exit trace event is that it will get a > bit confusing as that we now have trace_sched_process_exit() and also > trace_task_exit() with slightly different semantics. > > How about adding a trace_exit_mm()? Add that to the exit_mm() code? This is kind of the worst of both worlds, no? We still have a new tracepoint, but this one can't tell if it's a `group_dead` situation or not... I can pass group_dead into exit_mm(), but it will be just for the sake of that new tracepoint. How bad would it be to just move trace_sched_process_exit() then? (and add group_dead there, as you mentioned)? > > static void exit_mm(void) > { > struct mm_struct *mm = current->mm; > > exit_mm_release(current, mm); > trace_exit_mm(mm); > > ?? > > > > > There is a particularly suitable place, though, right after > > taskstats_exit() is called, but before we do exit_mm(). taskstats > > performs a similar kind of accounting that some applications do with > > BPF, and so co-locating them seems like a good fit. > > > > Moving trace_sched_process_exit() a bit earlier would solve this problem > > as well, and I'm open to that. But this might potentially change its > > semantics a little, and so instead of risking that, I went for adding > > a new trace_task_exit() tracepoint instead. Tracepoints have zero > > overhead at runtime, unless actively traced, so this seems acceptable. > > > > Also, existing trace_sched_process_exit() tracepoint is notoriously > > missing `group_dead` flag that is certainly useful in practice and some > > of our production applications have to work around this. So plumb > > `group_dead` through while at it, to have a richer and more complete > > tracepoint. > > There should be no problem adding group_dead to the > trace_sched_process_exit() trace event. Adding fields should never cause > any user API breakage. > > -- Steve >
On Tue, 1 Apr 2025 15:04:11 -0700 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > How bad would it be to just move trace_sched_process_exit() then? (and > add group_dead there, as you mentioned)? I personally don't have an issue with that. In fact, the one place I used the sched_process_exit tracepoint, I had to change to use sched_process_free because it does too much after that. OK, let's just move the sched_process_exit tracepoint. It's in an arbitrary location anyway. -- Steve
On Tue, Apr 1, 2025 at 3:12 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Tue, 1 Apr 2025 15:04:11 -0700 > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > How bad would it be to just move trace_sched_process_exit() then? (and > > add group_dead there, as you mentioned)? > > I personally don't have an issue with that. In fact, the one place I used > the sched_process_exit tracepoint, I had to change to use > sched_process_free because it does too much after that. heh, I ran into that as well just recently and also had to use sched_process_free instead of sched_process_exit, because between exit and free we still can get sched_switch tracepoint trigger (so it's a bit too early to clean up whatever per-task state I maintain in BPF program). So yeah, I'm up for that as well, will send v2 just moving and extending the existing tracepoint. Thanks! > > OK, let's just move the sched_process_exit tracepoint. It's in an arbitrary > location anyway. > > -- Steve
On Tue 01-04-25 17:34:16, Steven Rostedt wrote: > On Tue, 1 Apr 2025 17:32:49 -0400 > Steven Rostedt <rostedt@goodmis.org> wrote: > > > static void exit_mm(void) > > { > > struct mm_struct *mm = current->mm; > > > > exit_mm_release(current, mm); > > trace_exit_mm(mm); > > > > ?? > > That should have been: > > static void exit_mm(void) > { > struct mm_struct *mm = current->mm; > > trace_exit_mm(mm); > exit_mm_release(current, mm); If the primary usecase is to get an overview of the mm before exiting then this is more appropriate.
On Tue 01-04-25 15:04:11, Andrii Nakryiko wrote: > On Tue, Apr 1, 2025 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Tue, 1 Apr 2025 11:40:21 -0700 > > Andrii Nakryiko <andrii@kernel.org> wrote: > > > > Hi Andrii, > > > > > It is useful to be able to access current->mm to, say, record a bunch of > > > VMA information right before the task exits (e.g., for stack > > > symbolization reasons when dealing with short-lived processes that exit > > > in the middle of profiling session). We currently do have > > > trace_sched_process_exit() in the exit path, but it is called a bit too > > > late, after exit_mm() resets current->mm to NULL, which makes it > > > unsuitable for inspecting and recording task's mm_struct-related data > > > when tracing process lifetimes. > > > > My fear of adding another task exit trace event is that it will get a > > bit confusing as that we now have trace_sched_process_exit() and also > > trace_task_exit() with slightly different semantics. > > > > How about adding a trace_exit_mm()? Add that to the exit_mm() code? > > This is kind of the worst of both worlds, no? We still have a new > tracepoint, but this one can't tell if it's a `group_dead` situation > or not... I can pass group_dead into exit_mm(), but it will be just > for the sake of that new tracepoint. Is it important to tell the difference between thread and the whole process group exiting? Please keep in mind that even group exit doesn't really imply the mm is going away (clone allows CLONE_VM without CLONE_SIGNAL - i.e. mm could be shared outside of thread group).
On Tue, Apr 01, 2025 at 11:40:21AM -0700, Andrii Nakryiko wrote: > It is useful to be able to access current->mm to, say, record a bunch of > VMA information right before the task exits (e.g., for stack > symbolization reasons when dealing with short-lived processes that exit > in the middle of profiling session). We currently do have > trace_sched_process_exit() in the exit path, but it is called a bit too > late, after exit_mm() resets current->mm to NULL, which makes it > unsuitable for inspecting and recording task's mm_struct-related data > when tracing process lifetimes. > > There is a particularly suitable place, though, right after > taskstats_exit() is called, but before we do exit_mm(). taskstats > performs a similar kind of accounting that some applications do with > BPF, and so co-locating them seems like a good fit. > > Moving trace_sched_process_exit() a bit earlier would solve this problem > as well, and I'm open to that. I don't see a problem with moving it.
On Tue, Apr 01, 2025 at 03:17:01PM -0700, Andrii Nakryiko wrote: > On Tue, Apr 1, 2025 at 3:12 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Tue, 1 Apr 2025 15:04:11 -0700 > > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > > > How bad would it be to just move trace_sched_process_exit() then? (and > > > add group_dead there, as you mentioned)? > > > > I personally don't have an issue with that. In fact, the one place I used > > the sched_process_exit tracepoint, I had to change to use > > sched_process_free because it does too much after that. > > heh, I ran into that as well just recently and also had to use > sched_process_free instead of sched_process_exit, because between exit > and free we still can get sched_switch tracepoint trigger (so it's a > bit too early to clean up whatever per-task state I maintain in BPF > program). > > So yeah, I'm up for that as well, will send v2 just moving and > extending the existing tracepoint. Thanks! +1, it'd be great to have the group_dead info, we also need to have some workarounds for that thanks, jirka > > > > > OK, let's just move the sched_process_exit tracepoint. It's in an arbitrary > > location anyway. > > > > -- Steve >
On Wed, 2 Apr 2025 09:20:37 +0200 Michal Hocko <mhocko@suse.com> wrote: > Is it important to tell the difference between thread and the > whole process group exiting? > > Please keep in mind that even group exit doesn't really imply the mm is > going away (clone allows CLONE_VM without CLONE_SIGNAL - i.e. mm could > be shared outside of thread group). The main reason I'm OK with just updating the sched_process_exit() tracepoint is because it is in an arbitrary location. The process is exiting, but because the tracepoint is basically in the middle of the routine, it doesn't really give us any information about the actual exit. This tracepoint does give us if a task is exiting from an mm. You are correct, it doesn't tell us if the mm is going away. If that is the purpose, then here should be a tracepoint in the exit_mm() code or perhaps even in the mput() function. -- Steve
On Wed, Apr 2, 2025 at 12:20 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 01-04-25 15:04:11, Andrii Nakryiko wrote: > > On Tue, Apr 1, 2025 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > On Tue, 1 Apr 2025 11:40:21 -0700 > > > Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > Hi Andrii, > > > > > > > It is useful to be able to access current->mm to, say, record a bunch of > > > > VMA information right before the task exits (e.g., for stack > > > > symbolization reasons when dealing with short-lived processes that exit > > > > in the middle of profiling session). We currently do have > > > > trace_sched_process_exit() in the exit path, but it is called a bit too > > > > late, after exit_mm() resets current->mm to NULL, which makes it > > > > unsuitable for inspecting and recording task's mm_struct-related data > > > > when tracing process lifetimes. > > > > > > My fear of adding another task exit trace event is that it will get a > > > bit confusing as that we now have trace_sched_process_exit() and also > > > trace_task_exit() with slightly different semantics. > > > > > > How about adding a trace_exit_mm()? Add that to the exit_mm() code? > > > > This is kind of the worst of both worlds, no? We still have a new > > tracepoint, but this one can't tell if it's a `group_dead` situation > > or not... I can pass group_dead into exit_mm(), but it will be just > > for the sake of that new tracepoint. > > Is it important to tell the difference between thread and the > whole process group exiting? Yes, it often is important. In the sense that process group exiting would trigger extra information gathering/aggregation/sending compared to just process (thread) existing. Both are important to track, but it's useful to be able to differentiate. > > Please keep in mind that even group exit doesn't really imply the mm is > going away (clone allows CLONE_VM without CLONE_SIGNAL - i.e. mm could > be shared outside of thread group). Yep, I realize that, and as Steven said, for cases like that a dedicated mm-specific tracepoint might be useful. But I'm not really aware of use cases caring about mm_struct itself, in isolation. It's all usually in the context of thread/process exit, and mm-related information is one of a bunch of extra information that's useful at that point. It's just so that with current sched_process_exit() tracepoint placement we (e.g., BPF program) gets control a bit too late. But it seems like everyone is OK just shifting sched_process_exit() to before exit_mm(), which should cover the use cases I had in mind. We can always add mm-specific tracepoint when the need arises. Thanks everyone for discussion and suggestions! > -- > Michal Hocko > SUSE Labs
diff --git a/include/trace/events/task.h b/include/trace/events/task.h index af535b053033..98f4ec060073 100644 --- a/include/trace/events/task.h +++ b/include/trace/events/task.h @@ -53,6 +53,30 @@ TRACE_EVENT(task_rename, __entry->oldcomm, __entry->newcomm, __entry->oom_score_adj) ); +TRACE_EVENT(task_exit, + + TP_PROTO(struct task_struct *task, bool group_dead), + + TP_ARGS(task, group_dead), + + TP_STRUCT__entry( + __field( pid_t, pid) + __array( char, comm, TASK_COMM_LEN) + __field( bool, group_dead) + ), + + TP_fast_assign( + __entry->pid = task->pid; + memcpy(__entry->comm, task->comm, TASK_COMM_LEN); + __entry->group_dead = group_dead; + ), + + TP_printk("pid=%d comm=%s group_dead=%s", + __entry->pid, __entry->comm, + __entry->group_dead ? "true" : "false" + ) +); + /** * task_prctl_unknown - called on unknown prctl() option * @option: option passed diff --git a/kernel/exit.c b/kernel/exit.c index c2e6c7b7779f..8496fc07f9c8 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -54,6 +54,7 @@ #include <linux/init_task.h> #include <linux/perf_event.h> #include <trace/events/sched.h> +#include <trace/events/task.h> #include <linux/hw_breakpoint.h> #include <linux/oom.h> #include <linux/writeback.h> @@ -937,6 +938,7 @@ void __noreturn do_exit(long code) tsk->exit_code = code; taskstats_exit(tsk, group_dead); + trace_task_exit(tsk, group_dead); exit_mm();
It is useful to be able to access current->mm to, say, record a bunch of VMA information right before the task exits (e.g., for stack symbolization reasons when dealing with short-lived processes that exit in the middle of profiling session). We currently do have trace_sched_process_exit() in the exit path, but it is called a bit too late, after exit_mm() resets current->mm to NULL, which makes it unsuitable for inspecting and recording task's mm_struct-related data when tracing process lifetimes. There is a particularly suitable place, though, right after taskstats_exit() is called, but before we do exit_mm(). taskstats performs a similar kind of accounting that some applications do with BPF, and so co-locating them seems like a good fit. Moving trace_sched_process_exit() a bit earlier would solve this problem as well, and I'm open to that. But this might potentially change its semantics a little, and so instead of risking that, I went for adding a new trace_task_exit() tracepoint instead. Tracepoints have zero overhead at runtime, unless actively traced, so this seems acceptable. Also, existing trace_sched_process_exit() tracepoint is notoriously missing `group_dead` flag that is certainly useful in practice and some of our production applications have to work around this. So plumb `group_dead` through while at it, to have a richer and more complete tracepoint. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- include/trace/events/task.h | 24 ++++++++++++++++++++++++ kernel/exit.c | 2 ++ 2 files changed, 26 insertions(+)