Message ID | 20240602023754.25443-3-laoar.shao@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Series | kernel: Avoid memcpy of task comm |
On Sun, 2 Jun 2024 10:37:50 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> Using __get_task_comm() to read the task comm ensures that the name is
> always NUL-terminated, regardless of the source string. This approach also
> facilitates future extensions to the task comm.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
>  include/linux/tracepoint.h     |  4 ++--
>  include/trace/events/block.h   | 10 +++++-----
>  include/trace/events/oom.h     |  2 +-
>  include/trace/events/osnoise.h |  2 +-
>  include/trace/events/sched.h   | 27 ++++++++++++++-------------
>  include/trace/events/signal.h  |  2 +-
>  include/trace/events/task.h    |  4 ++--
>  7 files changed, 26 insertions(+), 25 deletions(-)
>

[..]

> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 68973f650c26..2a9d7c62c58a 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -9,6 +9,7 @@
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/tracepoint.h>
>  #include <linux/binfmts.h>
> +#include <linux/sched.h>
>
>  /*
>   * Tracepoint for calling kthread_stop, performed to end a kthread:
> @@ -25,7 +26,7 @@ TRACE_EVENT(sched_kthread_stop,
>  	),
>
>  	TP_fast_assign(
> -		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__get_task_comm(__entry->comm, TASK_COMM_LEN, t);
>  		__entry->pid = t->pid;
>  	),
>
> @@ -152,7 +153,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
>  	),
>
>  	TP_fast_assign(
> -		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> +		__get_task_comm(__entry->comm, TASK_COMM_LEN, p);
>  		__entry->pid = p->pid;
>  		__entry->prio = p->prio; /* XXX SCHED_DEADLINE */
>  		__entry->target_cpu = task_cpu(p);
> @@ -239,11 +240,11 @@ TRACE_EVENT(sched_switch,
>  	),
>
>  	TP_fast_assign(
> -		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> +		__get_task_comm(__entry->next_comm, TASK_COMM_LEN, next);
>  		__entry->prev_pid = prev->pid;
>  		__entry->prev_prio = prev->prio;
>  		__entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
> -		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> +		__get_task_comm(__entry->prev_comm, TASK_COMM_LEN, prev);
>  		__entry->next_pid = next->pid;
>  		__entry->next_prio = next->prio;
>  		/* XXX SCHED_DEADLINE */

sched_switch is special so we could probably hold off, but the rest should
be converted to the normal way strings are processed in trace events. That
is, to use __string(), __assign_str() and __get_str() for task->comm. I've
been wanting to do that for a while, but thought that memcpy() was a bit
faster than the need for strlen(). But this now needs to test the length of
comm. This method will also allow comms to be recorded that are larger than
16 bytes (if we extend comm).

TRACE_EVENT(sched_migrate_task,

	TP_PROTO(struct task_struct *p, int dest_cpu),

	TP_ARGS(p, dest_cpu),

	TP_STRUCT__entry(
-		__array( char, comm, TASK_COMM_LEN )
+		__string( comm, strlen(comm) )
		__field( pid_t, pid )
		__field( int, prio )
		__field( int, orig_cpu )
		__field( int, dest_cpu )
	),

	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__assign_str(comm);
		__entry->pid = p->pid;
		__entry->prio = p->prio; /* XXX SCHED_DEADLINE */
		__entry->orig_cpu = task_cpu(p);
		__entry->dest_cpu = dest_cpu;
	),

	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d",
-		  __entry->comm, __entry->pid, __entry->prio,
+		  __get_str(comm), __entry->pid, __entry->prio,
		  __entry->orig_cpu, __entry->dest_cpu)
);

-- Steve

> @@ -286,7 +287,7 @@ TRACE_EVENT(sched_migrate_task,
>  	),
>
>  	TP_fast_assign(
> -		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> +		__get_task_comm(__entry->comm, TASK_COMM_LEN, p);
>  		__entry->pid = p->pid;
>  		__entry->prio = p->prio; /* XXX SCHED_DEADLINE */
>  		__entry->orig_cpu = task_cpu(p);
On Mon, 3 Jun 2024 at 14:19, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> -		__array( char, comm, TASK_COMM_LEN )
> +		__string( comm, strlen(comm) )

Is this actually safe if 'comm[]' is being modified at the same time?
The 'strlen()' will not be consistent with the string copy.

Because that is very much the case. It's not a stable source.

For example, strlen() may return 5. But by the time you then actually
copy the data, the string may have changed, and there would not
necessarily be a NUL character at comm[5] any more. It might be
further in the string, or it might be earlier.

             Linus
On Mon, 3 Jun 2024 14:42:10 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 3 Jun 2024 at 14:19, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > -		__array( char, comm, TASK_COMM_LEN )
> > +		__string( comm, strlen(comm) )
>
> Is this actually safe if 'comm[]' is being modified at the same time?
> The 'strlen()' will not be consistent with the string copy.

First, I realized that it should actually be:

	__string( comm, task->comm )

But your question is still a valid question, as the internal logic will
call strlen() on the second parameter.

>
> Because that is very much the case. It's not a stable source.
>
> For example, strlen() may return 5. But by the time you then actually
> copy the data, the string may have changed, and there would not
> necessarily be a NUL character at comm[5] any more. It might be
> further in the string, or it might be earlier.

The logic behind __string() and __assign_str() will always add a NUL
character.

__string() is defined as:

static inline const char *__string_src(const char *str)
{
	if (!str)
		return EVENT_NULL_STR;
	return str;
}

#undef __dynamic_array
#define __dynamic_array(type, item, len)				\
	__item_length = (len) * sizeof(type);				\
	__data_offsets->item = __data_size +				\
			       offsetof(typeof(*entry), __data);	\
	__data_offsets->item |= __item_length << 16;			\
	__data_size += __item_length;

#undef __string
#define __string(item, src) __dynamic_array(char, item,		\
		    strlen(__string_src(src)) + 1)			\
	__data_offsets->item##_ptr_ = src;

The above will use the strlen(src) to specify the amount of memory to
allocate on the ring buffer:

	"strlen(__string_src(src)) + 1)"

This is stored on a special structure for the entry and used in the
__assign_str() (the reason I removed the source to that macro during this
merge window).

#undef __assign_str
#define __assign_str(dst)						\
	do {								\
		char *__str__ = __get_str(dst);				\
		int __len__ = __get_dynamic_array_len(dst) - 1;	\
		memcpy(__str__, __data_offsets.dst##_ptr_ ? :		\
		       EVENT_NULL_STR, __len__);			\
		__str__[__len__] = '\0';				\
	} while (0)

The source of the string is copied via memcpy() using the length stored from
the __string() macro (minus 1), and then '\0' is added to it to force the
NUL character to be in the memory for the string.

-- Steve
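The two-step scheme described above can be sketched in plain userspace C. This is only an illustration of the idea, not the kernel macros: reserve_len() and assign_str() are hypothetical names standing in for what __string() and __assign_str() do, and the sketch assumes the source buffer is at least as large as the reserved length.

```c
#include <stddef.h>
#include <string.h>

/*
 * Step 1 (the role __string() plays): measure the source once and
 * reserve room for the text plus a terminating NUL.
 * Hypothetical helper, not a kernel function.
 */
static size_t reserve_len(const char *src)
{
	return strlen(src) + 1;
}

/*
 * Step 2 (the role __assign_str() plays): copy exactly reserved - 1
 * bytes, using the length recorded in step 1, then force a NUL into
 * the last reserved byte. The source is not measured a second time.
 */
static void assign_str(char *dst, const char *src, size_t reserved)
{
	size_t len = reserved - 1;

	memcpy(dst, src, len);
	dst[len] = '\0';
}
```

If the source changes between the two steps, the copy may hold a mix of old and new bytes, but it never exceeds the reserved size and always ends in a NUL, which is the property being claimed in the reply above.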
On Mon, 3 Jun 2024 at 15:18, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The logic behind __string() and __assign_str() will always add a NUL
> character.

Ok. But then you still end up with the issue that now the profiles are
different, and you have an 8-byte pointer to dynamically allocated
memory instead of just the simpler comm[TASK_COMM_LEN].

Is that actually a good idea for tracing?

We're trying to fix the core code to be cleaner for places that may
actually *care* (like 'ps').

Would we really want to touch this part of tracing?

             Linus
On Mon, 3 Jun 2024 15:23:48 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 3 Jun 2024 at 15:18, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > The logic behind __string() and __assign_str() will always add a NUL
> > character.
>
> Ok. But then you still end up with the issue that now the profiles are
> different, and you have a 8-byte pointer to dynamically allocated
> memory instead of just the simpler comm[TASK_COMM_LEN].

It's actually a 4 byte meta data that holds it.

	__data_offsets->item##_ptr_ = src;

The __data_offsets is a local helper structure that holds the information
about where the string data will be in the ring buffer event, while the
event is being recorded.

The actual data in the ring buffer is a 4 byte word, where 2 bytes is for
the size of the string and 2 bytes is for the offset into the event.

If you have a task->comm = "ps", that will take up 12 bytes in the ring
buffer.

field:
  2 bytes: for where in the event the "ps" is.
  2 bytes: for the length of ps.

Then after the data, you have 3 or 4 bytes to hold "ps\0". (the data always
ends on a 4 byte alignment).

The amount of data in the ring buffer to hold "ps" just went from 16 bytes
down to 12 bytes, and nothing is truncated if we extend the size of comm.

>
> Is that actually a good idea for tracing?
>
> We're trying to fix the core code to be cleaner for places that may
> actually *care* (like 'ps').
>
> Would we really want to touch this part of tracing?

Note, I've been wanting to get rid of the hard coded TASK_COMM_LEN from the
events for a while. As I mentioned before, the only reason the memcpy exists
is because it was added before the __string() logic was. Then it became
somewhat of a habit to do that for everything that referenced task->comm. :-/

-- Steve
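The 4-byte descriptor described above can be sketched as a pack/unpack pair. This follows the `__data_offsets->item |= __item_length << 16` line quoted earlier in the thread (offset in the low 16 bits, length in the high 16 bits); pack_loc(), loc_offset() and loc_len() are hypothetical names used only for illustration, not kernel helpers.

```c
#include <stdint.h>

/*
 * Pack a dynamic-string descriptor into one 32-bit word:
 * low 16 bits  = offset of the string data within the event,
 * high 16 bits = length of the string data (including the NUL).
 */
static uint32_t pack_loc(uint16_t offset, uint16_t len)
{
	return (uint32_t)offset | ((uint32_t)len << 16);
}

/* Recover the offset (low half) from a packed descriptor. */
static uint16_t loc_offset(uint32_t loc)
{
	return (uint16_t)(loc & 0xffff);
}

/* Recover the length (high half) from a packed descriptor. */
static uint16_t loc_len(uint32_t loc)
{
	return (uint16_t)(loc >> 16);
}
```

A reader of the event would use loc_offset() to find the string and loc_len() to know how many bytes belong to it, so the fixed part of the event stays the same size regardless of how long the comm is.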
On Mon, 3 Jun 2024 at 15:36, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> It's actually a 4 byte meta data that holds it.

Ok, much better.

> Note, I've been wanting to get rid of the hard coded TASK_COMM_LEN from the
> events for a while. As I mentioned before, the only reason the memcpy exists
> is because it was added before the __string() logic was. Then it became
> somewhat of a habit to do that for everything that referenced task->comm. :-/

Ok, as long as it's an actual improvement, then ack.

             Linus
On Mon, 3 Jun 2024 18:37:42 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Note, I've been wanting to get rid of the hard coded TASK_COMM_LEN from the
> events for a while. As I mentioned before, the only reason the memcpy exists
> is because it was added before the __string() logic was. Then it became
> somewhat of a habit to do that for everything that referenced task->comm. :-/

My point is that if we are going to be changing the way we record
task->comm in the events, might as well do the change I've been wanting to
do for years. I'd hold off on the sched_switch event, as that's the one
event (and perhaps sched_waking events) where user space may have
hardcoded how to read it.

I would be interested in changing it, just to see what breaks, so I would
know where to go fix things. But keep it a separate patch so that it could
be easily reverted.

-- Steve
On Tue, Jun 4, 2024 at 6:39 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 3 Jun 2024 18:37:42 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Note, I've been wanting to get rid of the hard coded TASK_COMM_LEN from the
> > events for a while. As I mentioned before, the only reason the memcpy exists
> > is because it was added before the __string() logic was. Then it became
> > somewhat of a habit to do that for everything that referenced task->comm. :-/
>
> My point is that if we are going to be changing the way we record
> task->comm in the events, might as well do the change I've been wanting to
> do for years. I'd hold off on the sched_switch event, as that's the one
> event (and perhaps sched_waking events) where user space may have
> hardcoded how to read it.
>
> I would be interested in changing it, just to see what breaks, so I would
> know where to go fix things. But keep it a separate patch so that it could
> be easily reverted.
>

I will drop the tracing part, which includes this patch and patch #6, in
the next version, and leave it to you.
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 689b6d71590e..6381824d8107 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -519,10 +519,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  *	*
  *
  *	TP_fast_assign(
- *		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+ *		__get_task_comm(__entry->next_comm, TASK_COMM_LEN, next);
  *		__entry->prev_pid = prev->pid;
  *		__entry->prev_prio = prev->prio;
- *		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+ *		__get_task_comm(__entry->prev_comm, TASK_COMM_LEN, prev);
  *		__entry->next_pid = next->pid;
  *		__entry->next_prio = next->prio;
  *	),
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 0e128ad51460..6f8c5d0014e6 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -193,7 +193,7 @@ DECLARE_EVENT_CLASS(block_rq,
 
 		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
 		__get_str(cmd)[0] = '\0';
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, current);
 	),
 
 	TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
@@ -328,7 +328,7 @@ DECLARE_EVENT_CLASS(block_bio,
 		__entry->sector = bio->bi_iter.bi_sector;
 		__entry->nr_sector = bio_sectors(bio);
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, current);
 	),
 
 	TP_printk("%d,%d %s %llu + %u [%s]",
@@ -415,7 +415,7 @@ TRACE_EVENT(block_plug,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, current);
 	),
 
 	TP_printk("[%s]", __entry->comm)
@@ -434,7 +434,7 @@ DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_fast_assign(
 		__entry->nr_rq = depth;
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, current);
 	),
 
 	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
@@ -485,7 +485,7 @@ TRACE_EVENT(block_split,
 		__entry->sector = bio->bi_iter.bi_sector;
 		__entry->new_sector = new_sector;
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, current);
 	),
 
 	TP_printk("%d,%d %s %llu / %llu [%s]",
diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index b799f3bcba82..f29be9ebcd4d 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -23,7 +23,7 @@ TRACE_EVENT(oom_score_adj_update,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, task);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
 
diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
index 82f741ec0f57..50f480655722 100644
--- a/include/trace/events/osnoise.h
+++ b/include/trace/events/osnoise.h
@@ -20,7 +20,7 @@ TRACE_EVENT(thread_noise,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, t);
 		__entry->pid = t->pid;
 		__entry->start = start;
 		__entry->duration = duration;
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 68973f650c26..2a9d7c62c58a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -9,6 +9,7 @@
 #include <linux/sched/numa_balancing.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
+#include <linux/sched.h>
 
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
@@ -25,7 +26,7 @@ TRACE_EVENT(sched_kthread_stop,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, t);
 		__entry->pid = t->pid;
 	),
 
@@ -152,7 +153,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, p);
 		__entry->pid = p->pid;
 		__entry->prio = p->prio; /* XXX SCHED_DEADLINE */
 		__entry->target_cpu = task_cpu(p);
@@ -239,11 +240,11 @@ TRACE_EVENT(sched_switch,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->next_comm, TASK_COMM_LEN, next);
 		__entry->prev_pid = prev->pid;
 		__entry->prev_prio = prev->prio;
 		__entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
-		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->prev_comm, TASK_COMM_LEN, prev);
 		__entry->next_pid = next->pid;
 		__entry->next_prio = next->prio;
 		/* XXX SCHED_DEADLINE */
@@ -286,7 +287,7 @@ TRACE_EVENT(sched_migrate_task,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, p);
 		__entry->pid = p->pid;
 		__entry->prio = p->prio; /* XXX SCHED_DEADLINE */
 		__entry->orig_cpu = task_cpu(p);
@@ -311,7 +312,7 @@ DECLARE_EVENT_CLASS(sched_process_template,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, p);
 		__entry->pid = p->pid;
 		__entry->prio = p->prio; /* XXX SCHED_DEADLINE */
 	),
@@ -357,7 +358,7 @@ TRACE_EVENT(sched_process_wait,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, current);
 		__entry->pid = pid_nr(pid);
 		__entry->prio = current->prio; /* XXX SCHED_DEADLINE */
 	),
@@ -383,9 +384,9 @@ TRACE_EVENT(sched_process_fork,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->parent_comm, parent->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->parent_comm, TASK_COMM_LEN, parent);
 		__entry->parent_pid = parent->pid;
-		memcpy(__entry->child_comm, child->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->child_comm, TASK_COMM_LEN, child);
 		__entry->child_pid = child->pid;
 	),
 
@@ -481,7 +482,7 @@ DECLARE_EVENT_CLASS_SCHEDSTAT(sched_stat_template,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, tsk);
 		__entry->pid = tsk->pid;
 		__entry->delay = delay;
 	),
@@ -539,7 +540,7 @@ DECLARE_EVENT_CLASS(sched_stat_runtime,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, tsk);
 		__entry->pid = tsk->pid;
 		__entry->runtime = runtime;
 	),
@@ -571,7 +572,7 @@ TRACE_EVENT(sched_pi_setprio,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, tsk);
 		__entry->pid = tsk->pid;
 		__entry->oldprio = tsk->prio;
 		__entry->newprio = pi_task ?
@@ -596,7 +597,7 @@ TRACE_EVENT(sched_process_hang,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, tsk);
 		__entry->pid = tsk->pid;
 	),
 
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 1db7e4b07c01..8f317a265392 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -67,7 +67,7 @@ TRACE_EVENT(signal_generate,
 	TP_fast_assign(
 		__entry->sig = sig;
 		TP_STORE_SIGINFO(__entry, info);
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, task);
 		__entry->pid = task->pid;
 		__entry->group = group;
 		__entry->result = result;
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 47b527464d1a..77c14707460e 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -21,7 +21,7 @@ TRACE_EVENT(task_newtask,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		__get_task_comm(__entry->comm, TASK_COMM_LEN, task);
 		__entry->clone_flags = clone_flags;
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
@@ -46,7 +46,7 @@ TRACE_EVENT(task_rename,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
+		__get_task_comm(entry->oldcomm, TASK_COMM_LEN, task);
 		strscpy(entry->newcomm, comm, TASK_COMM_LEN);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
Using __get_task_comm() to read the task comm ensures that the name is
always NUL-terminated, regardless of the source string. This approach also
facilitates future extensions to the task comm.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/tracepoint.h     |  4 ++--
 include/trace/events/block.h   | 10 +++++-----
 include/trace/events/oom.h     |  2 +-
 include/trace/events/osnoise.h |  2 +-
 include/trace/events/sched.h   | 27 ++++++++++++++-------------
 include/trace/events/signal.h  |  2 +-
 include/trace/events/task.h    |  4 ++--
 7 files changed, 26 insertions(+), 25 deletions(-)
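The NUL-termination guarantee named in the changelog can be illustrated with a small userspace sketch. copy_comm() below is a hypothetical stand-in written for this example; it is not the kernel's __get_task_comm() and makes no claim about how that function is implemented. It only shows the difference from a raw memcpy(): the destination is terminated even when the source holds no NUL within the copied range.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical bounded copy: take at most size - 1 bytes from src,
 * stop early at a NUL, and zero-fill the rest of dst so the result
 * is always NUL-terminated.
 */
static void copy_comm(char *dst, size_t size, const char *src)
{
	size_t n = 0;

	while (n + 1 < size && src[n] != '\0') {
		dst[n] = src[n];
		n++;
	}
	memset(dst + n, 0, size - n);	/* pad and terminate */
}
```

With a 16-byte source that contains no NUL at all, a memcpy() of 16 bytes would leave the destination unterminated, while this copy keeps 15 bytes and ends with a NUL.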