Message ID | 20240223114453.335809-5-tobias@waldekranz.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: switchdev: Tracepoints | expand |
On Fri, 23 Feb 2024 12:44:53 +0100 Tobias Waldekranz <tobias@waldekranz.com> wrote: > Add a basic set of tracepoints: > > - switchdev_defer: Fires whenever an operation is enqueued to the > switchdev workqueue for deferred delivery. > > - switchdev_call_{atomic,blocking}: Fires whenever a notification is > sent to the corresponding switchdev notifier chain. > > - switchdev_call_replay: Fires whenever a notification is sent to a > specific driver's notifier block, in response to a replay request. > > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> > --- > include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++ > net/switchdev/switchdev.c | 71 +++++++++++++++++++++++++----- > 2 files changed, 135 insertions(+), 10 deletions(-) > create mode 100644 include/trace/events/switchdev.h > > diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h > new file mode 100644 > index 000000000000..dcaf6870d017 > --- /dev/null > +++ b/include/trace/events/switchdev.h > @@ -0,0 +1,74 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM switchdev > + > +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_SWITCHDEV_H > + > +#include <linux/tracepoint.h> > +#include <net/switchdev.h> > + > +#define SWITCHDEV_TRACE_MSG_MAX 128 128 bytes is awfully big to waste on the ring buffer. What's the average size of a string? > + > +DECLARE_EVENT_CLASS(switchdev_call, > + TP_PROTO(unsigned long val, > + const struct switchdev_notifier_info *info, > + int err), > + > + TP_ARGS(val, info, err), > + > + TP_STRUCT__entry( > + __field(unsigned long, val) > + __string(dev, info->dev ? netdev_name(info->dev) : "(null)") > + __field(const struct switchdev_notifier_info *, info) > + __field(int, err) > + __array(char, msg, SWITCHDEV_TRACE_MSG_MAX) > + ), > + > + TP_fast_assign( > + __entry->val = val; > + __assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)"); > + __entry->info = info; > + __entry->err = err; > + switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX); Is it possible to just store the information in the trace event and then call the above function in the read stage? There's helpers to pass strings around (namely a struct trace_seq *p). It would require a plugin for libtraceevent if you want to expose it to user space tools for trace-cmd and perf though. Another possibility is if this event will not race with other events on he same CPU, you could create a per-cpu buffer, write into that, and then use __string() and __assign_str() to save it, as traces happen with preemption disabled. -- Steve > + ), > + > + TP_printk("dev %s %s -> %d", __get_str(dev), __entry->msg, __entry->err) > +); > + > +DEFINE_EVENT(switchdev_call, switchdev_defer, > + TP_PROTO(unsigned long val, > + const struct switchdev_notifier_info *info, > + int err), > + > + TP_ARGS(val, info, err) > +); > + > +DEFINE_EVENT(switchdev_call, switchdev_call_atomic, > + TP_PROTO(unsigned long val, > + const struct switchdev_notifier_info *info, > + int err), > + > + TP_ARGS(val, info, err) > +); > + > +DEFINE_EVENT(switchdev_call, switchdev_call_blocking, > + TP_PROTO(unsigned long val, > + const struct switchdev_notifier_info *info, > + int err), > + > + TP_ARGS(val, info, err) > +); > + > +DEFINE_EVENT(switchdev_call, switchdev_call_replay, > + TP_PROTO(unsigned long val, > + const struct switchdev_notifier_info *info, > + int err), > + > + TP_ARGS(val, info, err) > +); > + > +#endif /* _TRACE_SWITCHDEV_H */ > +
On fre, feb 23, 2024 at 10:38, Steven Rostedt <rostedt@goodmis.org> wrote: > On Fri, 23 Feb 2024 12:44:53 +0100 > Tobias Waldekranz <tobias@waldekranz.com> wrote: > >> Add a basic set of tracepoints: >> >> - switchdev_defer: Fires whenever an operation is enqueued to the >> switchdev workqueue for deferred delivery. >> >> - switchdev_call_{atomic,blocking}: Fires whenever a notification is >> sent to the corresponding switchdev notifier chain. >> >> - switchdev_call_replay: Fires whenever a notification is sent to a >> specific driver's notifier block, in response to a replay request. >> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> >> --- >> include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++ >> net/switchdev/switchdev.c | 71 +++++++++++++++++++++++++----- >> 2 files changed, 135 insertions(+), 10 deletions(-) >> create mode 100644 include/trace/events/switchdev.h >> >> diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h >> new file mode 100644 >> index 000000000000..dcaf6870d017 >> --- /dev/null >> +++ b/include/trace/events/switchdev.h >> @@ -0,0 +1,74 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#undef TRACE_SYSTEM >> +#define TRACE_SYSTEM switchdev >> + >> +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ) >> +#define _TRACE_SWITCHDEV_H >> + >> +#include <linux/tracepoint.h> >> +#include <net/switchdev.h> >> + >> +#define SWITCHDEV_TRACE_MSG_MAX 128 > > 128 bytes is awfully big to waste on the ring buffer. What's the average > size of a string? I would say the typical message is around 60-80 bytes. Some common examples: PORT_OBJ_ADD PORT_VLAN(flags 0x0 orig br0) vid 1 flags 0x27 PORT_OBJ_DEL HOST_MDB(flags 0x0 orig br0) vid 100 addr 33:33:ff:ff:00:09 The worst case I can think of currently is 95 characters: VXLAN_FDB_ADD_TO_DEVICE vid 1000 addr de:ad:be:ef:00:01 added_by_user is_local locked offloaded >> + >> +DECLARE_EVENT_CLASS(switchdev_call, >> + TP_PROTO(unsigned long val, >> + const struct switchdev_notifier_info *info, >> + int err), >> + >> + TP_ARGS(val, info, err), >> + >> + TP_STRUCT__entry( >> + __field(unsigned long, val) >> + __string(dev, info->dev ? netdev_name(info->dev) : "(null)") >> + __field(const struct switchdev_notifier_info *, info) >> + __field(int, err) >> + __array(char, msg, SWITCHDEV_TRACE_MSG_MAX) >> + ), >> + >> + TP_fast_assign( >> + __entry->val = val; >> + __assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)"); >> + __entry->info = info; >> + __entry->err = err; >> + switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX); > > Is it possible to just store the information in the trace event and then > call the above function in the read stage? There's helpers to pass strings > around (namely a struct trace_seq *p). I'm a complete novice when it comes to tracepoint internals. Am I right in assuming that TP_printk may execute at a much later time than TP_fast_assign? E.g., the object referenced by __entry->info may very well have been freed by that time? That at least seems to be what my naive refactor to replace __entry->msg with __get_buf() suggests :) If so, the layout of the switchdev_notifier_* structs makes it a bit cumbersome to clone the notification in the assign phase, as the size of a specific notification (e.g., switchdev_notifier_port_obj_info) is not known by the common notification (switchdev_notifier_info). In the case of switchdev objects, the problem repeats a second time. E.g., the size of switchdev_obj_port_vlan is not known by switchdev_obj. > It would require a plugin for libtraceevent if you want to expose it to > user space tools for trace-cmd and perf though. > > Another possibility is if this event will not race with other events on he > same CPU, you could create a per-cpu buffer, write into that, and then use > __string() and __assign_str() to save it, as traces happen with preemption > disabled. But bottom halves are still enabled I suppose? Notifications can be generated both from process context (e.g., users configuring the bridge with iproute2), and from bridge packet processing (e.g., adding new neighbors to the FDB). So I don't think that would work in this case. > -- Steve Thanks for taking the time! >> + ), >> + >> + TP_printk("dev %s %s -> %d", __get_str(dev), __entry->msg, __entry->err) >> +); >> + >> +DEFINE_EVENT(switchdev_call, switchdev_defer, >> + TP_PROTO(unsigned long val, >> + const struct switchdev_notifier_info *info, >> + int err), >> + >> + TP_ARGS(val, info, err) >> +); >> + >> +DEFINE_EVENT(switchdev_call, switchdev_call_atomic, >> + TP_PROTO(unsigned long val, >> + const struct switchdev_notifier_info *info, >> + int err), >> + >> + TP_ARGS(val, info, err) >> +); >> + >> +DEFINE_EVENT(switchdev_call, switchdev_call_blocking, >> + TP_PROTO(unsigned long val, >> + const struct switchdev_notifier_info *info, >> + int err), >> + >> + TP_ARGS(val, info, err) >> +); >> + >> +DEFINE_EVENT(switchdev_call, switchdev_call_replay, >> + TP_PROTO(unsigned long val, >> + const struct switchdev_notifier_info *info, >> + int err), >> + >> + TP_ARGS(val, info, err) >> +); >> + >> +#endif /* _TRACE_SWITCHDEV_H */ >> +
On Fri, 2024-02-23 at 10:38 -0500, Steven Rostedt wrote: > On Fri, 23 Feb 2024 12:44:53 +0100 > Tobias Waldekranz <tobias@waldekranz.com> wrote: > > > Add a basic set of tracepoints: > > > > - switchdev_defer: Fires whenever an operation is enqueued to the > > switchdev workqueue for deferred delivery. > > > > - switchdev_call_{atomic,blocking}: Fires whenever a notification is > > sent to the corresponding switchdev notifier chain. > > > > - switchdev_call_replay: Fires whenever a notification is sent to a > > specific driver's notifier block, in response to a replay request. > > > > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> > > --- > > include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++ > > net/switchdev/switchdev.c | 71 +++++++++++++++++++++++++----- > > 2 files changed, 135 insertions(+), 10 deletions(-) > > create mode 100644 include/trace/events/switchdev.h > > > > diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h > > new file mode 100644 > > index 000000000000..dcaf6870d017 > > --- /dev/null > > +++ b/include/trace/events/switchdev.h > > @@ -0,0 +1,74 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#undef TRACE_SYSTEM > > +#define TRACE_SYSTEM switchdev > > + > > +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ) > > +#define _TRACE_SWITCHDEV_H > > + > > +#include <linux/tracepoint.h> > > +#include <net/switchdev.h> > > + > > +#define SWITCHDEV_TRACE_MSG_MAX 128 > > 128 bytes is awfully big to waste on the ring buffer. What's the average > size of a string? > > > + > > +DECLARE_EVENT_CLASS(switchdev_call, > > + TP_PROTO(unsigned long val, > > + const struct switchdev_notifier_info *info, > > + int err), > > + > > + TP_ARGS(val, info, err), > > + > > + TP_STRUCT__entry( > > + __field(unsigned long, val) > > + __string(dev, info->dev ? netdev_name(info->dev) : "(null)") > > + __field(const struct switchdev_notifier_info *, info) > > + __field(int, err) > > + __array(char, msg, SWITCHDEV_TRACE_MSG_MAX) > > + ), > > + > > + TP_fast_assign( > > + __entry->val = val; > > + __assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)"); > > + __entry->info = info; > > + __entry->err = err; > > + switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX); > > Is it possible to just store the information in the trace event and then > call the above function in the read stage? I agree with Steven: it looks like that with the above code the tracepoint itself will become measurably costily in terms of CPU cycles: we want to avoid that. Perhaps using different tracepoints with different notifier_block type would help? so that each trace point could just copy a few specific fields. Cheers, Paolo
On tis, feb 27, 2024 at 11:04, Paolo Abeni <pabeni@redhat.com> wrote: > On Fri, 2024-02-23 at 10:38 -0500, Steven Rostedt wrote: >> On Fri, 23 Feb 2024 12:44:53 +0100 >> Tobias Waldekranz <tobias@waldekranz.com> wrote: >> >> > Add a basic set of tracepoints: >> > >> > - switchdev_defer: Fires whenever an operation is enqueued to the >> > switchdev workqueue for deferred delivery. >> > >> > - switchdev_call_{atomic,blocking}: Fires whenever a notification is >> > sent to the corresponding switchdev notifier chain. >> > >> > - switchdev_call_replay: Fires whenever a notification is sent to a >> > specific driver's notifier block, in response to a replay request. >> > >> > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> >> > --- >> > include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++ >> > net/switchdev/switchdev.c | 71 +++++++++++++++++++++++++----- >> > 2 files changed, 135 insertions(+), 10 deletions(-) >> > create mode 100644 include/trace/events/switchdev.h >> > >> > diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h >> > new file mode 100644 >> > index 000000000000..dcaf6870d017 >> > --- /dev/null >> > +++ b/include/trace/events/switchdev.h >> > @@ -0,0 +1,74 @@ >> > +/* SPDX-License-Identifier: GPL-2.0 */ >> > +#undef TRACE_SYSTEM >> > +#define TRACE_SYSTEM switchdev >> > + >> > +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ) >> > +#define _TRACE_SWITCHDEV_H >> > + >> > +#include <linux/tracepoint.h> >> > +#include <net/switchdev.h> >> > + >> > +#define SWITCHDEV_TRACE_MSG_MAX 128 >> >> 128 bytes is awfully big to waste on the ring buffer. What's the average >> size of a string? >> >> > + >> > +DECLARE_EVENT_CLASS(switchdev_call, >> > + TP_PROTO(unsigned long val, >> > + const struct switchdev_notifier_info *info, >> > + int err), >> > + >> > + TP_ARGS(val, info, err), >> > + >> > + TP_STRUCT__entry( >> > + __field(unsigned long, val) >> > + __string(dev, info->dev ? netdev_name(info->dev) : "(null)") >> > + __field(const struct switchdev_notifier_info *, info) >> > + __field(int, err) >> > + __array(char, msg, SWITCHDEV_TRACE_MSG_MAX) >> > + ), >> > + >> > + TP_fast_assign( >> > + __entry->val = val; >> > + __assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)"); >> > + __entry->info = info; >> > + __entry->err = err; >> > + switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX); >> >> Is it possible to just store the information in the trace event and then >> call the above function in the read stage? > > I agree with Steven: it looks like that with the above code the > tracepoint itself will become measurably costily in terms of CPU > cycles: we want to avoid that. > > Perhaps using different tracepoints with different notifier_block type > would help? so that each trace point could just copy a few specific > fields. This can be done, but you will end up having to duplicate the decoding and formatting logic from switchdev-str.c, with the additional hurdle of having to figure out the sizes of all referenced objects in order to create flattened versions of every notification type. What I like about the current approach is that when new notification and object types are added, switchdev_notifier_str will automatically be able to decode them and give you some rough idea of what is going on, even if no new message specific decoding logic is added. It is also reusable by drivers that might want to decode notifications or objects in error messages. Would some variant of (how I understand) Steven's suggestion to instead store the formatted message in a dynamic array (__assign_str()), rather than in the tracepoint entry, be acceptable?
On Wed, 28 Feb 2024 11:47:24 +0100 Tobias Waldekranz <tobias@waldekranz.com> wrote: > >> > + TP_fast_assign( > >> > + __entry->val = val; > >> > + __assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)"); > >> > + __entry->info = info; > >> > + __entry->err = err; > >> > + switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX); > >> > >> Is it possible to just store the information in the trace event and then > >> call the above function in the read stage? > > > > I agree with Steven: it looks like that with the above code the > > tracepoint itself will become measurably costily in terms of CPU > > cycles: we want to avoid that. > > > > Perhaps using different tracepoints with different notifier_block type > > would help? so that each trace point could just copy a few specific > > fields. > > This can be done, but you will end up having to duplicate the decoding > and formatting logic from switchdev-str.c, with the additional hurdle of > having to figure out the sizes of all referenced objects in order to > create flattened versions of every notification type. Would it help if you could pass a trace_seq to it? The TP_printk() has a "magical" trace_seq variable that trace events can use in the TP_printk() called "p". Look at: include/trace/events/libata.h: const char *libata_trace_parse_status(struct trace_seq*, unsigned char); #define __parse_status(s) libata_trace_parse_status(p, s) Where we have: const char * libata_trace_parse_status(struct trace_seq *p, unsigned char status) { const char *ret = trace_seq_buffer_ptr(p); trace_seq_printf(p, "{ "); if (status & ATA_BUSY) trace_seq_printf(p, "BUSY "); if (status & ATA_DRDY) trace_seq_printf(p, "DRDY "); if (status & ATA_DF) trace_seq_printf(p, "DF "); if (status & ATA_DSC) trace_seq_printf(p, "DSC "); if (status & ATA_DRQ) trace_seq_printf(p, "DRQ "); if (status & ATA_CORR) trace_seq_printf(p, "CORR "); if (status & ATA_SENSE) trace_seq_printf(p, "SENSE "); if (status & ATA_ERR) trace_seq_printf(p, "ERR "); trace_seq_putc(p, '}'); trace_seq_putc(p, 0); return ret; } The "trace_seq p" is a pointer to trace_seq descriptor that can build strings, and then you can use it to print a custom string in the trace output. > > What I like about the current approach is that when new notification and > object types are added, switchdev_notifier_str will automatically be > able to decode them and give you some rough idea of what is going on, > even if no new message specific decoding logic is added. It is also > reusable by drivers that might want to decode notifications or objects > in error messages. > > Would some variant of (how I understand) Steven's suggestion to instead > store the formatted message in a dynamic array (__assign_str()), rather > than in the tracepoint entry, be acceptable? Matters if you could adapt using a trace_seq for the output. Or at least use a seq_buf, as that's what is under the covers of trace_seq. If you rather just use seq_buf, the above could pretty much be the same by passing in: &p->seq. -- Steve
On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote: > On Wed, 28 Feb 2024 11:47:24 +0100 > Tobias Waldekranz <tobias@waldekranz.com> wrote: > >> >> > + TP_fast_assign( >> >> > + __entry->val = val; >> >> > + __assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)"); >> >> > + __entry->info = info; >> >> > + __entry->err = err; >> >> > + switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX); >> >> >> >> Is it possible to just store the information in the trace event and then >> >> call the above function in the read stage? >> > >> > I agree with Steven: it looks like that with the above code the >> > tracepoint itself will become measurably costily in terms of CPU >> > cycles: we want to avoid that. >> > >> > Perhaps using different tracepoints with different notifier_block type >> > would help? so that each trace point could just copy a few specific >> > fields. >> >> This can be done, but you will end up having to duplicate the decoding >> and formatting logic from switchdev-str.c, with the additional hurdle of >> having to figure out the sizes of all referenced objects in order to >> create flattened versions of every notification type. > > Would it help if you could pass a trace_seq to it? The TP_printk() has a > "magical" trace_seq variable that trace events can use in the TP_printk() > called "p". > > Look at: > > include/trace/events/libata.h: > > const char *libata_trace_parse_status(struct trace_seq*, unsigned char); > #define __parse_status(s) libata_trace_parse_status(p, s) > > Where we have: > > const char * > libata_trace_parse_status(struct trace_seq *p, unsigned char status) > { > const char *ret = trace_seq_buffer_ptr(p); > > trace_seq_printf(p, "{ "); > if (status & ATA_BUSY) > trace_seq_printf(p, "BUSY "); > if (status & ATA_DRDY) > trace_seq_printf(p, "DRDY "); > if (status & ATA_DF) > trace_seq_printf(p, "DF "); > if (status & ATA_DSC) > trace_seq_printf(p, "DSC "); > if (status & ATA_DRQ) > trace_seq_printf(p, "DRQ "); > if (status & ATA_CORR) > trace_seq_printf(p, "CORR "); > if (status & ATA_SENSE) > trace_seq_printf(p, "SENSE "); > if (status & ATA_ERR) > trace_seq_printf(p, "ERR "); > trace_seq_putc(p, '}'); > trace_seq_putc(p, 0); > > return ret; > } > > The "trace_seq p" is a pointer to trace_seq descriptor that can build > strings, and then you can use it to print a custom string in the trace > output. Yes I managed to decode the hidden variable :) I also found trace_seq_acquire() (and its macro alter ego __get_buf()), which would let me keep the generic stringer functions. So far, so good. I think the foundational problem remains though: TP_printk() is not executed until a user reads from the trace_pipe; at which point the object referenced by __entry->info may already be dead and buried. Right? >> >> What I like about the current approach is that when new notification and >> object types are added, switchdev_notifier_str will automatically be >> able to decode them and give you some rough idea of what is going on, >> even if no new message specific decoding logic is added. It is also >> reusable by drivers that might want to decode notifications or objects >> in error messages. >> >> Would some variant of (how I understand) Steven's suggestion to instead >> store the formatted message in a dynamic array (__assign_str()), rather >> than in the tracepoint entry, be acceptable? > > Matters if you could adapt using a trace_seq for the output. Or at least > use a seq_buf, as that's what is under the covers of trace_seq. If you > rather just use seq_buf, the above could pretty much be the same by passing > in: &p->seq. > > -- Steve
On Mon, 04 Mar 2024 23:40:49 +0100 Tobias Waldekranz <tobias@waldekranz.com> wrote: > On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 28 Feb 2024 11:47:24 +0100 > > Tobias Waldekranz <tobias@waldekranz.com> wrote: > > > > The "trace_seq p" is a pointer to trace_seq descriptor that can build > > strings, and then you can use it to print a custom string in the trace > > output. > > Yes I managed to decode the hidden variable :) I also found > trace_seq_acquire() (and its macro alter ego __get_buf()), which would > let me keep the generic stringer functions. So far, so good. > > I think the foundational problem remains though: TP_printk() is not > executed until a user reads from the trace_pipe; at which point the > object referenced by __entry->info may already be dead and > buried. Right? Correct. You would need to load all the information into the event data itself, at the time of the event is triggered, that is needed to determine how to display it. -- Steve
On ons, mar 06, 2024 at 10:15, Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 04 Mar 2024 23:40:49 +0100 > Tobias Waldekranz <tobias@waldekranz.com> wrote: > >> On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote: >> > On Wed, 28 Feb 2024 11:47:24 +0100 >> > Tobias Waldekranz <tobias@waldekranz.com> wrote: >> > >> > The "trace_seq p" is a pointer to trace_seq descriptor that can build >> > strings, and then you can use it to print a custom string in the trace >> > output. >> >> Yes I managed to decode the hidden variable :) I also found >> trace_seq_acquire() (and its macro alter ego __get_buf()), which would >> let me keep the generic stringer functions. So far, so good. >> >> I think the foundational problem remains though: TP_printk() is not >> executed until a user reads from the trace_pipe; at which point the >> object referenced by __entry->info may already be dead and >> buried. Right? > > Correct. You would need to load all the information into the event data > itself, at the time of the event is triggered, that is needed to determine > how to display it. Given that that is quite gnarly to do for the events I'm trying to trace, because of the complex object graph, would it be acceptable to format the message in the assign phase and store it as dynamic data? I.e., what (I think) you suggested at the end of your first response. My thinking is: - Managing a duplicate (flattened) object graph, exclusively for use by these tracepoints, increases the effort to keep the tracing in sync with new additions to switchdev; which I think will result in developers simply avoiding it altogether. In other words: I'd rather have somewhat inefficient but simple flashlight, rather than a very efficient one that no one knows how to change the batteries in. - This is typically not a very hot path. Most events are triggered by user configuration. Otherwise when new neighbors are discovered. - __entry->info is still there for use by raw tracepoint consumers from userspace.
On Wed, 06 Mar 2024 21:02:06 +0100 Tobias Waldekranz <tobias@waldekranz.com> wrote: > On ons, mar 06, 2024 at 10:15, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Mon, 04 Mar 2024 23:40:49 +0100 > > Tobias Waldekranz <tobias@waldekranz.com> wrote: > > > >> On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote: > >> > On Wed, 28 Feb 2024 11:47:24 +0100 > >> > Tobias Waldekranz <tobias@waldekranz.com> wrote: > >> > > >> > The "trace_seq p" is a pointer to trace_seq descriptor that can build > >> > strings, and then you can use it to print a custom string in the trace > >> > output. > >> > >> Yes I managed to decode the hidden variable :) I also found > >> trace_seq_acquire() (and its macro alter ego __get_buf()), which would > >> let me keep the generic stringer functions. So far, so good. > >> > >> I think the foundational problem remains though: TP_printk() is not > >> executed until a user reads from the trace_pipe; at which point the > >> object referenced by __entry->info may already be dead and > >> buried. Right? > > > > Correct. You would need to load all the information into the event data > > itself, at the time of the event is triggered, that is needed to determine > > how to display it. > > Given that that is quite gnarly to do for the events I'm trying to > trace, because of the complex object graph, would it be acceptable to > format the message in the assign phase and store it as dynamic data? > I.e., what (I think) you suggested at the end of your first response. It's really up to what you want to do ;-) > > My thinking is: > > - Managing a duplicate (flattened) object graph, exclusively for use by > these tracepoints, increases the effort to keep the tracing in sync > with new additions to switchdev; which I think will result in > developers simply avoiding it altogether. In other words: I'd rather > have somewhat inefficient but simple flashlight, rather than a very > efficient one that no one knows how to change the batteries in. > > - This is typically not a very hot path. Most events are triggered by > user configuration. Otherwise when new neighbors are discovered. > > - __entry->info is still there for use by raw tracepoint consumers from > userspace. How big is this info? -- Steve
On ons, mar 06, 2024 at 16:46, Steven Rostedt <rostedt@goodmis.org> wrote: > On Wed, 06 Mar 2024 21:02:06 +0100 > Tobias Waldekranz <tobias@waldekranz.com> wrote: > >> On ons, mar 06, 2024 at 10:15, Steven Rostedt <rostedt@goodmis.org> wrote: >> > On Mon, 04 Mar 2024 23:40:49 +0100 >> > Tobias Waldekranz <tobias@waldekranz.com> wrote: >> > >> >> On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote: >> >> > On Wed, 28 Feb 2024 11:47:24 +0100 >> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote: >> >> > >> >> > The "trace_seq p" is a pointer to trace_seq descriptor that can build >> >> > strings, and then you can use it to print a custom string in the trace >> >> > output. >> >> >> >> Yes I managed to decode the hidden variable :) I also found >> >> trace_seq_acquire() (and its macro alter ego __get_buf()), which would >> >> let me keep the generic stringer functions. So far, so good. >> >> >> >> I think the foundational problem remains though: TP_printk() is not >> >> executed until a user reads from the trace_pipe; at which point the >> >> object referenced by __entry->info may already be dead and >> >> buried. Right? >> > >> > Correct. You would need to load all the information into the event data >> > itself, at the time of the event is triggered, that is needed to determine >> > how to display it. >> >> Given that that is quite gnarly to do for the events I'm trying to >> trace, because of the complex object graph, would it be acceptable to >> format the message in the assign phase and store it as dynamic data? >> I.e., what (I think) you suggested at the end of your first response. > > It's really up to what you want to do ;-) Alright. I'll interpret that as "there's a >0% chance that I'll give you an Acked-by on something like that" :) >> >> My thinking is: >> >> - Managing a duplicate (flattened) object graph, exclusively for use by >> these tracepoints, increases the effort to keep the tracing in sync >> with new additions to switchdev; which I think will result in >> developers simply avoiding it altogether. In other words: I'd rather >> have somewhat inefficient but simple flashlight, rather than a very >> efficient one that no one knows how to change the batteries in. >> >> - This is typically not a very hot path. Most events are triggered by >> user configuration. Otherwise when new neighbors are discovered. >> >> - __entry->info is still there for use by raw tracepoint consumers from >> userspace. > > How big is this info? The common struct (switchdev_notifier_info) is 24B at the moment. Depending on __entry->val, the size of the enclosing notification (e.g. switchdev_notifier_port_obj_info) is between 40-64B. This pattern may then repeat again inside the concrete notifier, where you have a pointer to a common object (e.g. switchdev_obj, 48B) whose outer size (e.g. switchdev_obj_port_vlan, 56B) is determined by an accompanying enum.
On Wed, 06 Mar 2024 23:45:57 +0100 Tobias Waldekranz <tobias@waldekranz.com> wrote: > > How big is this info? > > The common struct (switchdev_notifier_info) is 24B at the > moment. Depending on __entry->val, the size of the enclosing > notification (e.g. switchdev_notifier_port_obj_info) is between > 40-64B. This pattern may then repeat again inside the concrete notifier, > where you have a pointer to a common object (e.g. switchdev_obj, 48B) > whose outer size (e.g. switchdev_obj_port_vlan, 56B) is determined by an > accompanying enum. As long as it's under 1K, you should be good. -- Steve
diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h new file mode 100644 index 000000000000..dcaf6870d017 --- /dev/null +++ b/include/trace/events/switchdev.h @@ -0,0 +1,74 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM switchdev + +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_SWITCHDEV_H + +#include <linux/tracepoint.h> +#include <net/switchdev.h> + +#define SWITCHDEV_TRACE_MSG_MAX 128 + +DECLARE_EVENT_CLASS(switchdev_call, + TP_PROTO(unsigned long val, + const struct switchdev_notifier_info *info, + int err), + + TP_ARGS(val, info, err), + + TP_STRUCT__entry( + __field(unsigned long, val) + __string(dev, info->dev ? netdev_name(info->dev) : "(null)") + __field(const struct switchdev_notifier_info *, info) + __field(int, err) + __array(char, msg, SWITCHDEV_TRACE_MSG_MAX) + ), + + TP_fast_assign( + __entry->val = val; + __assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)"); + __entry->info = info; + __entry->err = err; + switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX); + ), + + TP_printk("dev %s %s -> %d", __get_str(dev), __entry->msg, __entry->err) +); + +DEFINE_EVENT(switchdev_call, switchdev_defer, + TP_PROTO(unsigned long val, + const struct switchdev_notifier_info *info, + int err), + + TP_ARGS(val, info, err) +); + +DEFINE_EVENT(switchdev_call, switchdev_call_atomic, + TP_PROTO(unsigned long val, + const struct switchdev_notifier_info *info, + int err), + + TP_ARGS(val, info, err) +); + +DEFINE_EVENT(switchdev_call, switchdev_call_blocking, + TP_PROTO(unsigned long val, + const struct switchdev_notifier_info *info, + int err), + + TP_ARGS(val, info, err) +); + +DEFINE_EVENT(switchdev_call, switchdev_call_replay, + TP_PROTO(unsigned long val, + const struct switchdev_notifier_info *info, + int err), + + TP_ARGS(val, info, err) +); + +#endif /* _TRACE_SWITCHDEV_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c index f73249269a87..74d715166981 100644 --- a/net/switchdev/switchdev.c +++ b/net/switchdev/switchdev.c @@ -19,6 +19,9 @@ #include <linux/rtnetlink.h> #include <net/switchdev.h> +#define CREATE_TRACE_POINTS +#include <trace/events/switchdev.h> + static bool switchdev_obj_eq(const struct switchdev_obj *a, const struct switchdev_obj *b) { @@ -180,8 +183,20 @@ static void switchdev_port_attr_set_deferred(struct net_device *dev, static int switchdev_port_attr_set_defer(struct net_device *dev, const struct switchdev_attr *attr) { - return switchdev_deferred_enqueue(dev, attr, sizeof(*attr), - switchdev_port_attr_set_deferred); + int err; + + err = switchdev_deferred_enqueue(dev, attr, sizeof(*attr), + switchdev_port_attr_set_deferred); + + if (trace_switchdev_defer_enabled()) { + struct switchdev_notifier_port_attr_info attr_info = { + .info.dev = dev, + .attr = attr, + }; + + trace_switchdev_defer(SWITCHDEV_PORT_ATTR_SET, &attr_info.info, err); + } + return err; } /** @@ -263,8 +278,20 @@ static void switchdev_port_obj_add_deferred(struct net_device *dev, static int switchdev_port_obj_add_defer(struct net_device *dev, const struct switchdev_obj *obj) { - return switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj), - switchdev_port_obj_add_deferred); + int err; + + err = switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj), + switchdev_port_obj_add_deferred); + + if (trace_switchdev_defer_enabled()) { + struct switchdev_notifier_port_obj_info obj_info = { + .info.dev = dev, + .obj = obj, + }; + + trace_switchdev_defer(SWITCHDEV_PORT_OBJ_ADD, &obj_info.info, err); + } + return err; } /** @@ -313,8 +340,20 @@ static void switchdev_port_obj_del_deferred(struct net_device *dev, static int switchdev_port_obj_del_defer(struct net_device *dev, const struct switchdev_obj *obj) { - return switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj), - switchdev_port_obj_del_deferred); + int err; + + err = switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj), + switchdev_port_obj_del_deferred); + + if (trace_switchdev_defer_enabled()) { + struct switchdev_notifier_port_obj_info obj_info = { + .info.dev = dev, + .obj = obj, + }; + + trace_switchdev_defer(SWITCHDEV_PORT_OBJ_DEL, &obj_info.info, err); + } + return err; } /** @@ -394,7 +433,11 @@ EXPORT_SYMBOL_GPL(switchdev_port_obj_act_is_deferred); int switchdev_call_replay(struct notifier_block *nb, unsigned long type, struct switchdev_notifier_info *info) { - return nb->notifier_call(nb, type, info); + int ret; + + ret = nb->notifier_call(nb, type, info); + trace_switchdev_call_replay(type, info, notifier_to_errno(ret)); + return ret; } EXPORT_SYMBOL_GPL(switchdev_call_replay); @@ -437,9 +480,13 @@ int call_switchdev_notifiers(unsigned long val, struct net_device *dev, struct switchdev_notifier_info *info, struct netlink_ext_ack *extack) { + int ret; + info->dev = dev; info->extack = extack; - return atomic_notifier_call_chain(&switchdev_notif_chain, val, info); + ret = atomic_notifier_call_chain(&switchdev_notif_chain, val, info); + trace_switchdev_call_atomic(val, info, notifier_to_errno(ret)); + return ret; } EXPORT_SYMBOL_GPL(call_switchdev_notifiers); @@ -463,10 +510,14 @@ int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev, struct switchdev_notifier_info *info, struct netlink_ext_ack *extack) { + int ret; + info->dev = dev; info->extack = extack; - return blocking_notifier_call_chain(&switchdev_blocking_notif_chain, - val, info); + ret = blocking_notifier_call_chain(&switchdev_blocking_notif_chain, + val, info); + trace_switchdev_call_blocking(val, info, notifier_to_errno(ret)); + return ret; } EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers);
Add a basic set of tracepoints: - switchdev_defer: Fires whenever an operation is enqueued to the switchdev workqueue for deferred delivery. - switchdev_call_{atomic,blocking}: Fires whenever a notification is sent to the corresponding switchdev notifier chain. - switchdev_call_replay: Fires whenever a notification is sent to a specific driver's notifier block, in response to a replay request. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> --- include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++ net/switchdev/switchdev.c | 71 +++++++++++++++++++++++++----- 2 files changed, 135 insertions(+), 10 deletions(-) create mode 100644 include/trace/events/switchdev.h