diff mbox series

[v3,net-next,4/4] net: switchdev: Add tracepoints

Message ID 20240223114453.335809-5-tobias@waldekranz.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series net: switchdev: Tracepoints | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 942 this patch: 942
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 3 maintainers not CCed: pabeni@redhat.com mathieu.desnoyers@efficios.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 958 this patch: 958
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 959 this patch: 959
netdev/checkpatch warning CHECK: Alignment should match open parenthesis CHECK: Lines should not end with a '(' WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-23--18-00 (tests: 1457)

Commit Message

Tobias Waldekranz Feb. 23, 2024, 11:44 a.m. UTC
Add a basic set of tracepoints:

- switchdev_defer: Fires whenever an operation is enqueued to the
  switchdev workqueue for deferred delivery.

- switchdev_call_{atomic,blocking}: Fires whenever a notification is
  sent to the corresponding switchdev notifier chain.

- switchdev_call_replay: Fires whenever a notification is sent to a
  specific driver's notifier block, in response to a replay request.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++
 net/switchdev/switchdev.c        | 71 +++++++++++++++++++++++++-----
 2 files changed, 135 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/switchdev.h

Comments

Steven Rostedt Feb. 23, 2024, 3:38 p.m. UTC | #1
On Fri, 23 Feb 2024 12:44:53 +0100
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> Add a basic set of tracepoints:
> 
> - switchdev_defer: Fires whenever an operation is enqueued to the
>   switchdev workqueue for deferred delivery.
> 
> - switchdev_call_{atomic,blocking}: Fires whenever a notification is
>   sent to the corresponding switchdev notifier chain.
> 
> - switchdev_call_replay: Fires whenever a notification is sent to a
>   specific driver's notifier block, in response to a replay request.
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
>  include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++
>  net/switchdev/switchdev.c        | 71 +++++++++++++++++++++++++-----
>  2 files changed, 135 insertions(+), 10 deletions(-)
>  create mode 100644 include/trace/events/switchdev.h
> 
> diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h
> new file mode 100644
> index 000000000000..dcaf6870d017
> --- /dev/null
> +++ b/include/trace/events/switchdev.h
> @@ -0,0 +1,74 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM	switchdev
> +
> +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_SWITCHDEV_H
> +
> +#include <linux/tracepoint.h>
> +#include <net/switchdev.h>
> +
> +#define SWITCHDEV_TRACE_MSG_MAX 128

128 bytes is awfully big to waste on the ring buffer. What's the average
size of a string?

> +
> +DECLARE_EVENT_CLASS(switchdev_call,
> +	TP_PROTO(unsigned long val,
> +		 const struct switchdev_notifier_info *info,
> +		 int err),
> +
> +	TP_ARGS(val, info, err),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, val)
> +		__string(dev, info->dev ? netdev_name(info->dev) : "(null)")
> +		__field(const struct switchdev_notifier_info *, info)
> +		__field(int, err)
> +		__array(char, msg, SWITCHDEV_TRACE_MSG_MAX)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->val = val;
> +		__assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)");
> +		__entry->info = info;
> +		__entry->err = err;
> +		switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX);

Is it possible to just store the information in the trace event and then
call the above function in the read stage? There's helpers to pass strings
around (namely a struct trace_seq *p).

It would require a plugin for libtraceevent if you want to expose it to
user space tools for trace-cmd and perf though.

Another possibility is if this event will not race with other events on he
same CPU, you could create a per-cpu buffer, write into that, and then use
__string() and __assign_str() to save it, as traces happen with preemption
disabled.

-- Steve

> +	),
> +
> +	TP_printk("dev %s %s -> %d", __get_str(dev), __entry->msg, __entry->err)
> +);
> +
> +DEFINE_EVENT(switchdev_call, switchdev_defer,
> +	TP_PROTO(unsigned long val,
> +		 const struct switchdev_notifier_info *info,
> +		 int err),
> +
> +	TP_ARGS(val, info, err)
> +);
> +
> +DEFINE_EVENT(switchdev_call, switchdev_call_atomic,
> +	TP_PROTO(unsigned long val,
> +		 const struct switchdev_notifier_info *info,
> +		 int err),
> +
> +	TP_ARGS(val, info, err)
> +);
> +
> +DEFINE_EVENT(switchdev_call, switchdev_call_blocking,
> +	TP_PROTO(unsigned long val,
> +		 const struct switchdev_notifier_info *info,
> +		 int err),
> +
> +	TP_ARGS(val, info, err)
> +);
> +
> +DEFINE_EVENT(switchdev_call, switchdev_call_replay,
> +	TP_PROTO(unsigned long val,
> +		 const struct switchdev_notifier_info *info,
> +		 int err),
> +
> +	TP_ARGS(val, info, err)
> +);
> +
> +#endif /* _TRACE_SWITCHDEV_H */
> +
Tobias Waldekranz Feb. 26, 2024, 1:05 p.m. UTC | #2
On fre, feb 23, 2024 at 10:38, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 23 Feb 2024 12:44:53 +0100
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> Add a basic set of tracepoints:
>> 
>> - switchdev_defer: Fires whenever an operation is enqueued to the
>>   switchdev workqueue for deferred delivery.
>> 
>> - switchdev_call_{atomic,blocking}: Fires whenever a notification is
>>   sent to the corresponding switchdev notifier chain.
>> 
>> - switchdev_call_replay: Fires whenever a notification is sent to a
>>   specific driver's notifier block, in response to a replay request.
>> 
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
>>  include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++
>>  net/switchdev/switchdev.c        | 71 +++++++++++++++++++++++++-----
>>  2 files changed, 135 insertions(+), 10 deletions(-)
>>  create mode 100644 include/trace/events/switchdev.h
>> 
>> diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h
>> new file mode 100644
>> index 000000000000..dcaf6870d017
>> --- /dev/null
>> +++ b/include/trace/events/switchdev.h
>> @@ -0,0 +1,74 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM	switchdev
>> +
>> +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_SWITCHDEV_H
>> +
>> +#include <linux/tracepoint.h>
>> +#include <net/switchdev.h>
>> +
>> +#define SWITCHDEV_TRACE_MSG_MAX 128
>
> 128 bytes is awfully big to waste on the ring buffer. What's the average
> size of a string?

I would say the typical message is around 60-80 bytes. Some common examples:

    PORT_OBJ_ADD PORT_VLAN(flags 0x0 orig br0) vid 1 flags 0x27
    PORT_OBJ_DEL HOST_MDB(flags 0x0 orig br0) vid 100 addr 33:33:ff:ff:00:09

The worst case I can think of currently is 95 characters:

    VXLAN_FDB_ADD_TO_DEVICE vid 1000 addr de:ad:be:ef:00:01 added_by_user is_local locked offloaded

>> +
>> +DECLARE_EVENT_CLASS(switchdev_call,
>> +	TP_PROTO(unsigned long val,
>> +		 const struct switchdev_notifier_info *info,
>> +		 int err),
>> +
>> +	TP_ARGS(val, info, err),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(unsigned long, val)
>> +		__string(dev, info->dev ? netdev_name(info->dev) : "(null)")
>> +		__field(const struct switchdev_notifier_info *, info)
>> +		__field(int, err)
>> +		__array(char, msg, SWITCHDEV_TRACE_MSG_MAX)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->val = val;
>> +		__assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)");
>> +		__entry->info = info;
>> +		__entry->err = err;
>> +		switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX);
>
> Is it possible to just store the information in the trace event and then
> call the above function in the read stage? There's helpers to pass strings
> around (namely a struct trace_seq *p).

I'm a complete novice when it comes to tracepoint internals. Am I right
in assuming that TP_printk may execute at a much later time than
TP_fast_assign? E.g., the object referenced by __entry->info may very
well have been freed by that time? That at least seems to be what my
naive refactor to replace __entry->msg with __get_buf() suggests :)

If so, the layout of the switchdev_notifier_* structs makes it a bit
cumbersome to clone the notification in the assign phase, as the size of
a specific notification (e.g., switchdev_notifier_port_obj_info) is not
known by the common notification (switchdev_notifier_info). In the case
of switchdev objects, the problem repeats a second time. E.g., the size
of switchdev_obj_port_vlan is not known by switchdev_obj.

> It would require a plugin for libtraceevent if you want to expose it to
> user space tools for trace-cmd and perf though.
>
> Another possibility is if this event will not race with other events on he
> same CPU, you could create a per-cpu buffer, write into that, and then use
> __string() and __assign_str() to save it, as traces happen with preemption
> disabled.

But bottom halves are still enabled I suppose? Notifications can be
generated both from process context (e.g., users configuring the bridge
with iproute2), and from bridge packet processing (e.g., adding new
neighbors to the FDB). So I don't think that would work in this case.

> -- Steve

Thanks for taking the time!

>> +	),
>> +
>> +	TP_printk("dev %s %s -> %d", __get_str(dev), __entry->msg, __entry->err)
>> +);
>> +
>> +DEFINE_EVENT(switchdev_call, switchdev_defer,
>> +	TP_PROTO(unsigned long val,
>> +		 const struct switchdev_notifier_info *info,
>> +		 int err),
>> +
>> +	TP_ARGS(val, info, err)
>> +);
>> +
>> +DEFINE_EVENT(switchdev_call, switchdev_call_atomic,
>> +	TP_PROTO(unsigned long val,
>> +		 const struct switchdev_notifier_info *info,
>> +		 int err),
>> +
>> +	TP_ARGS(val, info, err)
>> +);
>> +
>> +DEFINE_EVENT(switchdev_call, switchdev_call_blocking,
>> +	TP_PROTO(unsigned long val,
>> +		 const struct switchdev_notifier_info *info,
>> +		 int err),
>> +
>> +	TP_ARGS(val, info, err)
>> +);
>> +
>> +DEFINE_EVENT(switchdev_call, switchdev_call_replay,
>> +	TP_PROTO(unsigned long val,
>> +		 const struct switchdev_notifier_info *info,
>> +		 int err),
>> +
>> +	TP_ARGS(val, info, err)
>> +);
>> +
>> +#endif /* _TRACE_SWITCHDEV_H */
>> +
Paolo Abeni Feb. 27, 2024, 10:04 a.m. UTC | #3
On Fri, 2024-02-23 at 10:38 -0500, Steven Rostedt wrote:
> On Fri, 23 Feb 2024 12:44:53 +0100
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
> 
> > Add a basic set of tracepoints:
> > 
> > - switchdev_defer: Fires whenever an operation is enqueued to the
> >   switchdev workqueue for deferred delivery.
> > 
> > - switchdev_call_{atomic,blocking}: Fires whenever a notification is
> >   sent to the corresponding switchdev notifier chain.
> > 
> > - switchdev_call_replay: Fires whenever a notification is sent to a
> >   specific driver's notifier block, in response to a replay request.
> > 
> > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> > ---
> >  include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++
> >  net/switchdev/switchdev.c        | 71 +++++++++++++++++++++++++-----
> >  2 files changed, 135 insertions(+), 10 deletions(-)
> >  create mode 100644 include/trace/events/switchdev.h
> > 
> > diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h
> > new file mode 100644
> > index 000000000000..dcaf6870d017
> > --- /dev/null
> > +++ b/include/trace/events/switchdev.h
> > @@ -0,0 +1,74 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM	switchdev
> > +
> > +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_SWITCHDEV_H
> > +
> > +#include <linux/tracepoint.h>
> > +#include <net/switchdev.h>
> > +
> > +#define SWITCHDEV_TRACE_MSG_MAX 128
> 
> 128 bytes is awfully big to waste on the ring buffer. What's the average
> size of a string?
> 
> > +
> > +DECLARE_EVENT_CLASS(switchdev_call,
> > +	TP_PROTO(unsigned long val,
> > +		 const struct switchdev_notifier_info *info,
> > +		 int err),
> > +
> > +	TP_ARGS(val, info, err),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(unsigned long, val)
> > +		__string(dev, info->dev ? netdev_name(info->dev) : "(null)")
> > +		__field(const struct switchdev_notifier_info *, info)
> > +		__field(int, err)
> > +		__array(char, msg, SWITCHDEV_TRACE_MSG_MAX)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->val = val;
> > +		__assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)");
> > +		__entry->info = info;
> > +		__entry->err = err;
> > +		switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX);
> 
> Is it possible to just store the information in the trace event and then
> call the above function in the read stage?

I agree with Steven: it looks like that with the above code the
tracepoint itself will become measurably costily in terms of CPU
cycles: we want to avoid that.

Perhaps using different tracepoints with different notifier_block type
would help? so that each trace point could just copy a few specific
fields.

Cheers,

Paolo
Tobias Waldekranz Feb. 28, 2024, 10:47 a.m. UTC | #4
On tis, feb 27, 2024 at 11:04, Paolo Abeni <pabeni@redhat.com> wrote:
> On Fri, 2024-02-23 at 10:38 -0500, Steven Rostedt wrote:
>> On Fri, 23 Feb 2024 12:44:53 +0100
>> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> 
>> > Add a basic set of tracepoints:
>> > 
>> > - switchdev_defer: Fires whenever an operation is enqueued to the
>> >   switchdev workqueue for deferred delivery.
>> > 
>> > - switchdev_call_{atomic,blocking}: Fires whenever a notification is
>> >   sent to the corresponding switchdev notifier chain.
>> > 
>> > - switchdev_call_replay: Fires whenever a notification is sent to a
>> >   specific driver's notifier block, in response to a replay request.
>> > 
>> > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> > ---
>> >  include/trace/events/switchdev.h | 74 ++++++++++++++++++++++++++++++++
>> >  net/switchdev/switchdev.c        | 71 +++++++++++++++++++++++++-----
>> >  2 files changed, 135 insertions(+), 10 deletions(-)
>> >  create mode 100644 include/trace/events/switchdev.h
>> > 
>> > diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h
>> > new file mode 100644
>> > index 000000000000..dcaf6870d017
>> > --- /dev/null
>> > +++ b/include/trace/events/switchdev.h
>> > @@ -0,0 +1,74 @@
>> > +/* SPDX-License-Identifier: GPL-2.0 */
>> > +#undef TRACE_SYSTEM
>> > +#define TRACE_SYSTEM	switchdev
>> > +
>> > +#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ)
>> > +#define _TRACE_SWITCHDEV_H
>> > +
>> > +#include <linux/tracepoint.h>
>> > +#include <net/switchdev.h>
>> > +
>> > +#define SWITCHDEV_TRACE_MSG_MAX 128
>> 
>> 128 bytes is awfully big to waste on the ring buffer. What's the average
>> size of a string?
>> 
>> > +
>> > +DECLARE_EVENT_CLASS(switchdev_call,
>> > +	TP_PROTO(unsigned long val,
>> > +		 const struct switchdev_notifier_info *info,
>> > +		 int err),
>> > +
>> > +	TP_ARGS(val, info, err),
>> > +
>> > +	TP_STRUCT__entry(
>> > +		__field(unsigned long, val)
>> > +		__string(dev, info->dev ? netdev_name(info->dev) : "(null)")
>> > +		__field(const struct switchdev_notifier_info *, info)
>> > +		__field(int, err)
>> > +		__array(char, msg, SWITCHDEV_TRACE_MSG_MAX)
>> > +	),
>> > +
>> > +	TP_fast_assign(
>> > +		__entry->val = val;
>> > +		__assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)");
>> > +		__entry->info = info;
>> > +		__entry->err = err;
>> > +		switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX);
>> 
>> Is it possible to just store the information in the trace event and then
>> call the above function in the read stage?
>
> I agree with Steven: it looks like that with the above code the
> tracepoint itself will become measurably costily in terms of CPU
> cycles: we want to avoid that.
>
> Perhaps using different tracepoints with different notifier_block type
> would help? so that each trace point could just copy a few specific
> fields.

This can be done, but you will end up having to duplicate the decoding
and formatting logic from switchdev-str.c, with the additional hurdle of
having to figure out the sizes of all referenced objects in order to
create flattened versions of every notification type.

What I like about the current approach is that when new notification and
object types are added, switchdev_notifier_str will automatically be
able to decode them and give you some rough idea of what is going on,
even if no new message specific decoding logic is added. It is also
reusable by drivers that might want to decode notifications or objects
in error messages.

Would some variant of (how I understand) Steven's suggestion to instead
store the formatted message in a dynamic array (__assign_str()), rather
than in the tracepoint entry, be acceptable?
Steven Rostedt Feb. 28, 2024, 2:56 p.m. UTC | #5
On Wed, 28 Feb 2024 11:47:24 +0100
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> >> > +	TP_fast_assign(
> >> > +		__entry->val = val;
> >> > +		__assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)");
> >> > +		__entry->info = info;
> >> > +		__entry->err = err;
> >> > +		switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX);  
> >> 
> >> Is it possible to just store the information in the trace event and then
> >> call the above function in the read stage?  
> >
> > I agree with Steven: it looks like that with the above code the
> > tracepoint itself will become measurably costily in terms of CPU
> > cycles: we want to avoid that.
> >
> > Perhaps using different tracepoints with different notifier_block type
> > would help? so that each trace point could just copy a few specific
> > fields.  
> 
> This can be done, but you will end up having to duplicate the decoding
> and formatting logic from switchdev-str.c, with the additional hurdle of
> having to figure out the sizes of all referenced objects in order to
> create flattened versions of every notification type.

Would it help if you could pass a trace_seq to it? The TP_printk() has a
"magical" trace_seq variable that trace events can use in the TP_printk()
called "p".

Look at:

  include/trace/events/libata.h:

const char *libata_trace_parse_status(struct trace_seq*, unsigned char);
#define __parse_status(s) libata_trace_parse_status(p, s)

Where we have:

const char *
libata_trace_parse_status(struct trace_seq *p, unsigned char status)
{
	const char *ret = trace_seq_buffer_ptr(p);

	trace_seq_printf(p, "{ ");
	if (status & ATA_BUSY)
		trace_seq_printf(p, "BUSY ");
	if (status & ATA_DRDY)
		trace_seq_printf(p, "DRDY ");
	if (status & ATA_DF)
		trace_seq_printf(p, "DF ");
	if (status & ATA_DSC)
		trace_seq_printf(p, "DSC ");
	if (status & ATA_DRQ)
		trace_seq_printf(p, "DRQ ");
	if (status & ATA_CORR)
		trace_seq_printf(p, "CORR ");
	if (status & ATA_SENSE)
		trace_seq_printf(p, "SENSE ");
	if (status & ATA_ERR)
		trace_seq_printf(p, "ERR ");
	trace_seq_putc(p, '}');
	trace_seq_putc(p, 0);

	return ret;
}

The "trace_seq p" is a pointer to trace_seq descriptor that can build
strings, and then you can use it to print a custom string in the trace
output.



> 
> What I like about the current approach is that when new notification and
> object types are added, switchdev_notifier_str will automatically be
> able to decode them and give you some rough idea of what is going on,
> even if no new message specific decoding logic is added. It is also
> reusable by drivers that might want to decode notifications or objects
> in error messages.
> 
> Would some variant of (how I understand) Steven's suggestion to instead
> store the formatted message in a dynamic array (__assign_str()), rather
> than in the tracepoint entry, be acceptable?

Matters if you could adapt using a trace_seq for the output. Or at least
use a seq_buf, as that's what is under the covers of trace_seq. If you
rather just use seq_buf, the above could pretty much be the same by passing
in: &p->seq.

-- Steve
Tobias Waldekranz March 4, 2024, 10:40 p.m. UTC | #6
On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 28 Feb 2024 11:47:24 +0100
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> >> > +	TP_fast_assign(
>> >> > +		__entry->val = val;
>> >> > +		__assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)");
>> >> > +		__entry->info = info;
>> >> > +		__entry->err = err;
>> >> > +		switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX);  
>> >> 
>> >> Is it possible to just store the information in the trace event and then
>> >> call the above function in the read stage?  
>> >
>> > I agree with Steven: it looks like that with the above code the
>> > tracepoint itself will become measurably costily in terms of CPU
>> > cycles: we want to avoid that.
>> >
>> > Perhaps using different tracepoints with different notifier_block type
>> > would help? so that each trace point could just copy a few specific
>> > fields.  
>> 
>> This can be done, but you will end up having to duplicate the decoding
>> and formatting logic from switchdev-str.c, with the additional hurdle of
>> having to figure out the sizes of all referenced objects in order to
>> create flattened versions of every notification type.
>
> Would it help if you could pass a trace_seq to it? The TP_printk() has a
> "magical" trace_seq variable that trace events can use in the TP_printk()
> called "p".
>
> Look at:
>
>   include/trace/events/libata.h:
>
> const char *libata_trace_parse_status(struct trace_seq*, unsigned char);
> #define __parse_status(s) libata_trace_parse_status(p, s)
>
> Where we have:
>
> const char *
> libata_trace_parse_status(struct trace_seq *p, unsigned char status)
> {
> 	const char *ret = trace_seq_buffer_ptr(p);
>
> 	trace_seq_printf(p, "{ ");
> 	if (status & ATA_BUSY)
> 		trace_seq_printf(p, "BUSY ");
> 	if (status & ATA_DRDY)
> 		trace_seq_printf(p, "DRDY ");
> 	if (status & ATA_DF)
> 		trace_seq_printf(p, "DF ");
> 	if (status & ATA_DSC)
> 		trace_seq_printf(p, "DSC ");
> 	if (status & ATA_DRQ)
> 		trace_seq_printf(p, "DRQ ");
> 	if (status & ATA_CORR)
> 		trace_seq_printf(p, "CORR ");
> 	if (status & ATA_SENSE)
> 		trace_seq_printf(p, "SENSE ");
> 	if (status & ATA_ERR)
> 		trace_seq_printf(p, "ERR ");
> 	trace_seq_putc(p, '}');
> 	trace_seq_putc(p, 0);
>
> 	return ret;
> }
>
> The "trace_seq p" is a pointer to trace_seq descriptor that can build
> strings, and then you can use it to print a custom string in the trace
> output.

Yes I managed to decode the hidden variable :) I also found
trace_seq_acquire() (and its macro alter ego __get_buf()), which would
let me keep the generic stringer functions. So far, so good.

I think the foundational problem remains though: TP_printk() is not
executed until a user reads from the trace_pipe; at which point the
object referenced by __entry->info may already be dead and
buried. Right?

>> 
>> What I like about the current approach is that when new notification and
>> object types are added, switchdev_notifier_str will automatically be
>> able to decode them and give you some rough idea of what is going on,
>> even if no new message specific decoding logic is added. It is also
>> reusable by drivers that might want to decode notifications or objects
>> in error messages.
>> 
>> Would some variant of (how I understand) Steven's suggestion to instead
>> store the formatted message in a dynamic array (__assign_str()), rather
>> than in the tracepoint entry, be acceptable?
>
> Matters if you could adapt using a trace_seq for the output. Or at least
> use a seq_buf, as that's what is under the covers of trace_seq. If you
> rather just use seq_buf, the above could pretty much be the same by passing
> in: &p->seq.
>
> -- Steve
Steven Rostedt March 6, 2024, 3:15 p.m. UTC | #7
On Mon, 04 Mar 2024 23:40:49 +0100
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Wed, 28 Feb 2024 11:47:24 +0100
> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >  
> > The "trace_seq p" is a pointer to trace_seq descriptor that can build
> > strings, and then you can use it to print a custom string in the trace
> > output.  
> 
> Yes I managed to decode the hidden variable :) I also found
> trace_seq_acquire() (and its macro alter ego __get_buf()), which would
> let me keep the generic stringer functions. So far, so good.
> 
> I think the foundational problem remains though: TP_printk() is not
> executed until a user reads from the trace_pipe; at which point the
> object referenced by __entry->info may already be dead and
> buried. Right?

Correct. You would need to load all the information into the event data
itself, at the time of the event is triggered, that is needed to determine
how to display it.

-- Steve
Tobias Waldekranz March 6, 2024, 8:02 p.m. UTC | #8
On ons, mar 06, 2024 at 10:15, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 04 Mar 2024 23:40:49 +0100
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Wed, 28 Feb 2024 11:47:24 +0100
>> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >  
>> > The "trace_seq p" is a pointer to trace_seq descriptor that can build
>> > strings, and then you can use it to print a custom string in the trace
>> > output.  
>> 
>> Yes I managed to decode the hidden variable :) I also found
>> trace_seq_acquire() (and its macro alter ego __get_buf()), which would
>> let me keep the generic stringer functions. So far, so good.
>> 
>> I think the foundational problem remains though: TP_printk() is not
>> executed until a user reads from the trace_pipe; at which point the
>> object referenced by __entry->info may already be dead and
>> buried. Right?
>
> Correct. You would need to load all the information into the event data
> itself, at the time of the event is triggered, that is needed to determine
> how to display it.

Given that that is quite gnarly to do for the events I'm trying to
trace, because of the complex object graph, would it be acceptable to
format the message in the assign phase and store it as dynamic data?
I.e., what (I think) you suggested at the end of your first response.

My thinking is:

- Managing a duplicate (flattened) object graph, exclusively for use by
  these tracepoints, increases the effort to keep the tracing in sync
  with new additions to switchdev; which I think will result in
  developers simply avoiding it altogether. In other words: I'd rather
  have somewhat inefficient but simple flashlight, rather than a very
  efficient one that no one knows how to change the batteries in.

- This is typically not a very hot path. Most events are triggered by
  user configuration. Otherwise when new neighbors are discovered.

- __entry->info is still there for use by raw tracepoint consumers from
  userspace.
Steven Rostedt March 6, 2024, 9:46 p.m. UTC | #9
On Wed, 06 Mar 2024 21:02:06 +0100
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> On ons, mar 06, 2024 at 10:15, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Mon, 04 Mar 2024 23:40:49 +0100
> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >  
> >> On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote:  
> >> > On Wed, 28 Feb 2024 11:47:24 +0100
> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >> >  
> >> > The "trace_seq p" is a pointer to trace_seq descriptor that can build
> >> > strings, and then you can use it to print a custom string in the trace
> >> > output.    
> >> 
> >> Yes I managed to decode the hidden variable :) I also found
> >> trace_seq_acquire() (and its macro alter ego __get_buf()), which would
> >> let me keep the generic stringer functions. So far, so good.
> >> 
> >> I think the foundational problem remains though: TP_printk() is not
> >> executed until a user reads from the trace_pipe; at which point the
> >> object referenced by __entry->info may already be dead and
> >> buried. Right?  
> >
> > Correct. You would need to load all the information into the event data
> > itself, at the time of the event is triggered, that is needed to determine
> > how to display it.  
> 
> Given that that is quite gnarly to do for the events I'm trying to
> trace, because of the complex object graph, would it be acceptable to
> format the message in the assign phase and store it as dynamic data?
> I.e., what (I think) you suggested at the end of your first response.

It's really up to what you want to do ;-)

> 
> My thinking is:
> 
> - Managing a duplicate (flattened) object graph, exclusively for use by
>   these tracepoints, increases the effort to keep the tracing in sync
>   with new additions to switchdev; which I think will result in
>   developers simply avoiding it altogether. In other words: I'd rather
>   have somewhat inefficient but simple flashlight, rather than a very
>   efficient one that no one knows how to change the batteries in.
> 
> - This is typically not a very hot path. Most events are triggered by
>   user configuration. Otherwise when new neighbors are discovered.
> 
> - __entry->info is still there for use by raw tracepoint consumers from
>   userspace.

How big is this info?

-- Steve
Tobias Waldekranz March 6, 2024, 10:45 p.m. UTC | #10
On ons, mar 06, 2024 at 16:46, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 06 Mar 2024 21:02:06 +0100
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> On ons, mar 06, 2024 at 10:15, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Mon, 04 Mar 2024 23:40:49 +0100
>> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >  
>> >> On ons, feb 28, 2024 at 09:56, Steven Rostedt <rostedt@goodmis.org> wrote:  
>> >> > On Wed, 28 Feb 2024 11:47:24 +0100
>> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >> >  
>> >> > The "trace_seq p" is a pointer to trace_seq descriptor that can build
>> >> > strings, and then you can use it to print a custom string in the trace
>> >> > output.    
>> >> 
>> >> Yes I managed to decode the hidden variable :) I also found
>> >> trace_seq_acquire() (and its macro alter ego __get_buf()), which would
>> >> let me keep the generic stringer functions. So far, so good.
>> >> 
>> >> I think the foundational problem remains though: TP_printk() is not
>> >> executed until a user reads from the trace_pipe; at which point the
>> >> object referenced by __entry->info may already be dead and
>> >> buried. Right?  
>> >
>> > Correct. You would need to load all the information into the event data
>> > itself, at the time of the event is triggered, that is needed to determine
>> > how to display it.  
>> 
>> Given that that is quite gnarly to do for the events I'm trying to
>> trace, because of the complex object graph, would it be acceptable to
>> format the message in the assign phase and store it as dynamic data?
>> I.e., what (I think) you suggested at the end of your first response.
>
> It's really up to what you want to do ;-)

Alright. I'll interpret that as "there's a >0% chance that I'll give you
an Acked-by on something like that" :)

>> 
>> My thinking is:
>> 
>> - Managing a duplicate (flattened) object graph, exclusively for use by
>>   these tracepoints, increases the effort to keep the tracing in sync
>>   with new additions to switchdev; which I think will result in
>>   developers simply avoiding it altogether. In other words: I'd rather
>>   have somewhat inefficient but simple flashlight, rather than a very
>>   efficient one that no one knows how to change the batteries in.
>> 
>> - This is typically not a very hot path. Most events are triggered by
>>   user configuration. Otherwise when new neighbors are discovered.
>> 
>> - __entry->info is still there for use by raw tracepoint consumers from
>>   userspace.
>
> How big is this info?

The common struct (switchdev_notifier_info) is 24B at the
moment. Depending on __entry->val, the size of the enclosing
notification (e.g. switchdev_notifier_port_obj_info) is between
40-64B. This pattern may then repeat again inside the concrete notifier,
where you have a pointer to a common object (e.g. switchdev_obj, 48B)
whose outer size (e.g. switchdev_obj_port_vlan, 56B) is determined by an
accompanying enum.
Steven Rostedt March 6, 2024, 11:33 p.m. UTC | #11
On Wed, 06 Mar 2024 23:45:57 +0100
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> > How big is this info?  
> 
> The common struct (switchdev_notifier_info) is 24B at the
> moment. Depending on __entry->val, the size of the enclosing
> notification (e.g. switchdev_notifier_port_obj_info) is between
> 40-64B. This pattern may then repeat again inside the concrete notifier,
> where you have a pointer to a common object (e.g. switchdev_obj, 48B)
> whose outer size (e.g. switchdev_obj_port_vlan, 56B) is determined by an
> accompanying enum.

As long as it's under 1K, you should be good.

-- Steve
diff mbox series

Patch

diff --git a/include/trace/events/switchdev.h b/include/trace/events/switchdev.h
new file mode 100644
index 000000000000..dcaf6870d017
--- /dev/null
+++ b/include/trace/events/switchdev.h
@@ -0,0 +1,74 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM	switchdev
+
+#if !defined(_TRACE_SWITCHDEV_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_SWITCHDEV_H
+
+#include <linux/tracepoint.h>
+#include <net/switchdev.h>
+
+#define SWITCHDEV_TRACE_MSG_MAX 128
+
+DECLARE_EVENT_CLASS(switchdev_call,
+	TP_PROTO(unsigned long val,
+		 const struct switchdev_notifier_info *info,
+		 int err),
+
+	TP_ARGS(val, info, err),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, val)
+		__string(dev, info->dev ? netdev_name(info->dev) : "(null)")
+		__field(const struct switchdev_notifier_info *, info)
+		__field(int, err)
+		__array(char, msg, SWITCHDEV_TRACE_MSG_MAX)
+	),
+
+	TP_fast_assign(
+		__entry->val = val;
+		__assign_str(dev, info->dev ? netdev_name(info->dev) : "(null)");
+		__entry->info = info;
+		__entry->err = err;
+		switchdev_notifier_str(val, info, __entry->msg, SWITCHDEV_TRACE_MSG_MAX);
+	),
+
+	TP_printk("dev %s %s -> %d", __get_str(dev), __entry->msg, __entry->err)
+);
+
+DEFINE_EVENT(switchdev_call, switchdev_defer,
+	TP_PROTO(unsigned long val,
+		 const struct switchdev_notifier_info *info,
+		 int err),
+
+	TP_ARGS(val, info, err)
+);
+
+DEFINE_EVENT(switchdev_call, switchdev_call_atomic,
+	TP_PROTO(unsigned long val,
+		 const struct switchdev_notifier_info *info,
+		 int err),
+
+	TP_ARGS(val, info, err)
+);
+
+DEFINE_EVENT(switchdev_call, switchdev_call_blocking,
+	TP_PROTO(unsigned long val,
+		 const struct switchdev_notifier_info *info,
+		 int err),
+
+	TP_ARGS(val, info, err)
+);
+
+DEFINE_EVENT(switchdev_call, switchdev_call_replay,
+	TP_PROTO(unsigned long val,
+		 const struct switchdev_notifier_info *info,
+		 int err),
+
+	TP_ARGS(val, info, err)
+);
+
+#endif /* _TRACE_SWITCHDEV_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index f73249269a87..74d715166981 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -19,6 +19,9 @@ 
 #include <linux/rtnetlink.h>
 #include <net/switchdev.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/switchdev.h>
+
 static bool switchdev_obj_eq(const struct switchdev_obj *a,
 			     const struct switchdev_obj *b)
 {
@@ -180,8 +183,20 @@  static void switchdev_port_attr_set_deferred(struct net_device *dev,
 static int switchdev_port_attr_set_defer(struct net_device *dev,
 					 const struct switchdev_attr *attr)
 {
-	return switchdev_deferred_enqueue(dev, attr, sizeof(*attr),
-					  switchdev_port_attr_set_deferred);
+	int err;
+
+	err = switchdev_deferred_enqueue(dev, attr, sizeof(*attr),
+					 switchdev_port_attr_set_deferred);
+
+	if (trace_switchdev_defer_enabled()) {
+		struct switchdev_notifier_port_attr_info attr_info = {
+			.info.dev = dev,
+			.attr = attr,
+		};
+
+		trace_switchdev_defer(SWITCHDEV_PORT_ATTR_SET, &attr_info.info, err);
+	}
+	return err;
 }
 
 /**
@@ -263,8 +278,20 @@  static void switchdev_port_obj_add_deferred(struct net_device *dev,
 static int switchdev_port_obj_add_defer(struct net_device *dev,
 					const struct switchdev_obj *obj)
 {
-	return switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj),
-					  switchdev_port_obj_add_deferred);
+	int err;
+
+	err = switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj),
+					 switchdev_port_obj_add_deferred);
+
+	if (trace_switchdev_defer_enabled()) {
+		struct switchdev_notifier_port_obj_info obj_info = {
+			.info.dev = dev,
+			.obj = obj,
+		};
+
+		trace_switchdev_defer(SWITCHDEV_PORT_OBJ_ADD, &obj_info.info, err);
+	}
+	return err;
 }
 
 /**
@@ -313,8 +340,20 @@  static void switchdev_port_obj_del_deferred(struct net_device *dev,
 static int switchdev_port_obj_del_defer(struct net_device *dev,
 					const struct switchdev_obj *obj)
 {
-	return switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj),
-					  switchdev_port_obj_del_deferred);
+	int err;
+
+	err = switchdev_deferred_enqueue(dev, obj, switchdev_obj_size(obj),
+					 switchdev_port_obj_del_deferred);
+
+	if (trace_switchdev_defer_enabled()) {
+		struct switchdev_notifier_port_obj_info obj_info = {
+			.info.dev = dev,
+			.obj = obj,
+		};
+
+		trace_switchdev_defer(SWITCHDEV_PORT_OBJ_DEL, &obj_info.info, err);
+	}
+	return err;
 }
 
 /**
@@ -394,7 +433,11 @@  EXPORT_SYMBOL_GPL(switchdev_port_obj_act_is_deferred);
 int switchdev_call_replay(struct notifier_block *nb, unsigned long type,
 			  struct switchdev_notifier_info *info)
 {
-	return nb->notifier_call(nb, type, info);
+	int ret;
+
+	ret = nb->notifier_call(nb, type, info);
+	trace_switchdev_call_replay(type, info, notifier_to_errno(ret));
+	return ret;
 }
 EXPORT_SYMBOL_GPL(switchdev_call_replay);
 
@@ -437,9 +480,13 @@  int call_switchdev_notifiers(unsigned long val, struct net_device *dev,
 			     struct switchdev_notifier_info *info,
 			     struct netlink_ext_ack *extack)
 {
+	int ret;
+
 	info->dev = dev;
 	info->extack = extack;
-	return atomic_notifier_call_chain(&switchdev_notif_chain, val, info);
+	ret = atomic_notifier_call_chain(&switchdev_notif_chain, val, info);
+	trace_switchdev_call_atomic(val, info, notifier_to_errno(ret));
+	return ret;
 }
 EXPORT_SYMBOL_GPL(call_switchdev_notifiers);
 
@@ -463,10 +510,14 @@  int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev,
 				      struct switchdev_notifier_info *info,
 				      struct netlink_ext_ack *extack)
 {
+	int ret;
+
 	info->dev = dev;
 	info->extack = extack;
-	return blocking_notifier_call_chain(&switchdev_blocking_notif_chain,
-					    val, info);
+	ret = blocking_notifier_call_chain(&switchdev_blocking_notif_chain,
+					   val, info);
+	trace_switchdev_call_blocking(val, info, notifier_to_errno(ret));
+	return ret;
 }
 EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers);