Message ID: 20221007154145.1877054-1-vschneid@redhat.com (mailing list archive)
Series: Generic IPI sending tracepoint
Hi Valentin,

On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote:
> Background
> ==========
>
> Detecting IPI *reception* is relatively easy, e.g. using
> trace_irq_handler_{entry,exit} or even just function-tracing
> flush_smp_call_function_queue() for SMP calls.
>
> Figuring out their *origin* is trickier, as there is no generic tracepoint tied
> to e.g. smp_call_function():
>
> o AFAIA x86 has no tracepoint tied to sending IPIs, only receiving them
>   (cf. trace_call_function{_single}_entry()).
> o arm/arm64 do have trace_ipi_raise(), which gives us the target cpus but also a
>   mostly useless string (smp_calls will all be "Function call interrupts").
> o Other architectures don't seem to have any IPI-sending related tracepoint.
>
> I believe one reason the arm/arm64 tracepoints ended up as they are is that
> these archs used to handle IPIs differently from regular interrupts
> (the IRQ driver would directly invoke an IPI-handling routine), which meant they
> never showed up in trace_irq_handler_{entry, exit}. The trace_ipi_{entry,exit}
> tracepoints gave a way to trace IPI reception, but those have become redundant as
> of:
>
>   56afcd3dbd19 ("ARM: Allow IPIs to be handled as normal interrupts")
>   d3afc7f12987 ("arm64: Allow IPIs to be handled as normal interrupts")
>
> which gave IPIs a "proper" handler function used through
> generic_handle_domain_irq(), and thus made them show up via
> trace_irq_handler_{entry, exit}.
>
> Changing stuff up
> =================
>
> Per the above, it would make sense to reshuffle trace_ipi_raise() and move it
> into generic code. This also came up during Daniel's talk on Osnoise at the CPU
> isolation MC of LPC 2022 [1].
>
> Now, to be useful, such a tracepoint needs to export:
> o targeted CPU(s)
> o calling context
>
> The only way to get the calling context with trace_ipi_raise() is to trigger a
> stack dump, e.g. $(trace-cmd -e ipi* -T echo 42).
>
> As for the targeted CPUs, the existing tracepoint does export them, albeit in
> cpumask form, which is quite inconvenient from a tooling perspective. For
> instance, as far as I'm aware, it's not possible to do event filtering on a
> cpumask via trace-cmd.

https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html

   -f filter
       Specify a filter for the previous event. This must come after
       a -e. This will filter what events get recorded based on the
       content of the event. Filtering is passed to the kernel
       directly so what filtering is allowed may depend on what
       version of the kernel you have. Basically, it will let you
       use C notation to check if an event should be processed or
       not.

           ==, >=, <=, >, <, &, |, && and ||

       The above are usually safe to use to compare fields.

This looks overkill to me (consider a large number of bits set in the mask):

+#define trace_ipi_send_cpumask(callsite, mask) do {			\
+	if (static_key_false(&__tracepoint_ipi_send_cpu.key)) {	\
+		int cpu;						\
+		for_each_cpu(cpu, mask)					\
+			trace_ipi_send_cpu(callsite, cpu);		\
+	}								\
+} while (0)

>
> Because of the above points, this is introducing a new tracepoint.
>
> Patches
> =======
>
> This results in having trace events for:
>
> o smp_call_function*()
> o smp_send_reschedule()
> o irq_work_queue*()
>
> This is incomplete; just looking at arm64 there are more IPI types that aren't covered:
>
>   IPI_CPU_STOP,
>   IPI_CPU_CRASH_STOP,
>   IPI_TIMER,
>   IPI_WAKEUP,
>
> ... But it feels like a good starting point.

Can't you have a single tracepoint (or variant with cpumask) that would
cover such cases as well?

Maybe (as parameters for tracepoint):

* type (reschedule, smp_call_function, timer, wakeup, ...).

* function address: valid for smp_call_function and irq_work_queue
  types.

> Another thing worth mentioning is that depending on the callsite, the _RET_IP_
> fed to the tracepoint is not always useful - generic_exec_single() doesn't tell
> you much about the actual callback being sent via IPI, so there might be value
> in exploding the single tracepoint into at least one variant for smp_calls.

Not sure I grasp what you mean by "exploding the single tracepoint...",
but yes, knowing the function or irq work function is very useful.

>
> Links
> =====
>
> [1]: https://youtu.be/5gT57y4OzBM?t=14234
>
> Valentin Schneider (5):
>   trace: Add trace_ipi_send_{cpu, cpumask}
>   sched, smp: Trace send_call_function_single_ipi()
>   smp: Add a multi-CPU variant to send_call_function_single_ipi()
>   irq_work: Trace calls to arch_irq_work_raise()
>   treewide: Rename and trace arch-definitions of smp_send_reschedule()
>
>  arch/alpha/kernel/smp.c          |  2 +-
>  arch/arc/kernel/smp.c            |  2 +-
>  arch/arm/kernel/smp.c            |  5 +----
>  arch/arm64/kernel/smp.c          |  3 +--
>  arch/csky/kernel/smp.c           |  2 +-
>  arch/hexagon/kernel/smp.c        |  2 +-
>  arch/ia64/kernel/smp.c           |  4 ++--
>  arch/loongarch/include/asm/smp.h |  2 +-
>  arch/mips/include/asm/smp.h      |  2 +-
>  arch/openrisc/kernel/smp.c       |  2 +-
>  arch/parisc/kernel/smp.c         |  4 ++--
>  arch/powerpc/kernel/smp.c        |  4 ++--
>  arch/riscv/kernel/smp.c          |  4 ++--
>  arch/s390/kernel/smp.c           |  2 +-
>  arch/sh/kernel/smp.c             |  2 +-
>  arch/sparc/kernel/smp_32.c       |  2 +-
>  arch/sparc/kernel/smp_64.c       |  2 +-
>  arch/x86/include/asm/smp.h       |  2 +-
>  arch/xtensa/kernel/smp.c         |  2 +-
>  include/linux/smp.h              |  1 +
>  include/trace/events/ipi.h       | 27 +++++++++++++++++++++++++++
>  kernel/irq_work.c                | 12 +++++++++++-
>  kernel/sched/core.c              |  7 +++++--
>  kernel/smp.c                     | 18 +++++++++++++++++-
>  24 files changed, 84 insertions(+), 31 deletions(-)
>
> --
> 2.31.1
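For completeness, the trace_ipi_send_cpu event that the macro above
references could be as simple as this (untested sketch; the event and
field layout are made up for illustration):

TRACE_EVENT(ipi_send_cpu,

	TP_PROTO(unsigned long callsite, unsigned int cpu),

	TP_ARGS(callsite, cpu),

	TP_STRUCT__entry(
		__field(unsigned long, callsite)
		__field(unsigned int, cpu)
	),

	TP_fast_assign(
		__entry->callsite = callsite;
		__entry->cpu = cpu;
	),

	/* %pS resolves the raw callsite address to a symbol name */
	TP_printk("callsite=%pS target_cpu=%u",
		  (void *)__entry->callsite, __entry->cpu)
);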
On Fri, 7 Oct 2022 17:01:33 -0300
Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > As for the targeted CPUs, the existing tracepoint does export them, albeit in
> > cpumask form, which is quite inconvenient from a tooling perspective. For
> > instance, as far as I'm aware, it's not possible to do event filtering on a
> > cpumask via trace-cmd.
>
> https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html
>
>    -f filter
>        Specify a filter for the previous event. This must come after
>        a -e. This will filter what events get recorded based on the
>        content of the event. Filtering is passed to the kernel
>        directly so what filtering is allowed may depend on what
>        version of the kernel you have. Basically, it will let you
>        use C notation to check if an event should be processed or
>        not.
>
>            ==, >=, <=, >, <, &, |, && and ||
>
>        The above are usually safe to use to compare fields.

We could always add an "isset(x)" filter ;-)
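Something like this, say (hypothetical syntax, nothing parses this today;
ipi_raise/target_cpus are the existing arm64 event and its cpumask field):

	trace-cmd record -e ipi_raise -f 'isset(target_cpus, 2)'

i.e. only record IPIs that target CPU 2.

-- Steve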
+Cc Douglas

On 07/10/22 17:01, Marcelo Tosatti wrote:
> Hi Valentin,
>
> On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote:
>> Background
>> ==========
>>
>> As for the targeted CPUs, the existing tracepoint does export them, albeit in
>> cpumask form, which is quite inconvenient from a tooling perspective. For
>> instance, as far as I'm aware, it's not possible to do event filtering on a
>> cpumask via trace-cmd.
>
> https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html
>
>    -f filter
>        Specify a filter for the previous event. This must come after
>        a -e. This will filter what events get recorded based on the
>        content of the event. Filtering is passed to the kernel
>        directly so what filtering is allowed may depend on what
>        version of the kernel you have. Basically, it will let you
>        use C notation to check if an event should be processed or
>        not.
>
>            ==, >=, <=, >, <, &, |, && and ||
>
>        The above are usually safe to use to compare fields.
>
> This looks overkill to me (consider a large number of bits set in the mask):
>
> +#define trace_ipi_send_cpumask(callsite, mask) do {			\
> +	if (static_key_false(&__tracepoint_ipi_send_cpu.key)) {	\
> +		int cpu;						\
> +		for_each_cpu(cpu, mask)					\
> +			trace_ipi_send_cpu(callsite, cpu);		\
> +	}								\
> +} while (0)
>

Indeed, I expected pushback on this :-)

I went for this due to how much simpler an int is to process/use compared to
a cpumask. There is the trigger example I listed above, but also the
consumption of the trace event itself. Consider this event collected on an
arm64 QEMU instance (output from trace-cmd):

  <...>-234 [001] 37.251567: ipi_raise: target_mask=00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004 (Function call interrupts)

That sort of formatting has been an issue downstream for things like LISA [1]
where events are aggregated into Pandas tables, and we need to play silly
games for performance reasons because bitmasks aren't a native Python type.

I had a look at libtraceevent to see how this data is exposed and whether
better tooling would be the answer:

tep_get_field_val() just yields an unsigned long long of value 0x200018,
which AFAICT is just the [length, offset] pair associated with dynamic
arrays. Not really usable, and I don't see anything exported in the lib to
extract and use those values.

tep_get_field_raw() is better: it handles the dynamic array for us and
yields a pointer to the cpumask array at the tail of the record. With that
it's easy to get an output such as: cpumask[size=32]=[4,0,0,0,]. Still,
this isn't a native type for many programming languages.

In contrast, this is immediately readable and consumable by userspace tools:

  <...>-234 [001] 37.250882: ipi_send_cpu: callsite=__smp_call_single_queue+0x5c target_cpu=2

Thinking out loud, it makes way more sense to record a cpumask in the
tracepoint, but perhaps we could have a postprocessing step to transform
those into N events each targeting a single CPU?

[1]: https://github.com/ARM-software/lisa/blob/37b51243a94b27ea031ff62bb4ce818a59a7f6ef/lisa/trace.py#L4756

>
>> Because of the above points, this is introducing a new tracepoint.
>>
>> Patches
>> =======
>>
>> This results in having trace events for:
>>
>> o smp_call_function*()
>> o smp_send_reschedule()
>> o irq_work_queue*()
>>
>> This is incomplete; just looking at arm64 there are more IPI types that aren't covered:
>>
>>   IPI_CPU_STOP,
>>   IPI_CPU_CRASH_STOP,
>>   IPI_TIMER,
>>   IPI_WAKEUP,
>>
>> ... But it feels like a good starting point.
>
> Can't you have a single tracepoint (or variant with cpumask) that would
> cover such cases as well?
>
> Maybe (as parameters for tracepoint):
>
> * type (reschedule, smp_call_function, timer, wakeup, ...).
>
> * function address: valid for smp_call_function and irq_work_queue
>   types.
>

That's a good point. I wasn't sure about having a parameter serving as
discriminant for another, but the function address would be either valid or
NULL, which is fine. So perhaps:

o callsite (i.e. _RET_IP_), serves as type
o address of callback tied to IPI, if any
o target CPUs

>> Another thing worth mentioning is that depending on the callsite, the _RET_IP_
>> fed to the tracepoint is not always useful - generic_exec_single() doesn't tell
>> you much about the actual callback being sent via IPI, so there might be value
>> in exploding the single tracepoint into at least one variant for smp_calls.
>
> Not sure I grasp what you mean by "exploding the single tracepoint...",
> but yes, knowing the function or irq work function is very useful.
>

Sorry, I meant having several "specialized" tracepoints instead of a single one.
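FWIW, here's roughly the snippet I used to poke at tep_get_field_raw()
(error handling stripped, and the byte-wise decode assumes the recorded
mask is laid out little-endian):

#include <stdio.h>
#include <event-parse.h>	/* libtraceevent */

/* Dump the CPUs set in the "target_cpus" field of an ipi_raise record */
static void print_target_cpus(struct tep_event *event, struct tep_record *record)
{
	int len = 0;
	unsigned char *mask;

	/* resolves the dynamic-array indirection, points at the mask bytes */
	mask = tep_get_field_raw(NULL, event, "target_cpus", record, &len, 0);
	if (!mask)
		return;

	for (int cpu = 0; cpu < len * 8; cpu++) {
		if (mask[cpu / 8] & (1 << (cpu % 8)))
			printf("cpu%d ", cpu);
	}
	printf("\n");
}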
On 10/11/22 18:17, Valentin Schneider wrote:
> Thinking out loud, it makes way more sense to record a cpumask in the
> tracepoint, but perhaps we could have a postprocessing step to transform
> those into N events each targeting a single CPU?

My approach on the tracers/rtla is to keep things simple in the kernel and
beautify them in user-space.

You could keep the tracepoint as a mask, and then make it pretty, like
cpus=3-5,8, in user-space - for example with a trace-cmd/perf loadable
plugin or a libtracefs helper.

For rtla I was thinking of making a new tool to parse them, and make it
pretty there.
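The range-formatting itself is tiny, something like this (untested sketch,
assuming the mask bytes come straight out of the record, little-endian):

#include <stdio.h>

/* Render a raw bitmask as "3-5,8"-style CPU ranges */
static void print_cpu_ranges(const unsigned char *mask, int len_bytes)
{
	const char *sep = "";
	int start = -1;

	/* the extra iteration with 'set == 0' flushes a trailing range */
	for (int cpu = 0; cpu <= len_bytes * 8; cpu++) {
		int set = cpu < len_bytes * 8 &&
			  (mask[cpu / 8] & (1 << (cpu % 8)));

		if (set && start < 0)
			start = cpu;
		if (!set && start >= 0) {
			if (cpu - 1 == start)
				printf("%s%d", sep, start);
			else
				printf("%s%d-%d", sep, start, cpu - 1);
			sep = ",";
			start = -1;
		}
	}
}

-- Daniel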
On 11/10/22 18:22, Daniel Bristot de Oliveira wrote:
> On 10/11/22 18:17, Valentin Schneider wrote:
>> Thinking out loud, it makes way more sense to record a cpumask in the
>> tracepoint, but perhaps we could have a postprocessing step to transform
>> those into N events each targeting a single CPU?
>
> My approach on the tracers/rtla is to keep things simple in the kernel and
> beautify them in user-space.
>
> You could keep the tracepoint as a mask, and then make it pretty, like
> cpus=3-5,8, in user-space - for example with a trace-cmd/perf loadable
> plugin or a libtracefs helper.
>

That's a nice idea. The one downside I see is that it means registering an
event handler for all events with cpumasks, rather than directly targeting
cpumask fields, but that doesn't look too horrible. I'll dig a bit in that
direction.

> For rtla I was thinking of making a new tool to parse them, and make it
> pretty there.
>
> -- Daniel
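For the plugin route I'm picturing roughly this shape (sketch, untested;
one such registration per cpumask-carrying event):

#include <event-parse.h>	/* libtraceevent */

/* Pretty-print ipi_raise's target_cpus as CPU ranges instead of hex words */
static int ipi_raise_handler(struct trace_seq *s, struct tep_record *record,
			     struct tep_event *event, void *context)
{
	/* ... fetch "target_cpus" via tep_get_field_raw(), emit "3-5,8" ... */
	return 0;
}

int TEP_PLUGIN_LOADER(struct tep_handle *tep)
{
	return tep_register_event_handler(tep, -1, "ipi", "ipi_raise",
					  ipi_raise_handler, NULL);
}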
On Tue, 11 Oct 2022 17:17:04 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> tep_get_field_val() just yields an unsigned long long of value 0x200018,
> which AFAICT is just the [length, offset] pair associated with dynamic
> arrays. Not really usable, and I don't see anything exported in the lib to
> extract and use those values.

Correct.

> tep_get_field_raw() is better: it handles the dynamic array for us and
> yields a pointer to the cpumask array at the tail of the record. With that
> it's easy to get an output such as: cpumask[size=32]=[4,0,0,0,]. Still,
> this isn't a native type for many programming languages.

Yeah, this is the interface that I would have used, and it would likely
require some kind of wrapper to make it into what you prefer.

Note, I've been complaining for some time now about how much I hate the
libtraceevent interface, and I want to rewrite it. (I just spoke to someone
that wants to implement it in Rust.) But the question comes down to how to
make it backward compatible. Perhaps we don't, and we just bump the major
version number (libtraceevent 2.0).

What would you recommend as an API to process cpumasks better?
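For instance, would something of this shape do? (Pure strawman, no such
function exists today:)

/*
 * Hypothetical helper: decode a cpumask field into a bit array,
 * hiding the dynamic-array record layout from the caller entirely.
 * Would return the number of valid bits, or -1 on error.
 */
int tep_get_field_cpumask(struct tep_event *event, struct tep_record *record,
			  const char *name, unsigned long **mask);

-- Steve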
On Tue, 11 Oct 2022 17:40:26 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> > You could keep the tracepoint as a mask, and then make it pretty, like
> > cpus=3-5,8, in user-space - for example with a trace-cmd/perf loadable
> > plugin or a libtracefs helper.
>
> That's a nice idea. The one downside I see is that it means registering an
> event handler for all events with cpumasks, rather than directly targeting
> cpumask fields, but that doesn't look too horrible. I'll dig a bit in that
> direction.

We could just make all dynamic arrays of unsigned long use that format?
I don't know of any other event that has dynamic arrays of unsigned longs,
and doing a search doesn't come up with any.
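The check would live in the library's field-printing path, roughly (sketch;
the exact type string libtraceevent parses out of the format file may differ):

#include <string.h>
#include <event-parse.h>	/* libtraceevent */

/* Would this field decode sensibly as a cpumask? */
static int field_is_cpumask_like(struct tep_format_field *field)
{
	return (field->flags & TEP_FIELD_IS_DYNAMIC) &&
	       strstr(field->type, "unsigned long") != NULL;
}

-- Steve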