Message ID | 20240715160712.127117-3-adrian.hunter@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [V9,01/13] perf/x86/intel/pt: Fix sampling synchronization | expand |
On Mon, Jul 15, 2024 at 07:07:01PM +0300, Adrian Hunter wrote: > Hardware traces, such as instruction traces, can produce a vast amount of > trace data, so being able to reduce tracing to more specific circumstances > can be useful. > > The ability to pause or resume tracing when another event happens, can do > that. > > Add ability for an event to "pause" or "resume" AUX area tracing. > > Add aux_pause bit to perf_event_attr to indicate that, if the event > happens, the associated AUX area tracing should be paused. Ditto > aux_resume. Do not allow aux_pause and aux_resume to be set together. > > Add aux_start_paused bit to perf_event_attr to indicate to an AUX area > event that it should start in a "paused" state. > > Add aux_paused to struct hw_perf_event for AUX area events to keep track of > the "paused" state. aux_paused is initialized to aux_start_paused. > > Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start() > callbacks. Call as needed, during __perf_event_output(). Add Why in __perf_event_output() rather than in __perf_event_overflow(). Specifically, before bpf_overflow_handler(). That is, what do we want BPF to be able to do here? To me it seems strange that BPF would be able to affect this functionality. You want this pause/resume to happen irrespective of how the rest of the event is processed, no? > aux_in_pause_resume to struct perf_buffer to prevent races with the NMI > handler. Pause/resume in NMI context will miss out if it coincides with > another pause/resume. I'm struggling here. That variable is only ever used inside that one function. So it must be self-recursion. Are you thinking something like: swevent_overflow() ... event_aux_pause() <NMI> event_overflow() ... event_aux_pause() ? Where two events in the group, one software and one hardware, are both trying to control the AUX thing? Because I don't think the PT-PMI ever gets here. > To use aux_pause or aux_resume, an event must be in a group with the AUX > area event as the group leader. > @@ -402,6 +411,15 @@ struct pmu { > * > * ->start() with PERF_EF_RELOAD will reprogram the counter > * value, must be preceded by a ->stop() with PERF_EF_UPDATE. > + * > + * ->stop() with PERF_EF_PAUSE will stop as simply as possible. Will not > + * overlap another ->stop() with PERF_EF_PAUSE nor ->start() with > + * PERF_EF_RESUME. > + * > + * ->start() with PERF_EF_RESUME will start as simply as possible but > + * only if the counter is not otherwise stopped. Will not overlap > + * another ->start() with PERF_EF_RESUME nor ->stop() with > + * PERF_EF_PAUSE. > */ > void (*start) (struct perf_event *event, int flags); > void (*stop) (struct perf_event *event, int flags); Notably, they *can* race with ->stop/start without EF_PAUSE/RESUME.
On 18/07/24 12:38, Peter Zijlstra wrote: > On Mon, Jul 15, 2024 at 07:07:01PM +0300, Adrian Hunter wrote: >> Hardware traces, such as instruction traces, can produce a vast amount of >> trace data, so being able to reduce tracing to more specific circumstances >> can be useful. >> >> The ability to pause or resume tracing when another event happens, can do >> that. >> >> Add ability for an event to "pause" or "resume" AUX area tracing. >> >> Add aux_pause bit to perf_event_attr to indicate that, if the event >> happens, the associated AUX area tracing should be paused. Ditto >> aux_resume. Do not allow aux_pause and aux_resume to be set together. >> >> Add aux_start_paused bit to perf_event_attr to indicate to an AUX area >> event that it should start in a "paused" state. >> >> Add aux_paused to struct hw_perf_event for AUX area events to keep track of >> the "paused" state. aux_paused is initialized to aux_start_paused. >> >> Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start() >> callbacks. Call as needed, during __perf_event_output(). Add > > Why in __perf_event_output() rather than in __perf_event_overflow(). > Specifically, before bpf_overflow_handler(). > > That is, what do we want BPF to be able to do here? To me it seems > strange that BPF would be able to affect this functionality. You want > this pause/resume to happen irrespective of how the rest of the event is > processed, no? The thought was to have the output match up with pause/resume, but it doesn't really make much difference. Putting it before the BPF handler is reasonable. > >> aux_in_pause_resume to struct perf_buffer to prevent races with the NMI >> handler. Pause/resume in NMI context will miss out if it coincides with >> another pause/resume. > > I'm struggling here. That variable is only ever used inside that one > function. So it must be self-recursion. Are you thinking something like: > > swevent_overflow() > ... > event_aux_pause() > <NMI> > event_overflow() > ... > event_aux_pause() > > ? > > Where two events in the group, one software and one hardware, are both > trying to control the AUX thing? Exactly that yes. > Because I don't think the PT-PMI ever > gets here. No it doesn't. AUX pause/resume is something a non-AUX event in the group does to the AUX event which is the group leader. > >> To use aux_pause or aux_resume, an event must be in a group with the AUX >> area event as the group leader. > > >> @@ -402,6 +411,15 @@ struct pmu { >> * >> * ->start() with PERF_EF_RELOAD will reprogram the counter >> * value, must be preceded by a ->stop() with PERF_EF_UPDATE. >> + * >> + * ->stop() with PERF_EF_PAUSE will stop as simply as possible. Will not >> + * overlap another ->stop() with PERF_EF_PAUSE nor ->start() with >> + * PERF_EF_RESUME. >> + * >> + * ->start() with PERF_EF_RESUME will start as simply as possible but >> + * only if the counter is not otherwise stopped. Will not overlap >> + * another ->start() with PERF_EF_RESUME nor ->stop() with >> + * PERF_EF_PAUSE. >> */ >> void (*start) (struct perf_event *event, int flags); >> void (*stop) (struct perf_event *event, int flags); > > Notably, they *can* race with ->stop/start without EF_PAUSE/RESUME. Yes that would be worth adding to the comments.
On Thu, Jul 18, 2024 at 02:19:03PM +0300, Adrian Hunter wrote: > On 18/07/24 12:38, Peter Zijlstra wrote: > > On Mon, Jul 15, 2024 at 07:07:01PM +0300, Adrian Hunter wrote: > >> Hardware traces, such as instruction traces, can produce a vast amount of > >> trace data, so being able to reduce tracing to more specific circumstances > >> can be useful. > >> > >> The ability to pause or resume tracing when another event happens, can do > >> that. > >> > >> Add ability for an event to "pause" or "resume" AUX area tracing. > >> > >> Add aux_pause bit to perf_event_attr to indicate that, if the event > >> happens, the associated AUX area tracing should be paused. Ditto > >> aux_resume. Do not allow aux_pause and aux_resume to be set together. > >> > >> Add aux_start_paused bit to perf_event_attr to indicate to an AUX area > >> event that it should start in a "paused" state. > >> > >> Add aux_paused to struct hw_perf_event for AUX area events to keep track of > >> the "paused" state. aux_paused is initialized to aux_start_paused. > >> > >> Add PERF_EF_PAUSE and PERF_EF_RESUME modes for ->stop() and ->start() > >> callbacks. Call as needed, during __perf_event_output(). Add > > > > Why in __perf_event_output() rather than in __perf_event_overflow(). > > Specifically, before bpf_overflow_handler(). > > > > That is, what do we want BPF to be able to do here? To me it seems > > strange that BPF would be able to affect this functionality. You want > > this pause/resume to happen irrespective of how the rest of the event is > > processed, no? > > The thought was to have the output match up with pause/resume, but it > doesn't really make much difference. > > Putting it before the BPF handler is reasonable. OK, let me do that and make a few more edits and see if I can stare at that next patch some.
On Thu, Jul 18, 2024 at 01:51:26PM +0200, Peter Zijlstra wrote: > OK, let me do that and make a few more edits and see if I can stare at > that next patch some. I pushed out a stack of patches into queue.git perf/core Could you please double check I didn't wreck anything?
On 18/07/24 15:58, Peter Zijlstra wrote: > On Thu, Jul 18, 2024 at 01:51:26PM +0200, Peter Zijlstra wrote: > >> OK, let me do that and make a few more edits and see if I can stare at >> that next patch some. > > I pushed out a stack of patches into queue.git perf/core > Could you please double check I didn't wreck anything? Looks fine, and seems to work OK in a brief test. Thank you! :-)
On Thu, Jul 18, 2024 at 06:06:16PM +0300, Adrian Hunter wrote: > On 18/07/24 15:58, Peter Zijlstra wrote: > > On Thu, Jul 18, 2024 at 01:51:26PM +0200, Peter Zijlstra wrote: > > > >> OK, let me do that and make a few more edits and see if I can stare at > >> that next patch some. > > > > I pushed out a stack of patches into queue.git perf/core > > Could you please double check I didn't wreck anything? > > Looks fine, and seems to work OK in a brief test. > > Thank you! :-) So should I go ahead and pick the tooling patches since the kernel bits are merged? - Arnaldo
On 26/07/24 17:41, Arnaldo Carvalho de Melo wrote: > On Thu, Jul 18, 2024 at 06:06:16PM +0300, Adrian Hunter wrote: >> On 18/07/24 15:58, Peter Zijlstra wrote: >>> On Thu, Jul 18, 2024 at 01:51:26PM +0200, Peter Zijlstra wrote: >>> >>>> OK, let me do that and make a few more edits and see if I can stare at >>>> that next patch some. >>> >>> I pushed out a stack of patches into queue.git perf/core >>> Could you please double check I didn't wreck anything? >> >> Looks fine, and seems to work OK in a brief test. >> >> Thank you! :-) > > So should I go ahead and pick the tooling patches since the kernel bits > are merged? Not exactly merged. Probably should wait until it is in tip at least.
On Fri, Jul 26, 2024 at 07:00:55PM +0300, Adrian Hunter wrote: > On 26/07/24 17:41, Arnaldo Carvalho de Melo wrote: > > On Thu, Jul 18, 2024 at 06:06:16PM +0300, Adrian Hunter wrote: > >> On 18/07/24 15:58, Peter Zijlstra wrote: > >>> On Thu, Jul 18, 2024 at 01:51:26PM +0200, Peter Zijlstra wrote: > >>> > >>>> OK, let me do that and make a few more edits and see if I can stare at > >>>> that next patch some. > >>> > >>> I pushed out a stack of patches into queue.git perf/core > >>> Could you please double check I didn't wreck anything? > >> > >> Looks fine, and seems to work OK in a brief test. > >> > >> Thank you! :-) > > > > So should I go ahead and pick the tooling patches since the kernel bits > > are merged? > > Not exactly merged. Probably should wait until it is in tip at least. Ok, so I get just these, as you asked on another message: acme@x1:~/git/perf-tools-next$ git log --oneline -5 perf-tools-next/tmp.perf-tools-next 9140fec01b2de8d3 (HEAD -> perf-tools-next, perf-tools-next/tmp.perf-tools-next, perf-tools-next.korg/tmp.perf-tools-next, number/perf-tools-next, acme.korg/tmp.perf-tools-next) perf tools: Enable evsel__is_aux_event() to work for S390_CPUMSF c3b7dba6ea81a5b1 perf tools: Enable evsel__is_aux_event() to work for ARM/ARM64 866400c0b08ef9d9 perf scripts python cs-etm: Restore first sample log in verbose mode 08ee74eb03e37191 perf cs-etm: Output 0 instead of 0xdeadbeef when exception packets are flushed c22dd7ec2b2808b2 perf inject: Convert comma to semicolon acme@x1:~/git/perf-tools-next$ It'll go to perf-tools-next once 6.11-rc1 is out. - Arnaldo
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index a5304ae8c654..2a3fa5e3f925 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -222,6 +222,12 @@ struct hw_perf_event { int state; + /* + * For AUX area events, aux_paused cannot be a state flag because it can + * be updated asynchronously to state. + */ + unsigned int aux_paused; + /* * The last observed hardware counter value, updated with a * local64_cmpxchg() such that pmu::read() can be called nested. @@ -291,6 +297,7 @@ struct perf_event_pmu_context; #define PERF_PMU_CAP_NO_EXCLUDE 0x0040 #define PERF_PMU_CAP_AUX_OUTPUT 0x0080 #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100 +#define PERF_PMU_CAP_AUX_PAUSE 0x0200 struct perf_output_handle; @@ -363,6 +370,8 @@ struct pmu { #define PERF_EF_START 0x01 /* start the counter when adding */ #define PERF_EF_RELOAD 0x02 /* reload the counter when starting */ #define PERF_EF_UPDATE 0x04 /* update the counter when stopping */ +#define PERF_EF_PAUSE 0x08 /* AUX area event, pause tracing */ +#define PERF_EF_RESUME 0x10 /* AUX area event, resume tracing */ /* * Adds/Removes a counter to/from the PMU, can be done inside a @@ -402,6 +411,15 @@ struct pmu { * * ->start() with PERF_EF_RELOAD will reprogram the counter * value, must be preceded by a ->stop() with PERF_EF_UPDATE. + * + * ->stop() with PERF_EF_PAUSE will stop as simply as possible. Will not + * overlap another ->stop() with PERF_EF_PAUSE nor ->start() with + * PERF_EF_RESUME. + * + * ->start() with PERF_EF_RESUME will start as simply as possible but + * only if the counter is not otherwise stopped. Will not overlap + * another ->start() with PERF_EF_RESUME nor ->stop() with + * PERF_EF_PAUSE. */ void (*start) (struct perf_event *event, int flags); void (*stop) (struct perf_event *event, int flags); diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 3a64499b0f5d..0c557f0a17b3 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -511,7 +511,16 @@ struct perf_event_attr { __u16 sample_max_stack; __u16 __reserved_2; __u32 aux_sample_size; - __u32 __reserved_3; + + union { + __u32 aux_action; + struct { + __u32 aux_start_paused : 1, /* start AUX area tracing paused */ + aux_pause : 1, /* on overflow, pause AUX area tracing */ + aux_resume : 1, /* on overflow, resume AUX area tracing */ + __reserved_3 : 29; + }; + }; /* * User provided data if sigtrap=1, passed back to user via diff --git a/kernel/events/core.c b/kernel/events/core.c index 8f908f077935..3400056f2cfe 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2097,7 +2097,8 @@ static void perf_put_aux_event(struct perf_event *event) static bool perf_need_aux_event(struct perf_event *event) { - return !!event->attr.aux_output || !!event->attr.aux_sample_size; + return event->attr.aux_output || event->attr.aux_sample_size || + event->attr.aux_pause || event->attr.aux_resume; } static int perf_get_aux_event(struct perf_event *event, @@ -2122,6 +2123,10 @@ static int perf_get_aux_event(struct perf_event *event, !perf_aux_output_match(event, group_leader)) return 0; + if ((event->attr.aux_pause || event->attr.aux_resume) && + !(group_leader->pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE)) + return 0; + if (event->attr.aux_sample_size && !group_leader->pmu->snapshot_aux) return 0; @@ -7878,6 +7883,51 @@ void perf_prepare_header(struct perf_event_header *header, WARN_ON_ONCE(header->size & 7); } +static void __perf_event_aux_pause(struct perf_event *event, bool pause) +{ + if (pause) { + if (!event->hw.aux_paused) { + event->hw.aux_paused = 1; + event->pmu->stop(event, PERF_EF_PAUSE); + } + } else { + if (event->hw.aux_paused) { + event->hw.aux_paused = 0; + event->pmu->start(event, PERF_EF_RESUME); + } + } +} + +static void perf_event_aux_pause(struct perf_event *event, bool pause) +{ + struct perf_buffer *rb; + unsigned long flags; + + if (WARN_ON_ONCE(!event)) + return; + + rb = ring_buffer_get(event); + if (!rb) + return; + + local_irq_save(flags); + /* + * Guard against NMI, NMI loses here. The critical section is demarked by + * rb->aux_in_pause_resume == 1. An NMI in the critical section will not + * process a pause/resume. + */ + if (READ_ONCE(rb->aux_in_pause_resume)) + goto out_restore; + WRITE_ONCE(rb->aux_in_pause_resume, 1); + barrier(); + __perf_event_aux_pause(event, pause); + barrier(); + WRITE_ONCE(rb->aux_in_pause_resume, 0); +out_restore: + local_irq_restore(flags); + ring_buffer_put(rb); +} + static __always_inline int __perf_event_output(struct perf_event *event, struct perf_sample_data *data, @@ -7891,6 +7941,9 @@ __perf_event_output(struct perf_event *event, struct perf_event_header header; int err; + if (event->attr.aux_pause) + perf_event_aux_pause(event->aux_event, true); + /* protect the callchain buffers */ rcu_read_lock(); @@ -7907,6 +7960,10 @@ __perf_event_output(struct perf_event *event, exit: rcu_read_unlock(); + + if (event->attr.aux_resume) + perf_event_aux_pause(event->aux_event, false); + return err; } @@ -12060,11 +12117,24 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, } if (event->attr.aux_output && - !(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT)) { + (!(pmu->capabilities & PERF_PMU_CAP_AUX_OUTPUT) || + event->attr.aux_pause || event->attr.aux_resume)) { err = -EOPNOTSUPP; goto err_pmu; } + if (event->attr.aux_pause && event->attr.aux_resume) { + err = -EINVAL; + goto err_pmu; + } + + if (event->attr.aux_start_paused && + !(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE)) { + err = -EOPNOTSUPP; + goto err_pmu; + } + event->hw.aux_paused = event->attr.aux_start_paused; + if (cgroup_fd != -1) { err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader); if (err) @@ -12860,7 +12930,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, * Grouping is not supported for kernel events, neither is 'AUX', * make sure the caller's intentions are adjusted. */ - if (attr->aux_output) + if (attr->aux_output || attr->aux_action) return ERR_PTR(-EINVAL); event = perf_event_alloc(attr, cpu, task, NULL, NULL, diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 5150d5f84c03..3320f78117dc 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -51,6 +51,7 @@ struct perf_buffer { void (*free_aux)(void *); refcount_t aux_refcount; int aux_in_sampling; + int aux_in_pause_resume; void **aux_pages; void *aux_priv;