| Message ID | 20241107160721.1401614-2-deepak.surti@arm.com |
|---|---|
| State | New, archived |
| Series | A mechanism for efficient support for per-function metrics |
On Thu, Nov 07, 2024 at 04:07:18PM +0000, Deepak Surti wrote:
> From: Ben Gainey <ben.gainey@arm.com>
>
> This change modifies perf_event_attr to add a second, alternative
> sample period field, and modifies the core perf overflow handling
> such that when specified an event will alternate between two sample
> periods.
>
> Currently, perf does not provide a mechanism for decoupling the period
> over which counters are counted from the period between samples. This is
> problematic for building a tool to measure per-function metrics derived
> from a sampled counter group. Ideally such a tool wants a very small
> sample window in order to correctly attribute the metrics to a given
> function, but prefers a larger sample period that provides representative
> coverage without excessive probe effect, triggering throttling, or
> generating excessive amounts of data.
>
> By alternating between a long and short sample_period and subsequently
> discarding the long samples, tools may decouple the period between
> samples that the tool cares about from the window of time over which
> interesting counts are collected.

Do you have a link to a paper or something that explains this method?

> +	/*
> +	 * Indicates that the alternative_sample_period is used
> +	 */
> +	bool	using_alternative_sample_period;

I typically prefer variable names that are shorter.

> @@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
>  	    !bpf_overflow_handler(event, data, regs))
>  		return ret;
>
> +	/*
> +	 * Swap the sample period to the alternative period
> +	 */
> +	if (event->attr.alternative_sample_period) {
> +		bool using_alt = hwc->using_alternative_sample_period;
> +		u64 sample_period = (using_alt ? event->attr.sample_period
> +				     : event->attr.alternative_sample_period);
> +
> +		hwc->sample_period = sample_period;
> +		hwc->using_alternative_sample_period = !using_alt;
> +
> +		if (local64_read(&hwc->period_left) > 0) {
> +			event->pmu->stop(event, PERF_EF_UPDATE);
> +
> +			local64_set(&hwc->period_left, 0);
> +
> +			event->pmu->start(event, PERF_EF_RELOAD);
> +		}

This is quite terrible :-(

Getting here means we just went through the effort of programming the
period and you'll pretty much always hit that 'period_left > 0' case.

Why do we need this case at all? If you don't do this, then the next
overflow will pick the period you just wrote to hwc->sample_period
(although you might want to audit all arch implementations).

Looking at it again, that truncation to 0 is just plain wrong -- always.
Why are you doing this?
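For readers following the thread: the mechanism described in the cover letter is driven entirely from userspace via perf_event_attr. Below is a minimal sketch of how a tool might open such an event, assuming the UAPI additions proposed in this series (the alternative_sample_period field and PERF_ATTR_SIZE_VER9 do not exist in mainline headers); the perf_event_open() wrapper and the specific period values are illustrative only:

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no stub for perf_event_open(); wrap the raw syscall. */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int open_alternating_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	/* With the patched UAPI header this is PERF_ATTR_SIZE_VER9 (144). */
	attr.size = sizeof(attr);

	/*
	 * Long period: keeps the overall sample (and interrupt) rate low.
	 * Short period: the small window whose counts attribute cleanly
	 * to a single function. Values here are placeholders.
	 */
	attr.sample_period = 1000000;		/* long */
	attr.alternative_sample_period = 1000;	/* short; added by this series */

	/* PERF_SAMPLE_PERIOD lets the tool tell long and short samples apart. */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_PERIOD;
	attr.exclude_kernel = 1;

	return perf_event_open(&attr, 0 /* this thread */, -1, -1, 0);
}
```

In the per-function-metrics use case this event would typically lead a counter group, so that each short window also captures the grouped counts the metrics are derived from.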
On Thu, 2024-11-14 at 16:01 +0100, Peter Zijlstra wrote:
> On Thu, Nov 07, 2024 at 04:07:18PM +0000, Deepak Surti wrote:
> > From: Ben Gainey <ben.gainey@arm.com>
> >
> > This change modifies perf_event_attr to add a second, alternative
> > sample period field, and modifies the core perf overflow handling
> > such that when specified an event will alternate between two sample
> > periods.
> >
> > Currently, perf does not provide a mechanism for decoupling the period
> > over which counters are counted from the period between samples. This is
> > problematic for building a tool to measure per-function metrics derived
> > from a sampled counter group. Ideally such a tool wants a very small
> > sample window in order to correctly attribute the metrics to a given
> > function, but prefers a larger sample period that provides representative
> > coverage without excessive probe effect, triggering throttling, or
> > generating excessive amounts of data.
> >
> > By alternating between a long and short sample_period and subsequently
> > discarding the long samples, tools may decouple the period between
> > samples that the tool cares about from the window of time over which
> > interesting counts are collected.
>
> Do you have a link to a paper or something that explains this method?

Ben had originally authored this as an internal doc but I can look at
publishing externally. Is there anything in particular about this
method that you are interested in?

> > +	/*
> > +	 * Indicates that the alternative_sample_period is used
> > +	 */
> > +	bool	using_alternative_sample_period;
>
> I typically prefer variable names that are shorter.

Acknowledged. Will do it in version 2 of the patch.

> > @@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
> >  	    !bpf_overflow_handler(event, data, regs))
> >  		return ret;
> >
> > +	/*
> > +	 * Swap the sample period to the alternative period
> > +	 */
> > +	if (event->attr.alternative_sample_period) {
> > +		bool using_alt = hwc->using_alternative_sample_period;
> > +		u64 sample_period = (using_alt ? event->attr.sample_period
> > +				     : event->attr.alternative_sample_period);
> > +
> > +		hwc->sample_period = sample_period;
> > +		hwc->using_alternative_sample_period = !using_alt;
> > +
> > +		if (local64_read(&hwc->period_left) > 0) {
> > +			event->pmu->stop(event, PERF_EF_UPDATE);
> > +
> > +			local64_set(&hwc->period_left, 0);
> > +
> > +			event->pmu->start(event, PERF_EF_RELOAD);
> > +		}
>
> This is quite terrible :-(
>
> Getting here means we just went through the effort of programming the
> period and you'll pretty much always hit that 'period_left > 0' case.
>
> Why do we need this case at all? If you don't do this, then the next
> overflow will pick the period you just wrote to hwc->sample_period
> (although you might want to audit all arch implementations).
>
> Looking at it again, that truncation to 0 is just plain wrong -- always.
> Why are you doing this?

This was due to Ben's lack of familiarity with the codebase when this
was originally written; it replicates what the IOCTL handler does to
change the sample period. But you are right that this is inefficient;
we've tested with this removed, and for SW events and Arm PMU events it
appears to be fine. A quick review of the other architecture overflow
handlers suggests they all follow the same pattern, so it's probably
safe, but we will do some more validation on that before version 2 of
the patch.

Thanks,
Deepak
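For concreteness, here is a sketch of the direction this exchange points to for v2: drop the stop()/start() calls and the period_left truncation entirely, flip only hwc->sample_period, and let the PMU pick up the new value the next time the counter is reprogrammed. This is speculative, not committed code, and the shortened field name using_alt_period is illustrative, following Peter's naming comment:

```c
	/*
	 * Fragment of __perf_event_overflow(): flip to the other period
	 * on each overflow. No explicit stop()/start() and no truncation
	 * of period_left; the next reprogramming of the counter reads
	 * hwc->sample_period, so the new period takes effect at the
	 * following overflow.
	 */
	if (event->attr.alternative_sample_period) {
		bool using_alt = hwc->using_alt_period;

		hwc->sample_period = using_alt ? event->attr.sample_period
					       : event->attr.alternative_sample_period;
		hwc->using_alt_period = !using_alt;
	}
```

As Peter notes, this relies on every arch overflow path reloading from hwc->sample_period, which is exactly the audit Deepak commits to above.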
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..9910b3c115d0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -270,6 +270,11 @@ struct hw_perf_event {
 	 */
 	u64				freq_time_stamp;
 	u64				freq_count_stamp;
+
+	/*
+	 * Indicates that the alternative_sample_period is used
+	 */
+	bool				using_alternative_sample_period;
 #endif
 };
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4842c36fdf80..bedae424ba36 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -379,6 +379,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7	128	/* add: sig_data */
 #define PERF_ATTR_SIZE_VER8	136	/* add: config3 */
+#define PERF_ATTR_SIZE_VER9	144	/* add: alternative_sample_period */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -522,6 +523,8 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+	__u64	alternative_sample_period;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index df27d08a7232..579b04af70a2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4185,6 +4185,8 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 	s64 period, sample_period;
 	s64 delta;
 
+	WARN_ON_ONCE(hwc->using_alternative_sample_period);
+
 	period = perf_calculate_period(event, nsec, count);
 
 	delta = (s64)(period - hwc->sample_period);
@@ -9806,6 +9808,7 @@ static int __perf_event_overflow(struct perf_event *event,
 				 int throttle, struct perf_sample_data *data,
 				 struct pt_regs *regs)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	int events = atomic_read(&event->event_limit);
 	int ret = 0;
 
@@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
 	    !bpf_overflow_handler(event, data, regs))
 		return ret;
 
+	/*
+	 * Swap the sample period to the alternative period
+	 */
+	if (event->attr.alternative_sample_period) {
+		bool using_alt = hwc->using_alternative_sample_period;
+		u64 sample_period = (using_alt ? event->attr.sample_period
+				     : event->attr.alternative_sample_period);
+
+		hwc->sample_period = sample_period;
+		hwc->using_alternative_sample_period = !using_alt;
+
+		if (local64_read(&hwc->period_left) > 0) {
+			event->pmu->stop(event, PERF_EF_UPDATE);
+
+			local64_set(&hwc->period_left, 0);
+
+			event->pmu->start(event, PERF_EF_RELOAD);
+		}
+	}
+
 	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
@@ -12244,6 +12267,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	local64_set(&hwc->period_left, hwc->sample_period);
 
+	/*
+	 * alternative_sample_period cannot be used with freq
+	 */
+	if (attr->freq && attr->alternative_sample_period)
+		goto err_ns;
+
 	/*
 	 * We do not support PERF_SAMPLE_READ on inherited events unless
 	 * PERF_SAMPLE_TID is also selected, which allows inherited events to
@@ -12700,9 +12729,19 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (attr.freq) {
 		if (attr.sample_freq > sysctl_perf_event_sample_rate)
 			return -EINVAL;
+		if (attr.alternative_sample_period)
+			return -EINVAL;
 	} else {
 		if (attr.sample_period & (1ULL << 63))
 			return -EINVAL;
+		if (attr.alternative_sample_period) {
+			if (!attr.sample_period)
+				return -EINVAL;
+			if (attr.alternative_sample_period & (1ULL << 63))
+				return -EINVAL;
+			if (attr.alternative_sample_period == attr.sample_period)
+				attr.alternative_sample_period = 0;
+		}
 	}
 
 	/* Only privileged users can get physical addresses */
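On the consuming side, the "discarding the long samples" step from the cover letter reduces to a simple filter in the tool. A minimal sketch, assuming the event was opened with PERF_SAMPLE_PERIOD in sample_type so each record carries the period that produced it; struct sample below is a simplified stand-in, and a real tool must parse the ring-buffer records according to its full sample_type:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in for a decoded PERF_RECORD_SAMPLE with
 * sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_PERIOD.
 */
struct sample {
	uint64_t ip;		/* where the short window ended */
	uint64_t period;	/* events counted in the window that produced this sample */
};

/*
 * Keep only the short-window samples: their counts attribute cleanly
 * to the sampled IP. The long-period samples exist only to space the
 * short windows out across the run, and are discarded.
 */
static bool keep_sample(const struct sample *s, uint64_t short_period)
{
	return s->period == short_period;
}
```

The comparison against the short period is what makes the decoupling work in practice: the overall sample rate is dominated by the long period, while every retained record describes a window of only short_period events.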