[v1,1/4] perf: Allow periodic events to alternate between two sample periods

Message ID 20241107160721.1401614-2-deepak.surti@arm.com (mailing list archive)
State New, archived
Series A mechanism for efficient support for per-function metrics

Commit Message

Deepak Surti Nov. 7, 2024, 4:07 p.m. UTC
From: Ben Gainey <ben.gainey@arm.com>

This change modifies perf_event_attr to add a second, alternative
sample period field, and modifies the core perf overflow handling
such that, when specified, an event will alternate between two sample
periods.

Currently, perf does not provide a mechanism for decoupling the period
over which counters are counted from the period between samples. This is
problematic for building a tool to measure per-function metrics derived
from a sampled counter group. Ideally such a tool wants a very small
sample window in order to correctly attribute the metrics to a given
function, but prefers a larger sample period that provides representative
coverage without excessive probe effect, triggering throttling, or
generating excessive amounts of data.

By alternating between a long and short sample_period and subsequently
discarding the long samples, tools may decouple the period between
samples that the tool cares about from the window of time over which
interesting counts are collected.

It is expected that tools would typically use this feature with the
cycles or instructions events as an approximation for time, but no
restriction is placed on which events it can be applied to.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 include/linux/perf_event.h      |  5 +++++
 include/uapi/linux/perf_event.h |  3 +++
 kernel/events/core.c            | 39 +++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)
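
For context, a userspace consumer of this interface might look roughly as
follows. This is an illustrative sketch, not part of the patch: it assumes
the alternative_sample_period field lands in the uapi as proposed here, and
it uses the usual raw-syscall idiom for perf_event_open.

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/*
	 * Illustrative only. Open a cycles event that counts for a long
	 * period between samples, then switches to a short window so that
	 * the following sample's counts attribute tightly to one function.
	 * Requires headers that carry the alternative_sample_period field
	 * proposed in this patch.
	 */
	static int open_alternating_event(void)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;
		attr.sample_period = 1000000;		/* long: spacing between samples */
		attr.alternative_sample_period = 1000;	/* short: counting window */
		attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_READ;
		attr.disabled = 1;

		/* pid 0 (self), cpu -1 (any), no group fd, no flags */
		return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	}

A tool built on this would keep only the samples that close the short
window and discard those that close the long one.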

Comments

Peter Zijlstra Nov. 14, 2024, 3:01 p.m. UTC | #1
On Thu, Nov 07, 2024 at 04:07:18PM +0000, Deepak Surti wrote:
> From: Ben Gainey <ben.gainey@arm.com>
> 
> This change modifies perf_event_attr to add a second, alternative
> sample period field, and modifies the core perf overflow handling
> such that, when specified, an event will alternate between two sample
> periods.
> 
> Currently, perf does not provide a mechanism for decoupling the period
> over which counters are counted from the period between samples. This is
> problematic for building a tool to measure per-function metrics derived
> from a sampled counter group. Ideally such a tool wants a very small
> sample window in order to correctly attribute the metrics to a given
> function, but prefers a larger sample period that provides representative
> coverage without excessive probe effect, triggering throttling, or
> generating excessive amounts of data.
> 
> By alternating between a long and short sample_period and subsequently
> discarding the long samples, tools may decouple the period between
> samples that the tool cares about from the window of time over which
> interesting counts are collected.

Do you have a link to a paper or something that explains this method?


> +	/*
> +	 * Indicates that the alternative_sample_period is used
> +	 */
> +	bool				using_alternative_sample_period;

I typically prefer variable names that are shorter.


> @@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
>  	    !bpf_overflow_handler(event, data, regs))
>  		return ret;
>  
> +	/*
> +	 * Swap the sample period to the alternative period
> +	 */
> +	if (event->attr.alternative_sample_period) {
> +		bool using_alt = hwc->using_alternative_sample_period;
> +		u64 sample_period = (using_alt ? event->attr.sample_period
> +					       : event->attr.alternative_sample_period);
> +
> +		hwc->sample_period = sample_period;
> +		hwc->using_alternative_sample_period = !using_alt;
> +
> +		if (local64_read(&hwc->period_left) > 0) {
> +			event->pmu->stop(event, PERF_EF_UPDATE);
> +
> +			local64_set(&hwc->period_left, 0);
> +
> +			event->pmu->start(event, PERF_EF_RELOAD);
> +		}

This is quite terrible :-(

Getting here means we just went through the effort of programming the
period and you'll pretty much always hit that 'period_left > 0' case.

Why do we need this case at all? If you don't do this, then the next
overflow will pick the period you just wrote to hwc->sample_period
(although you might want to audit all arch implementations).

Looking at it again, that truncation to 0 is just plain wrong -- always.
Why are you doing this?
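
For reference, the reload path described here follows a common pattern
across PMU drivers. Below is a simplified sketch, loosely modelled on x86's
x86_perf_event_set_period(); it is not verbatim, and details vary by
architecture:

	/*
	 * Sketch of the common set-period pattern. Because the reload
	 * value is derived from hwc->sample_period on every reload, a
	 * period updated in the overflow handler is naturally picked up
	 * at the next overflow, without an explicit stop/start.
	 */
	static int sketch_set_period(struct perf_event *event)
	{
		struct hw_perf_event *hwc = &event->hw;
		s64 left = local64_read(&hwc->period_left);
		s64 period = hwc->sample_period;
		int overflow = 0;

		if (unlikely(left <= -period)) {
			left = period;
			local64_set(&hwc->period_left, left);
			hwc->last_period = period;
			overflow = 1;
		}

		if (unlikely(left <= 0)) {
			left += period;
			local64_set(&hwc->period_left, left);
			hwc->last_period = period;
			overflow = 1;
		}

		/* ... program the hardware to overflow after 'left' events ... */

		return overflow;
	}
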
Deepak Surti Nov. 25, 2024, 5:11 p.m. UTC | #2
On Thu, 2024-11-14 at 16:01 +0100, Peter Zijlstra wrote:
> On Thu, Nov 07, 2024 at 04:07:18PM +0000, Deepak Surti wrote:
> > From: Ben Gainey <ben.gainey@arm.com>
> > 
> > This change modifies perf_event_attr to add a second, alternative
> > sample period field, and modifies the core perf overflow handling
> > such that, when specified, an event will alternate between two
> > sample periods.
> > 
> > Currently, perf does not provide a mechanism for decoupling the
> > period over which counters are counted from the period between
> > samples. This is problematic for building a tool to measure
> > per-function metrics derived from a sampled counter group. Ideally
> > such a tool wants a very small sample window in order to correctly
> > attribute the metrics to a given function, but prefers a larger
> > sample period that provides representative coverage without
> > excessive probe effect, triggering throttling, or generating
> > excessive amounts of data.
> > 
> > By alternating between a long and short sample_period and
> > subsequently discarding the long samples, tools may decouple the
> > period between samples that the tool cares about from the window of
> > time over which interesting counts are collected.
> 
> Do you have a link to a paper or something that explains this method?

Ben had originally authored this as an internal document, but I can look
at publishing it externally. Is there anything in particular about this
method that you are interested in?

> 
> 
> > +	/*
> > +	 * Indicates that the alternative_sample_period is used
> > +	 */
> > +	bool				using_alternative_sample_period;
> 
> I typically prefer variable names that are shorter.

Acknowledged. Will do it in version 2 of the patch.

> 
> 
> > @@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
> >  	    !bpf_overflow_handler(event, data, regs))
> >  		return ret;
> >  
> > +	/*
> > +	 * Swap the sample period to the alternative period
> > +	 */
> > +	if (event->attr.alternative_sample_period) {
> > +		bool using_alt = hwc->using_alternative_sample_period;
> > +		u64 sample_period = (using_alt ? event->attr.sample_period
> > +					       : event->attr.alternative_sample_period);
> > +
> > +		hwc->sample_period = sample_period;
> > +		hwc->using_alternative_sample_period = !using_alt;
> > +
> > +		if (local64_read(&hwc->period_left) > 0) {
> > +			event->pmu->stop(event, PERF_EF_UPDATE);
> > +
> > +			local64_set(&hwc->period_left, 0);
> > +
> > +			event->pmu->start(event, PERF_EF_RELOAD);
> > +		}
> 
> This is quite terrible :-(
> 
> Getting here means we just went through the effort of programming the
> period and you'll pretty much always hit that 'period_left > 0' case.
> 
> Why do we need this case at all? If you don't do this, then the next
> overflow will pick the period you just wrote to hwc->sample_period
> (although you might want to audit all arch implementations).
> 
> Looking at it again, that truncation to 0 is just plain wrong -- always.
> Why are you doing this?

This was due to Ben's lack of familiarity with the codebase when this
was originally written; it replicates what the ioctl handler does to
change the sample period.

But you are right, this is inefficient; we have tested with this
removed, and for SW events and Arm PMU events it appears to be fine.

A quick review suggests the other architecture overflow handlers all
follow the same pattern, so it's probably safe, but we will do some
more validation before version 2 of the patch.

Thanks,
Deepak
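
The simplification discussed above would reduce the overflow-path hunk to
the bare swap. A sketch of what that might look like -- untested here, and
subject to the per-architecture audit mentioned for v2:

	/*
	 * Sketch only: update hwc->sample_period and let the next reload
	 * pick it up, with no stop/start and no truncation of
	 * period_left.
	 */
	if (event->attr.alternative_sample_period) {
		bool using_alt = hwc->using_alternative_sample_period;

		hwc->sample_period = using_alt
			? event->attr.sample_period
			: event->attr.alternative_sample_period;
		hwc->using_alternative_sample_period = !using_alt;
	}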

Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..9910b3c115d0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -270,6 +270,11 @@ struct hw_perf_event {
 	 */
 	u64				freq_time_stamp;
 	u64				freq_count_stamp;
+
+	/*
+	 * Indicates that the alternative_sample_period is used
+	 */
+	bool				using_alternative_sample_period;
 #endif
 };
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4842c36fdf80..bedae424ba36 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -379,6 +379,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7	128	/* add: sig_data */
 #define PERF_ATTR_SIZE_VER8	136	/* add: config3 */
+#define PERF_ATTR_SIZE_VER9	144	/* add: alternative_sample_period */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -522,6 +523,8 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+	__u64	alternative_sample_period;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index df27d08a7232..579b04af70a2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4185,6 +4185,8 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 	s64 period, sample_period;
 	s64 delta;
 
+	WARN_ON_ONCE(hwc->using_alternative_sample_period);
+
 	period = perf_calculate_period(event, nsec, count);
 
 	delta = (s64)(period - hwc->sample_period);
@@ -9806,6 +9808,7 @@ static int __perf_event_overflow(struct perf_event *event,
 				 int throttle, struct perf_sample_data *data,
 				 struct pt_regs *regs)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	int events = atomic_read(&event->event_limit);
 	int ret = 0;
 
@@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
 	    !bpf_overflow_handler(event, data, regs))
 		return ret;
 
+	/*
+	 * Swap the sample period to the alternative period
+	 */
+	if (event->attr.alternative_sample_period) {
+		bool using_alt = hwc->using_alternative_sample_period;
+		u64 sample_period = (using_alt ? event->attr.sample_period
+					       : event->attr.alternative_sample_period);
+
+		hwc->sample_period = sample_period;
+		hwc->using_alternative_sample_period = !using_alt;
+
+		if (local64_read(&hwc->period_left) > 0) {
+			event->pmu->stop(event, PERF_EF_UPDATE);
+
+			local64_set(&hwc->period_left, 0);
+
+			event->pmu->start(event, PERF_EF_RELOAD);
+		}
+	}
+
 	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
@@ -12244,6 +12267,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	local64_set(&hwc->period_left, hwc->sample_period);
 
+	/*
+	 * alternative_sample_period cannot be used with freq
+	 */
+	if (attr->freq && attr->alternative_sample_period)
+		goto err_ns;
+
 	/*
 	 * We do not support PERF_SAMPLE_READ on inherited events unless
 	 * PERF_SAMPLE_TID is also selected, which allows inherited events to
@@ -12700,9 +12729,19 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (attr.freq) {
 		if (attr.sample_freq > sysctl_perf_event_sample_rate)
 			return -EINVAL;
+		if (attr.alternative_sample_period)
+			return -EINVAL;
 	} else {
 		if (attr.sample_period & (1ULL << 63))
 			return -EINVAL;
+		if (attr.alternative_sample_period) {
+			if (!attr.sample_period)
+				return -EINVAL;
+			if (attr.alternative_sample_period & (1ULL << 63))
+				return -EINVAL;
+			if (attr.alternative_sample_period == attr.sample_period)
+				attr.alternative_sample_period = 0;
+		}
 	}
 
 	/* Only privileged users can get physical addresses */