diff mbox

arm-cci500: Workaround pmu_event_set_period

Message ID 1443196253-22765-1-git-send-email-suzuki.poulose@arm.com (mailing list archive)
State New, archived

Commit Message

Suzuki K Poulose Sept. 25, 2015, 3:50 p.m. UTC
From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>

The CCI PMU driver sets the event counter to half of its maximum value
(2^31) before starting the counters, via pmu_event_set_period(). This
gives us the best chance of handling the overflow interrupt, even in the
face of extreme interrupt latencies.

However, CCI-500 comes with advanced power-saving schemes which disable
the clock to the event counters unless the counters are enabled to count
(PMCR.CEN). This prevents the driver from writing the period to the
counters before starting them. Also, there is no way to reset an
individual event counter to 0 (PMCR.RST resets all the counters, losing
their current readings). However, the counter value is preserved and can
be read back while the counters are disabled.

So we cannot reliably use the counters to compute the number of events
generated during the sampling period, since we don't know the counter
value at the start.

Here are the possible solutions:

 1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
    - The Control_Override_Reg is secure (and hence not programmable from
      Linux), and also has an impact on power consumption.

 2) Change the order of operations
	i.e,
	a) Program and enable individual counters
	b) Enable counting on all the counters by setting PMCR.CEN
	c) Write the period to the individual counters
	d) Disable the counters
    - This could cause unnecessary noise in the other counters and is
      costly (it would have to be repeated for every enabled counter).

 3) Don't set the counter value, instead use the current count as the
    starting count and compute the delta at the end of sampling.

This patch implements option 3 for CCI-500; CCI-400 behavior remains
unchanged. The problem didn't surface on a fast model, but was revealed
on an FPGA model. Without this patch, profiling on CCI-500 is broken
and should be fixed for 4.3.

Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: arm@kernel.org
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
---
 drivers/bus/arm-cci.c |   42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

Comments

Russell King - ARM Linux Sept. 25, 2015, 4:05 p.m. UTC | #1
As you're posting an update to CCI500, I thought I'd hijack this
thread to complain about the same thing that I complained about
with CCI400:

+config ARM_CCI500_PMU
+       bool "ARM CCI500 PMU support"
+       default y
+       depends on (ARM && CPU_V7) || ARM64
+       depends on PERF_EVENTS
+       select ARM_CCI_PMU

Why "default y"?  Not all ARMv7 CPUs have the CCI bus, and it means
that, moving forward, it's a pain to remember to turn the option off
amongst all the other Kconfig questions.

Only make things "default y" if it was something that was built in
a previous kernel and not building it in a later kernel would cause
a regression, or where it's absolutely necessary to do so.

Thanks.
Suzuki K Poulose Sept. 25, 2015, 4:09 p.m. UTC | #2
On 25/09/15 17:05, Russell King - ARM Linux wrote:
> As you're posting an update to CCI500, I thought I'd hijack this
> thread to complain about the same thing that I complained about
> with CCI400:
>
> +config ARM_CCI500_PMU
> +       bool "ARM CCI500 PMU support"
> +       default y
> +       depends on (ARM && CPU_V7) || ARM64
> +       depends on PERF_EVENTS
> +       select ARM_CCI_PMU
>
> Why "default y" ?  Not all ARMv7 CPUs have the CCI bus, and what it
> means is that when moving forward, it's a pain to remember to turn
> the option off amongst all the other Kconfig questions.
>
> Only make things "default y" if it was something that was built in
> a previous kernel and not building it in a later kernel would cause
> a regression, or where it's absolutely necessary to do so.

Sure, we can turn that off by default. I will send a patch.

Suzuki
Mark Rutland Sept. 28, 2015, 3:38 p.m. UTC | #3
On Fri, Sep 25, 2015 at 04:50:53PM +0100, Suzuki K. Poulose wrote:
> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>
> 
> The CCI PMU driver sets the event counter to the half of the maximum
> value(2^31) it can count before we start the counters via
> pmu_event_set_period(). This is done to give us the best chance to
> handle the overflow interrupt, taking care of extreme interrupt latencies.
> 
> However, CCI-500 comes with advanced power saving schemes, which disables
> the clock to the event counters unless the counters are enabled to count
> (PMCR.CEN). This prevents the driver from writing the period to the
> counters before starting them.  Also, there is no way we can reset the
> individual event counter to 0 (PMCR.RST resets all the counters, losing
> their current readings). However the value of the counter is preserved and
> could be read back, when the counters are not enabled.

Does reset work while the counters are clock gated?

I take it that we can configure the counters while they are disabled,
even if we can't alter the counter value?

> So we cannot reliably use the counters and compute the number of events
> generated during the sampling period since we don't have the value of the
> counter at start.
> 
> Here are the possible solutions:
> 
>  1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
>     - The Control_Override_Reg is secure (and hence not programmable from
>       Linux), and also has an impact on power consumption.
> 
>  2) Change the order of operations
> 	i.e,
> 	a) Program and enable individual counters
> 	b) Enable counting on all the counters by setting PMCR.CEN
> 	c) Write the period to the individual counters
> 	d) Disable the counters
>     - This could cause unnecessary noise in the other counters and is
>       costly (we should repeat this for all enabled counters).
> 
>  3) Don't set the counter value, instead use the current count as the
>     starting count and compute the delta at the end of sampling.

This leaves us less broken than we currently are, but as far as I can
tell it puts us back where we started, with the overflow issues seen
elsewhere, cf:

a737823d37666255 ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency")
5727347180ebc6b4 ("ARM: 7354/1: perf: limit sample_period to half max_period in non-sampling mode")

> @@ -792,15 +803,33 @@ static void pmu_read(struct perf_event *event)
>  void pmu_event_set_period(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
> +	struct cci_pmu *cci_pmu = to_cci_pmu(event->pmu);
> +	u64 val = 0;
> +
>  	/*
> -	 * The CCI PMU counters have a period of 2^32. To account for the
> -	 * possiblity of extreme interrupt latency we program for a period of
> -	 * half that. Hopefully we can handle the interrupt before another 2^31
> -	 * events occur and the counter overtakes its previous value.
> +	 * If the PMU gates the clock to PMU event counters when the counters
> +	 * are not enabled, any write to them is ineffective. Hence we cannot
> +	 * set a value reliably. Also, we cannot reset an individual event
> +	 * counter; PMCR.RST resets all the counters. However the existing
> +	 * counter values can be read back. Hence, we use the existing counter
> +	 * value as the period and set the prev_count accordingly. This is
> +	 * safe, since we don't support sampling of events anyway.
>  	 */
> -	u64 val = 1ULL << 31;
> +	if (pmu_ignores_cntr_write(cci_pmu)) {
> +		val = pmu_read_counter(event);
> +	} else {
> +		/*
> +		 * The CCI PMU counters have a period of 2^32. To account for
> +		 * the possibility of extreme interrupt latency we program for
> +		 * a period of half that. Hopefully we can handle the interrupt
> +		 * before another 2^31 events occur and the counter overtakes
> +		 * its previous value.
> +		 */
> +		val = 1ULL << 31;
> +		pmu_write_counter(event, val);
> +	}
> +
>  	local64_set(&hwc->prev_count, val);
> -	pmu_write_counter(event, val);
>  }

We could get rid of the flag and always do:

	pmu_write_counter(event, 1UL << 31);
	val = pmu_read_counter(event);

That should work for CCI-400 and CCI-500, and if the CCI-500 counters
aren't clock-gated (e.g. because of FW configuration) then we get the
period that we actually wanted.

However, can't we use an event which won't count to solve the race with
option 2?

Per the TRM, events for unimplemented interfaces don't count, though I'm
not sure if there's an event that's guaranteed to not count in all
configurations. Perhaps we might be able to dig one up.

Thanks,
Mark.
Suzuki K Poulose Sept. 29, 2015, 9:13 a.m. UTC | #4
On 28/09/15 16:38, Mark Rutland wrote:
> On Fri, Sep 25, 2015 at 04:50:53PM +0100, Suzuki K. Poulose wrote:
>> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>
>>
>> The CCI PMU driver sets the event counter to the half of the maximum
>> value(2^31) it can count before we start the counters via
>> pmu_event_set_period(). This is done to give us the best chance to
>> handle the overflow interrupt, taking care of extreme interrupt latencies.
>>
>> However, CCI-500 comes with advanced power saving schemes, which disables
>> the clock to the event counters unless the counters are enabled to count
>> (PMCR.CEN). This prevents the driver from writing the period to the
>> counters before starting them.  Also, there is no way we can reset the
>> individual event counter to 0 (PMCR.RST resets all the counters, losing
>> their current readings). However the value of the counter is preserved and
>> could be read back, when the counters are not enabled.
>
> Does reset work while the counters are clock gated?

No, it doesn't. Both PMCR.CEN and Count Control must be enabled to
alter the value of the counter.

>
> I take it that we can configure the counters while they are disabled,
> even if we can't alter the counter value?

We can program the events for the counter, but we can't alter the counter
value.

>
>> So we cannot reliably use the counters and compute the number of events
>> generated during the sampling period since we don't have the value of the
>> counter at start.
>>
>> Here are the possible solutions:
>>
>>   1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
>>      - The Control_Override_Reg is secure (and hence not programmable from
>>        Linux), and also has an impact on power consumption.
>>
>>   2) Change the order of operations
>> 	i.e,
>> 	a) Program and enable individual counters
>> 	b) Enable counting on all the counters by setting PMCR.CEN
>> 	c) Write the period to the individual counters
>> 	d) Disable the counters
>>      - This could cause unnecessary noise in the other counters and is
>>        costly (we should repeat this for all enabled counters).
>>
>>   3) Don't set the counter value, instead use the current count as the
>>      starting count and compute the delta at the end of sampling.
>
> This leaves us less broken than we currently are, but as far as I can
> tell, this leaves us back where we started, with the overflow issues
> seen elsewhere, cf:
>
> a737823d37666255 ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency")
> 5727347180ebc6b4 ("ARM: 7354/1: perf: limit sample_period to half max_period in non-sampling mode")

Thanks for the pointers. Even now, without this change, we could
theoretically run into the overflow, couldn't we? So we should fix it
for good.

>> @@ -792,15 +803,33 @@ static void pmu_read(struct perf_event *event)
>>   void pmu_event_set_period(struct perf_event *event)
>>   {
>>   	struct hw_perf_event *hwc = &event->hw;
>> +	struct cci_pmu *cci_pmu = to_cci_pmu(event->pmu);
>> +	u64 val = 0;
>> +
>>   	/*
>> -	 * The CCI PMU counters have a period of 2^32. To account for the
>> -	 * possiblity of extreme interrupt latency we program for a period of
>> -	 * half that. Hopefully we can handle the interrupt before another 2^31
>> -	 * events occur and the counter overtakes its previous value.
>> +	 * If the PMU gates the clock to PMU event counters when the counters
>> +	 * are not enabled, any write to them is ineffective. Hence we cannot
>> +	 * set a value reliably. Also, we cannot reset an individual event
>> +	 * counter; PMCR.RST resets all the counters. However the existing
>> +	 * counter values can be read back. Hence, we use the existing counter
>> +	 * value as the period and set the prev_count accordingly. This is
>> +	 * safe, since we don't support sampling of events anyway.
>>   	 */
>> -	u64 val = 1ULL << 31;
>> +	if (pmu_ignores_cntr_write(cci_pmu)) {
>> +		val = pmu_read_counter(event);
>> +	} else {
>> +		/*
>> +		 * The CCI PMU counters have a period of 2^32. To account for
>> +		 * the possibility of extreme interrupt latency we program for
>> +		 * a period of half that. Hopefully we can handle the interrupt
>> +		 * before another 2^31 events occur and the counter overtakes
>> +		 * its previous value.
>> +		 */
>> +		val = 1ULL << 31;
>> +		pmu_write_counter(event, val);
>> +	}
>> +
>>   	local64_set(&hwc->prev_count, val);
>> -	pmu_write_counter(event, val);
>>   }
>
> We could get rid of the flag and always do:
>
> 	pmu_write_counter(event, 1UL << 31);
> 	val = pmu_read_counter(event);
>
> That should work for CCI-400 and CCI-500, and if the CCI-500 counters
> aren't clock-gated (e.g. because of FW configuration) then we get the
> period that we actually wanted.

Yes, that looks neater. I will change it and add some comments.

>
> However, can't we use an event which won't count to solve the race with
> option 2?

Even if we found one, wouldn't we have to set the 'no-op' event on all
the other counters to prevent them from counting while we change a
counter value? And it gets complicated with or without event grouping.

>
> Per the TRM, events for unimplemented interfaces don't count, though I'm
> not sure if there's an event that's guaranteed to not count in all
> configurations. Perhaps we might be able to dig one up.

IIRC, there was no mention of a single event which falls into that
category. I think solution 3 is much cleaner than finding and using a
dummy event.

Suzuki
Mark Rutland Sept. 29, 2015, 10:30 a.m. UTC | #5
Hi,

> > Does reset work while the counters are clock gated?
> 
> No. It doesn't. Both PMCR.CEN and Count Control should be turned on to
> alter the value of the counter.  

Ok. I take it PMCR.CEN is global and Count Control is per-counter?

> > I take it that we can configure the counters while they are disabled,
> > even if we can't alter the counter value?
> 
> We can program the events for the counter, but we can't alter the counter
> value.

Ok, that's something, at least.

> >> So we cannot reliably use the counters and compute the number of events
> >> generated during the sampling period since we don't have the value of the
> >> counter at start.
> >>
> >> Here are the possible solutions:
> >>
> >>   1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
> >>      - The Control_Override_Reg is secure (and hence not programmable from
> >>        Linux), and also has an impact on power consumption.
> >>
> >>   2) Change the order of operations
> >> 	i.e,
> >> 	a) Program and enable individual counters
> >> 	b) Enable counting on all the counters by setting PMCR.CEN
> >> 	c) Write the period to the individual counters
> >> 	d) Disable the counters
> >>      - This could cause unnecessary noise in the other counters and is
> >>        costly (we should repeat this for all enabled counters).
> >>
> >>   3) Don't set the counter value, instead use the current count as the
> >>      starting count and compute the delta at the end of sampling.
> >
> > This leaves us less broken than we currently are, but as far as I can
> > tell, this leaves us back where we started, with the overflow issues
> > seen elsewhere, cf:
> >
> > a737823d37666255 ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency")
> > 5727347180ebc6b4 ("ARM: 7354/1: perf: limit sample_period to half max_period in non-sampling mode")
> 
> Thanks for the pointers. Even now, without this change, theoretically
> we could run into the overflow, isn't it ? So we should fix it for ever.

It's not really fixable as such (given sufficiently frequent events and
large enough IRQ latency there's simply nothing you can do). The current
behaviour on CCI-400 is about the best we can do, so we should aim for
that on CCI-500 if possible.

> > We could get rid of the flag and always do:
> >
> > 	pmu_write_counter(event, 1UL << 31);
> > 	val = pmu_read_counter(event);
> >
> > That should work for CCI-400 and CCI-500, and if the CCI-500 counters
> > aren't clock-gated (e.g. because of FW configuration) then we get the
> > period that we actually wanted.
> 
> Yes, thats looks neater. I will change it with some comments in there.

Great.

That still leaves us with racy overflow handling on a clock gated
CCI-500 PMU, which we should aim to avoid if possible.

> > However, can't we use an event which won't count to solve the race with
> > option 2?
> 
> Even if we found one, won't we have to set the 'no-op' event on all counters,
> if we want to change a counter value to prevent other counters from counting?

Can't we just clear the counter_enable bits for the other counters?

We could handle that in pmu_{disable,enable} to amortize the cost (or
even do it lazily on the first re-programming).

> And it gets complicated with/without event grouping.

I don't see why grouping would be any more complicated by this. What am
I missing?

> > Per the TRM, events for unimplemented interfaces don't count, though I'm
> > not sure if there's an event that's guaranteed to not count in all
> > configurations. Perhaps we might be able to dig one up.
> 
> IIRC, there wasn't any mention about a single event which falls into that
> category. I think solution 3 is much cleaner than finding and using a dummy
> event.

I agree that it is simpler, but it leaves us with a worryingly high
chance of coming up with erroneous values, which detracts from the
usefulness of the PMU.

Thanks,
Mark.

Patch

diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
index 577cc4b..7db99dc 100644
--- a/drivers/bus/arm-cci.c
+++ b/drivers/bus/arm-cci.c
@@ -88,6 +88,10 @@  static const struct of_device_id arm_cci_matches[] = {
 #define CCI_PMU_MAX_HW_CNTRS(model) \
 	((model)->num_hw_cntrs + (model)->fixed_hw_cntrs)
 
+/*	 CCI PMU flags	 */
+/* Writes to CNTR are ignored, see pmu_event_set_period() */
+#define CCI_PMU_IGNORE_CNTR_WRITE	(1 << 0)
+
 /* Types of interfaces that can generate events */
 enum {
 	CCI_IF_SLAVE,
@@ -118,6 +122,7 @@  struct cci_pmu;
  */
 struct cci_pmu_model {
 	char *name;
+	u32 flags;
 	u32 fixed_hw_cntrs;
 	u32 num_hw_cntrs;
 	u32 cntr_size;
@@ -472,6 +477,7 @@  static inline struct cci_pmu_model *probe_cci_model(struct platform_device *pdev
 #define CCI500_GLOBAL_PORT_MIN_EV	0x00
 #define CCI500_GLOBAL_PORT_MAX_EV	0x0f
 
+#define CCI500_PMU_FLAGS		(CCI_PMU_IGNORE_CNTR_WRITE)
 
 #define CCI500_GLOBAL_EVENT_EXT_ATTR_ENTRY(_name, _config) \
 	CCI_EXT_ATTR_ENTRY(_name, cci500_pmu_global_event_show, \
@@ -619,6 +625,11 @@  static ssize_t cci_pmu_event_show(struct device *dev,
 					 (unsigned long)eattr->var);
 }
 
+static inline bool pmu_ignores_cntr_write(struct cci_pmu *cci_pmu)
+{
+	return !!(cci_pmu->model->flags & CCI_PMU_IGNORE_CNTR_WRITE);
+}
+
 static int pmu_is_valid_counter(struct cci_pmu *cci_pmu, int idx)
 {
 	return 0 <= idx && idx <= CCI_PMU_CNTR_LAST(cci_pmu);
@@ -792,15 +803,33 @@  static void pmu_read(struct perf_event *event)
 void pmu_event_set_period(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
+	struct cci_pmu *cci_pmu = to_cci_pmu(event->pmu);
+	u64 val = 0;
+
 	/*
-	 * The CCI PMU counters have a period of 2^32. To account for the
-	 * possiblity of extreme interrupt latency we program for a period of
-	 * half that. Hopefully we can handle the interrupt before another 2^31
-	 * events occur and the counter overtakes its previous value.
+	 * If the PMU gates the clock to PMU event counters when the counters
+	 * are not enabled, any write to them is ineffective. Hence we cannot
+	 * set a value reliably. Also, we cannot reset an individual event
+	 * counter; PMCR.RST resets all the counters. However the existing
+	 * counter values can be read back. Hence, we use the existing counter
+	 * value as the period and set the prev_count accordingly. This is
+	 * safe, since we don't support sampling of events anyway.
 	 */
-	u64 val = 1ULL << 31;
+	if (pmu_ignores_cntr_write(cci_pmu)) {
+		val = pmu_read_counter(event);
+	} else {
+		/*
+		 * The CCI PMU counters have a period of 2^32. To account for
+		 * the possibility of extreme interrupt latency we program for
+		 * a period of half that. Hopefully we can handle the interrupt
+		 * before another 2^31 events occur and the counter overtakes
+		 * its previous value.
+		 */
+		val = 1ULL << 31;
+		pmu_write_counter(event, val);
+	}
+
 	local64_set(&hwc->prev_count, val);
-	pmu_write_counter(event, val);
 }
 
 static irqreturn_t pmu_handle_irq(int irq_num, void *dev)
@@ -1380,6 +1409,7 @@  static struct cci_pmu_model cci_pmu_models[] = {
 		.name = "CCI_500",
 		.fixed_hw_cntrs = 0,
 		.num_hw_cntrs = 8,
+		.flags = CCI500_PMU_FLAGS,
 		.cntr_size = SZ_64K,
 		.format_attrs = cci500_pmu_format_attrs,
 		.nformat_attrs = ARRAY_SIZE(cci500_pmu_format_attrs),