diff mbox

arm-cci500: Workaround pmu_event_set_period

Message ID 1443196253-22765-1-git-send-email-suzuki.poulose@arm.com (mailing list archive)
State New, archived

Commit Message

Suzuki K Poulose Sept. 25, 2015, 3:50 p.m. UTC
From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>

The CCI PMU driver sets the event counter to half of its maximum value
(2^31) before starting the counters, via pmu_event_set_period(). This
gives us the best chance of handling the overflow interrupt, even in the
face of extreme interrupt latencies.

However, CCI-500 comes with advanced power-saving schemes which disable
the clock to the event counters unless the counters are enabled to count
(PMCR.CEN). This prevents the driver from writing the period to the
counters before starting them. Also, there is no way to reset an
individual event counter to 0 (PMCR.RST resets all the counters, losing
their current readings). However, the counter value is preserved and can
be read back while the counters are disabled.

So we cannot reliably use the counters to compute the number of events
generated during the sampling period, since we don't know the counter
value at the start.

Here are the possible solutions:

 1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
    - The Control_Override_Reg is secure (and hence not programmable from
      Linux), and also has an impact on power consumption.

 2) Change the order of operations
	i.e,
	a) Program and enable individual counters
	b) Enable counting on all the counters by setting PMCR.CEN
	c) Write the period to the individual counters
	d) Disable the counters
    - This could cause unnecessary noise in the other counters and is
      costly (it would have to be repeated for every enabled counter).

 3) Don't set the counter value, instead use the current count as the
    starting count and compute the delta at the end of sampling.

This patch implements option 3 for CCI-500; CCI-400 behavior remains
unchanged. The problem didn't surface on a fast model, but was revealed
on an FPGA model. Without this patch, profiling on CCI-500 is broken
and should be fixed for 4.3.

Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: arm@kernel.org
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
---
 drivers/bus/arm-cci.c |   42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

Comments

Russell King - ARM Linux Sept. 25, 2015, 4:05 p.m. UTC | #1
As you're posting an update to CCI500, I thought I'd hijack this
thread to complain about the same thing that I complained about
with CCI400:

+config ARM_CCI500_PMU
+       bool "ARM CCI500 PMU support"
+       default y
+       depends on (ARM && CPU_V7) || ARM64
+       depends on PERF_EVENTS
+       select ARM_CCI_PMU

Why "default y"?  Not all ARMv7 CPUs have the CCI bus, and it means
that, moving forward, it's a pain to remember to turn the option off
amongst all the other Kconfig questions.

Only make things "default y" if it was something that was built in
a previous kernel and not building it in a later kernel would cause
a regression, or where it's absolutely necessary to do so.

Thanks.
Suzuki K Poulose Sept. 25, 2015, 4:09 p.m. UTC | #2
On 25/09/15 17:05, Russell King - ARM Linux wrote:
> As you're posting an update to CCI500, I thought I'd hijack this
> thread to complain about the same thing that I complained about
> with CCI400:
>
> +config ARM_CCI500_PMU
> +       bool "ARM CCI500 PMU support"
> +       default y
> +       depends on (ARM && CPU_V7) || ARM64
> +       depends on PERF_EVENTS
> +       select ARM_CCI_PMU
>
> Why "default y" ?  Not all ARMv7 CPUs have the CCI bus, and what it
> means is that when moving forward, it's a pain to remember to turn
> the option off amongst all the other Kconfig questions.
>
> Only make things "default y" if it was something that was built in
> a previous kernel and not building it in a later kernel would cause
> a regression, or where it's absolutely necessary to do so.

Sure, we can turn that off by default. I will send a patch.

Suzuki
Mark Rutland Sept. 28, 2015, 3:38 p.m. UTC | #3
On Fri, Sep 25, 2015 at 04:50:53PM +0100, Suzuki K. Poulose wrote:
> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>
> 
> The CCI PMU driver sets the event counter to the half of the maximum
> value(2^31) it can count before we start the counters via
> pmu_event_set_period(). This is done to give us the best chance to
> handle the overflow interrupt, taking care of extreme interrupt latencies.
> 
> However, CCI-500 comes with advanced power saving schemes, which disables
> the clock to the event counters unless the counters are enabled to count
> (PMCR.CEN). This prevents the driver from writing the period to the
> counters before starting them.  Also, there is no way we can reset the
> individual event counter to 0 (PMCR.RST resets all the counters, losing
> their current readings). However the value of the counter is preserved and
> could be read back, when the counters are not enabled.

Does reset work while the counters are clock gated?

I take it that we can configure the counters while they are disabled,
even if we can't alter the counter value?

> So we cannot reliably use the counters and compute the number of events
> generated during the sampling period since we don't have the value of the
> counter at start.
> 
> Here are the possible solutions:
> 
>  1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
>     - The Control_Override_Reg is secure (and hence not programmable from
>       Linux), and also has an impact on power consumption.
> 
>  2) Change the order of operations
> 	i.e,
> 	a) Program and enable individual counters
> 	b) Enable counting on all the counters by setting PMCR.CEN
> 	c) Write the period to the individual counters
> 	d) Disable the counters
>     - This could cause unnecessary noise in the other counters and is
>       costly (we should repeat this for all enabled counters).
> 
>  3) Don't set the counter value, instead use the current count as the
>     starting count and compute the delta at the end of sampling.

This leaves us less broken than we currently are, but as far as I can
tell it puts us back where we started, with the overflow issues seen
elsewhere, cf:

a737823d37666255 ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency")
5727347180ebc6b4 ("ARM: 7354/1: perf: limit sample_period to half max_period in non-sampling mode")

> @@ -792,15 +803,33 @@ static void pmu_read(struct perf_event *event)
>  void pmu_event_set_period(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
> +	struct cci_pmu *cci_pmu = to_cci_pmu(event->pmu);
> +	u64 val = 0;
> +
>  	/*
> -	 * The CCI PMU counters have a period of 2^32. To account for the
> -	 * possiblity of extreme interrupt latency we program for a period of
> -	 * half that. Hopefully we can handle the interrupt before another 2^31
> -	 * events occur and the counter overtakes its previous value.
> +	 * If the PMU gates the clock to PMU event counters when the counters
> +	 * are not enabled, any write to them is ineffective. Hence we cannot
> +	 * set a value reliably. Also, we cannot reset an individual event
> +	 * counter; PMCR.RST resets all the counters. However the existing
> +	 * counter values can be read back. Hence, we use the existing counter
> +	 * value as the period and set the prev_count accordingly. This is
> +	 * safe, since we don't support sampling of events anyway.
>  	 */
> -	u64 val = 1ULL << 31;
> +	if (pmu_ignores_cntr_write(cci_pmu)) {
> +		val = pmu_read_counter(event);
> +	} else {
> +		/*
> +		 * The CCI PMU counters have a period of 2^32. To account for
> +		 * the possibility of extreme interrupt latency we program for
> +		 * a period of half that. Hopefully we can handle the interrupt
> +		 * before another 2^31 events occur and the counter overtakes
> +		 * its previous value.
> +		 */
> +		val = 1ULL << 31;
> +		pmu_write_counter(event, val);
> +	}
> +
>  	local64_set(&hwc->prev_count, val);
> -	pmu_write_counter(event, val);
>  }

We could get rid of the flag and always do:

	pmu_write_counter(event, 1UL << 31);
	val = pmu_read_counter(event);

That should work for CCI-400 and CCI-500, and if the CCI-500 counters
aren't clock-gated (e.g. because of FW configuration) then we get the
period that we actually wanted.

However, can't we use an event which won't count to solve the race with
option 2?

Per the TRM, events for unimplemented interfaces don't count, though I'm
not sure if there's an event that's guaranteed to not count in all
configurations. Perhaps we might be able to dig one up.

Thanks,
Mark.
Suzuki K Poulose Sept. 29, 2015, 9:13 a.m. UTC | #4
On 28/09/15 16:38, Mark Rutland wrote:
> On Fri, Sep 25, 2015 at 04:50:53PM +0100, Suzuki K. Poulose wrote:
>> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com>
>>
>> The CCI PMU driver sets the event counter to the half of the maximum
>> value(2^31) it can count before we start the counters via
>> pmu_event_set_period(). This is done to give us the best chance to
>> handle the overflow interrupt, taking care of extreme interrupt latencies.
>>
>> However, CCI-500 comes with advanced power saving schemes, which disables
>> the clock to the event counters unless the counters are enabled to count
>> (PMCR.CEN). This prevents the driver from writing the period to the
>> counters before starting them.  Also, there is no way we can reset the
>> individual event counter to 0 (PMCR.RST resets all the counters, losing
>> their current readings). However the value of the counter is preserved and
>> could be read back, when the counters are not enabled.
>
> Does reset work while the counters are clock gated?

No, it doesn't. Both PMCR.CEN and Count Control must be enabled to
alter the value of the counter.

>
> I take it that we can configure the counters while they are disabled,
> even if we can't alter the counter value?

We can program the events for the counter, but we can't alter the counter
value.

>
>> So we cannot reliably use the counters and compute the number of events
>> generated during the sampling period since we don't have the value of the
>> counter at start.
>>
>> Here are the possible solutions:
>>
>>   1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
>>      - The Control_Override_Reg is secure (and hence not programmable from
>>        Linux), and also has an impact on power consumption.
>>
>>   2) Change the order of operations
>> 	i.e,
>> 	a) Program and enable individual counters
>> 	b) Enable counting on all the counters by setting PMCR.CEN
>> 	c) Write the period to the individual counters
>> 	d) Disable the counters
>>      - This could cause unnecessary noise in the other counters and is
>>        costly (we should repeat this for all enabled counters).
>>
>>   3) Don't set the counter value, instead use the current count as the
>>      starting count and compute the delta at the end of sampling.
>
> This leaves us less broken than we currently are, but as far as I can
> tell, this leaves us back where we started, with the overflow issues
> seen elsewhere, cf:
>
> a737823d37666255 ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency")
> 5727347180ebc6b4 ("ARM: 7354/1: perf: limit sample_period to half max_period in non-sampling mode")

Thanks for the pointers. Even now, without this change, we could
theoretically run into the overflow, couldn't we? So we should fix it
for good.

>> @@ -792,15 +803,33 @@ static void pmu_read(struct perf_event *event)
>>   void pmu_event_set_period(struct perf_event *event)
>>   {
>>   	struct hw_perf_event *hwc = &event->hw;
>> +	struct cci_pmu *cci_pmu = to_cci_pmu(event->pmu);
>> +	u64 val = 0;
>> +
>>   	/*
>> -	 * The CCI PMU counters have a period of 2^32. To account for the
>> -	 * possiblity of extreme interrupt latency we program for a period of
>> -	 * half that. Hopefully we can handle the interrupt before another 2^31
>> -	 * events occur and the counter overtakes its previous value.
>> +	 * If the PMU gates the clock to PMU event counters when the counters
>> +	 * are not enabled, any write to them is ineffective. Hence we cannot
>> +	 * set a value reliably. Also, we cannot reset an individual event
>> +	 * counter; PMCR.RST resets all the counters. However the existing
>> +	 * counter values can be read back. Hence, we use the existing counter
>> +	 * value as the period and set the prev_count accordingly. This is
>> +	 * safe, since we don't support sampling of events anyway.
>>   	 */
>> -	u64 val = 1ULL << 31;
>> +	if (pmu_ignores_cntr_write(cci_pmu)) {
>> +		val = pmu_read_counter(event);
>> +	} else {
>> +		/*
>> +		 * The CCI PMU counters have a period of 2^32. To account for
>> +		 * the possibility of extreme interrupt latency we program for
>> +		 * a period of half that. Hopefully we can handle the interrupt
>> +		 * before another 2^31 events occur and the counter overtakes
>> +		 * its previous value.
>> +		 */
>> +		val = 1ULL << 31;
>> +		pmu_write_counter(event, val);
>> +	}
>> +
>>   	local64_set(&hwc->prev_count, val);
>> -	pmu_write_counter(event, val);
>>   }
>
> We could get rid of the flag and always do:
>
> 	pmu_write_counter(event, 1UL << 31);
> 	val = pmu_read_counter(event);
>
> That should work for CCI-400 and CCI-500, and if the CCI-500 counters
> aren't clock-gated (e.g. because of FW configuration) then we get the
> period that we actually wanted.

Yes, that looks neater. I will change it and add some comments.

>
> However, can't we use an event which won't count to solve the race with
> option 2?

Even if we found one, wouldn't we have to set the 'no-op' event on all
the other counters to prevent them from counting while we change a
counter value? And it gets complicated with or without event grouping.

>
> Per the TRM, events for unimplemented interfaces don't count, though I'm
> not sure if there's an event that's guaranteed to not count in all
> configurations. Perhaps we might be able to dig one up.

IIRC, there was no mention of a single event which falls into that
category. I think solution 3 is much cleaner than finding and using a
dummy event.

Suzuki
Mark Rutland Sept. 29, 2015, 10:30 a.m. UTC | #5
Hi,

> > Does reset work while the counters are clock gated?
> 
> No. It doesn't. Both PMCR.CEN and Count Control should be turned on to
> alter the value of the counter.  

Ok. I take it PMCR.CEN is global and Count Control is per-counter?

> > I take it that we can configure the counters while they are disabled,
> > even if we can't alter the counter value?
> 
> We can program the events for the counter, but we can't alter the counter
> value.

Ok, that's something, at least.

> >> So we cannot reliably use the counters and compute the number of events
> >> generated during the sampling period since we don't have the value of the
> >> counter at start.
> >>
> >> Here are the possible solutions:
> >>
> >>   1) Disable clock gating on CCI-500 by setting Control_Override_Reg[bit3].
> >>      - The Control_Override_Reg is secure (and hence not programmable from
> >>        Linux), and also has an impact on power consumption.
> >>
> >>   2) Change the order of operations
> >> 	i.e,
> >> 	a) Program and enable individual counters
> >> 	b) Enable counting on all the counters by setting PMCR.CEN
> >> 	c) Write the period to the individual counters
> >> 	d) Disable the counters
> >>      - This could cause unnecessary noise in the other counters and is
> >>        costly (we should repeat this for all enabled counters).
> >>
> >>   3) Don't set the counter value, instead use the current count as the
> >>      starting count and compute the delta at the end of sampling.
> >
> > This leaves us less broken than we currently are, but as far as I can
> > tell, this leaves us back where we started, with the overflow issues
> > seen elsewhere, cf:
> >
> > a737823d37666255 ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency")
> > 5727347180ebc6b4 ("ARM: 7354/1: perf: limit sample_period to half max_period in non-sampling mode")
> 
> Thanks for the pointers. Even now, without this change, theoretically
> we could run into the overflow, isn't it ? So we should fix it for ever.

It's not really fixable as such (given sufficiently frequent events and
large enough IRQ latency there's simply nothing you can do). The current
behaviour on CCI-400 is about the best we can do, so we should aim for
that on CCI-500 if possible.

> > We could get rid of the flag and always do:
> >
> > 	pmu_write_counter(event, 1UL << 31);
> > 	val = pmu_read_counter(event);
> >
> > That should work for CCI-400 and CCI-500, and if the CCI-500 counters
> > aren't clock-gated (e.g. because of FW configuration) then we get the
> > period that we actually wanted.
> 
> Yes, thats looks neater. I will change it with some comments in there.

Great.

That still leaves us with racy overflow handling on a clock gated
CCI-500 PMU, which we should aim to avoid if possible.

> > However, can't we use an event which won't count to solve the race with
> > option 2?
> 
> Even if we found one, won't we have to set the 'no-op' event on all counters,
> if we want to change a counter value to prevent other counters from counting?

Can't we just clear the counter_enable bits for the other counters?

We could handle that in pmu_{disable,enable} to amortize the cost (or
even do it lazily on the first re-programming).

> And it gets complicated with/without event grouping.

I don't see why grouping would be any more complicated by this. What am
I missing?

> > Per the TRM, events for unimplemented interfaces don't count, though I'm
> > not sure if there's an event that's guaranteed to not count in all
> > configurations. Perhaps we might be able to dig one up.
> 
> IIRC, there wasn't any mention about a single event which falls into that
> category. I think solution 3 is much cleaner than finding and using a dummy
> event.

I agree that it is simpler, but it leaves us with a worryingly high
chance of coming up with erroneous values, which detracts from the
usefulness of the PMU.

Thanks,
Mark.

Patch

diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
index 577cc4b..7db99dc 100644
--- a/drivers/bus/arm-cci.c
+++ b/drivers/bus/arm-cci.c
@@ -88,6 +88,10 @@  static const struct of_device_id arm_cci_matches[] = {
 #define CCI_PMU_MAX_HW_CNTRS(model) \
 	((model)->num_hw_cntrs + (model)->fixed_hw_cntrs)
 
+/*	 CCI PMU flags	 */
+/* Writes to CNTR are ignored, see pmu_event_set_period() */
+#define CCI_PMU_IGNORE_CNTR_WRITE	(1 << 0)
+
 /* Types of interfaces that can generate events */
 enum {
 	CCI_IF_SLAVE,
@@ -118,6 +122,7 @@  struct cci_pmu;
  */
 struct cci_pmu_model {
 	char *name;
+	u32 flags;
 	u32 fixed_hw_cntrs;
 	u32 num_hw_cntrs;
 	u32 cntr_size;
@@ -472,6 +477,7 @@  static inline struct cci_pmu_model *probe_cci_model(struct platform_device *pdev
 #define CCI500_GLOBAL_PORT_MIN_EV	0x00
 #define CCI500_GLOBAL_PORT_MAX_EV	0x0f
 
+#define CCI500_PMU_FLAGS		(CCI_PMU_IGNORE_CNTR_WRITE)
 
 #define CCI500_GLOBAL_EVENT_EXT_ATTR_ENTRY(_name, _config) \
 	CCI_EXT_ATTR_ENTRY(_name, cci500_pmu_global_event_show, \
@@ -619,6 +625,11 @@  static ssize_t cci_pmu_event_show(struct device *dev,
 					 (unsigned long)eattr->var);
 }
 
+static inline bool pmu_ignores_cntr_write(struct cci_pmu *cci_pmu)
+{
+	return !!(cci_pmu->model->flags & CCI_PMU_IGNORE_CNTR_WRITE);
+}
+
 static int pmu_is_valid_counter(struct cci_pmu *cci_pmu, int idx)
 {
 	return 0 <= idx && idx <= CCI_PMU_CNTR_LAST(cci_pmu);
@@ -792,15 +803,33 @@  static void pmu_read(struct perf_event *event)
 void pmu_event_set_period(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
+	struct cci_pmu *cci_pmu = to_cci_pmu(event->pmu);
+	u64 val = 0;
+
 	/*
-	 * The CCI PMU counters have a period of 2^32. To account for the
-	 * possiblity of extreme interrupt latency we program for a period of
-	 * half that. Hopefully we can handle the interrupt before another 2^31
-	 * events occur and the counter overtakes its previous value.
+	 * If the PMU gates the clock to PMU event counters when the counters
+	 * are not enabled, any write to them is ineffective. Hence we cannot
+	 * set a value reliably. Also, we cannot reset an individual event
+	 * counter; PMCR.RST resets all the counters. However the existing
+	 * counter values can be read back. Hence, we use the existing counter
+	 * value as the period and set the prev_count accordingly. This is
+	 * safe, since we don't support sampling of events anyway.
 	 */
-	u64 val = 1ULL << 31;
+	if (pmu_ignores_cntr_write(cci_pmu)) {
+		val = pmu_read_counter(event);
+	} else {
+		/*
+		 * The CCI PMU counters have a period of 2^32. To account for
+		 * the possibility of extreme interrupt latency we program for
+		 * a period of half that. Hopefully we can handle the interrupt
+		 * before another 2^31 events occur and the counter overtakes
+		 * its previous value.
+		 */
+		val = 1ULL << 31;
+		pmu_write_counter(event, val);
+	}
+
 	local64_set(&hwc->prev_count, val);
-	pmu_write_counter(event, val);
 }
 
 static irqreturn_t pmu_handle_irq(int irq_num, void *dev)
@@ -1380,6 +1409,7 @@  static struct cci_pmu_model cci_pmu_models[] = {
 		.name = "CCI_500",
 		.fixed_hw_cntrs = 0,
 		.num_hw_cntrs = 8,
+		.flags = CCI500_PMU_FLAGS,
 		.cntr_size = SZ_64K,
 		.format_attrs = cci500_pmu_format_attrs,
 		.nformat_attrs = ARRAY_SIZE(cci500_pmu_format_attrs),