Message ID: 1458333422-8963-1-git-send-email-agustinv@codeaurora.org
State: New, archived
On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote:
> This adds a new dynamic PMU to the Perf Events framework to program
> and control the L3 cache PMUs in some Qualcomm Technologies SOCs.
>
> The driver supports a distributed cache architecture where the overall
> cache is comprised of multiple slices each with its own PMU. The driver
> aggregates counts across the whole system to provide a global picture
> of the metrics selected by the user.

So is there never a situation where you want to profile just a single
slice?

Is userspace at all aware of these slices through other means?

That is; typically we do not aggregate in-kernel like this but simply
expose each slice as a separate PMU and let userspace sort things.
On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote:
> This adds a new dynamic PMU to the Perf Events framework to program
> and control the L3 cache PMUs in some Qualcomm Technologies SOCs.
>
> The driver supports a distributed cache architecture where the overall
> cache is comprised of multiple slices each with its own PMU. The driver
> aggregates counts across the whole system to provide a global picture
> of the metrics selected by the user.
>
> The driver exports formatting and event information to sysfs so it can
> be used by the perf user space tools with the syntaxes:
>    perf stat -a -e l3cache/read-miss/
>    perf stat -a -e l3cache/event=0x21/
>
> Signed-off-by: Agustin Vega-Frias <agustinv@codeaurora.org>
> ---
>  arch/arm64/kernel/Makefile                   |   4 +
>  arch/arm64/kernel/perf_event_qcom_l3_cache.c | 816 +++++++++++++++++++++++++++
>  2 files changed, 820 insertions(+)
>  create mode 100644 arch/arm64/kernel/perf_event_qcom_l3_cache.c

Move this to drivers/bus (where the CCI and CCN PMU drivers live), or
drivers/perf (where some common infrastructure lives).

This isn't architectural, and isn't CPU-specific, so it has no reason to
live in arch/arm64.

> +static
> +int qcom_l3_cache__event_init(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (event->attr.type != l3cache_pmu.pmu.type)
> +		return -ENOENT;
> +
> +	/*
> +	 * There are no per-counter mode filters in the PMU.
> +	 */
> +	if (event->attr.exclude_user || event->attr.exclude_kernel ||
> +	    event->attr.exclude_hv || event->attr.exclude_idle)
> +		return -EINVAL;
> +
> +	hwc->idx = -1;
> +
> +	/*
> +	 * Sampling not supported since these events are not core-attributable.
> +	 */
> +	if (hwc->sample_period)
> +		return -EINVAL;
> +
> +	/*
> +	 * Task mode not available, we run the counters as system counters,
> +	 * not attributable to any CPU and therefore cannot attribute per-task.
> +	 */
> +	if (event->cpu < 0)
> +		return -EINVAL;
> +
> +	return 0;
> +}

Please follow what the other system PMUs do and (forcefully) ensure that
all events share the same CPU in event_init.

Otherwise, events can exist in multiple percpu contexts, and management
thereof can race on things like pmu_{enable,disable}.

You'll also want to expose a cpumask to userspace for that, to ensure
that it only opens events on a single CPU.

For an example, see drivers/bus/arm-ccn.c. That also handles migrating
events to a new CPU upon hotplug.

> +static
> +int qcom_l3_cache__event_add(struct perf_event *event, int flags)
> +{
> +	struct l3cache_pmu *system = to_l3cache_pmu(event->pmu);
> +	struct hw_perf_event *hwc = &event->hw;
> +	int idx;
> +	int prev_cpu;
> +	int err = 0;
> +
> +	/*
> +	 * We need to disable the pmu while adding the event, otherwise
> +	 * the perf tick might kick-in and re-add this event.
> +	 */
> +	perf_pmu_disable(event->pmu);
> +
> +	/*
> +	 * This ensures all events are on the same CPU context. No need to open
> +	 * these on all CPUs since they are system events. The strategy here is
> +	 * to set system->cpu when the first event is created and from that
> +	 * point on, only events in the same CPU context will be allowed to be
> +	 * active. system->cpu will get reset back to -1 when the last event
> +	 * is deleted, please see qcom_l3_cache__event_del below.
> +	 */
> +	prev_cpu = atomic_cmpxchg(&system->cpu, -1, event->cpu);
> +	if ((event->cpu != prev_cpu) && (prev_cpu != -1)) {
> +		err = -EAGAIN;
> +		goto out;
> +	}

As above, handle this in event_init, as other system PMUs do.

Handling it here is rather late, and unnecessarily fragile.

> +/*
> + * In some platforms interrupt resources might not come directly from the GIC,
> + * but from separate IRQ circuitry that signals a summary IRQ to the GIC and
> + * is handled by a secondary IRQ controller. This is problematic under ACPI boot
> + * because the ACPI core does not use the Resource Source field on the Extended
> + * Interrupt Descriptor, which in theory could be used to specify an alternative
> + * IRQ controller.
> +
> + * For this reason in these platforms we implement the secondary IRQ controller
> + * using the gpiolib and specify the IRQs as GpioInt resources, so when getting
> + * an IRQ from the device we first try platform_get_irq and if it fails we try
> + * devm_gpiod_get_index/gpiod_to_irq.
> + */
> +static
> +int qcom_l3_cache_pmu_get_slice_irq(struct platform_device *pdev,
> +				    struct platform_device *sdev)
> +{
> +	int virq = platform_get_irq(sdev, 0);
> +	struct gpio_desc *desc;
> +
> +	if (virq >= 0)
> +		return virq;
> +
> +	desc = devm_gpiod_get_index(&sdev->dev, NULL, 0, GPIOD_ASIS);
> +	if (IS_ERR(desc))
> +		return -ENOENT;
> +
> +	return gpiod_to_irq(desc);
> +}
> +

As Marc Zyngier pointed out in another thread, you should represent your
interrupt controller as an interrupt controller rather than pretending
it is a GPIO controller.

Drivers should be able to remain blissfully unaware what the other end
of their interrupt line is wired up to, and shouldn't have to jump
through hoops like the above.

Thanks,
Mark.
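For reference, a minimal sketch of the scheme Mark describes, loosely
modelled on drivers/bus/arm-ccn.c. The cpumask field, the attribute name
and the assumption that it is picked at probe time are illustrative and
not part of the posted driver:

/*
 * Sketch only: a "cpumask" sysfs attribute plus a forced CPU in
 * event_init, loosely following drivers/bus/arm-ccn.c.  The assumed
 * l3cache_pmu.cpumask field would be chosen at probe time and updated
 * on CPU hotplug; the attribute would be added to the PMU's attr_groups.
 */
static ssize_t l3cache_pmu_cpumask_show(struct device *dev,
					struct device_attribute *attr,
					char *buf)
{
	/* perf_pmu_register() sets the PMU device's drvdata to the struct pmu */
	struct l3cache_pmu *pmu = to_l3cache_pmu(dev_get_drvdata(dev));

	return cpumap_print_to_pagebuf(true, buf, &pmu->cpumask);
}
static DEVICE_ATTR(cpumask, S_IRUGO, l3cache_pmu_cpumask_show, NULL);

static int qcom_l3_cache__event_init(struct perf_event *event)
{
	struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);

	if (event->attr.type != pmu->pmu.type)
		return -ENOENT;

	/* ... same filter/sampling checks as the posted driver ... */

	if (event->cpu < 0)
		return -EINVAL;

	/*
	 * Funnel every event onto the one CPU advertised via "cpumask",
	 * so all events live in a single percpu context.
	 */
	event->cpu = cpumask_first(&pmu->cpumask);

	return 0;
}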
On Mon, Mar 21, 2016 at 10:35:08AM +0000, Mark Rutland wrote:
> On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote:
> > This adds a new dynamic PMU to the Perf Events framework to program
> > and control the L3 cache PMUs in some Qualcomm Technologies SOCs.
> >
> > The driver supports a distributed cache architecture where the overall
> > cache is comprised of multiple slices each with its own PMU. The driver
> > aggregates counts across the whole system to provide a global picture
> > of the metrics selected by the user.
> >
> > The driver exports formatting and event information to sysfs so it can
> > be used by the perf user space tools with the syntaxes:
> >    perf stat -a -e l3cache/read-miss/
> >    perf stat -a -e l3cache/event=0x21/
> >
> > Signed-off-by: Agustin Vega-Frias <agustinv@codeaurora.org>
> > ---
> >  arch/arm64/kernel/Makefile                   |   4 +
> >  arch/arm64/kernel/perf_event_qcom_l3_cache.c | 816 +++++++++++++++++++++++++++
> >  2 files changed, 820 insertions(+)
> >  create mode 100644 arch/arm64/kernel/perf_event_qcom_l3_cache.c
>
> Move this to drivers/bus (where the CCI and CCN PMU drivers live), or
> drivers/perf (where some common infrastructure lives).

Please stick to drivers/perf, as I have a vague plan to move the CCI/CCN
PMU code out of drivers/bus and into drivers/perf (which didn't exist
when they were originally upstreamed).

Will
On Mon, Mar 21, 2016 at 10:35:08AM +0000, Mark Rutland wrote:
> > +static
> > +int qcom_l3_cache__event_add(struct perf_event *event, int flags)
> > +{
> > +	struct l3cache_pmu *system = to_l3cache_pmu(event->pmu);
> > +	struct hw_perf_event *hwc = &event->hw;
> > +	int idx;
> > +	int prev_cpu;
> > +	int err = 0;
> > +
> > +	/*
> > +	 * We need to disable the pmu while adding the event, otherwise
> > +	 * the perf tick might kick-in and re-add this event.
> > +	 */
> > +	perf_pmu_disable(event->pmu);

Why did you write that? If you really need this you did something
seriously wrong elsewhere, because:

kernel/events/core.c:event_sched_in() is the only place calling
pmu::add() and that explicitly already does this.
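For context, event_sched_in() already brackets the callback with
perf_pmu_disable()/perf_pmu_enable(), so ->add() can be trimmed to
something along these lines. This is a sketch built only from helpers
that appear in the posted patch, with the redundant disable and the
per-CPU cmpxchg (which the reviewers want moved to event_init) dropped:

/*
 * Sketch only: ->add() is always called from event_sched_in(), which has
 * already done perf_pmu_disable(pmu), so the callback just claims a
 * counter and optionally starts the event.
 */
static int qcom_l3_cache__event_add(struct perf_event *event, int flags)
{
	struct l3cache_pmu *system = to_l3cache_pmu(event->pmu);
	struct hw_perf_event *hwc = &event->hw;
	int idx;

	/* Allocate a free counter, or bail if they are all in use. */
	idx = qcom_l3_cache__get_event_idx(system);
	if (idx < 0)
		return idx;

	hwc->idx = idx;
	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
	system->events[idx] = event;

	if (flags & PERF_EF_START)
		qcom_l3_cache__event_start(event, flags);

	/* Propagate changes to the userspace mapping. */
	perf_event_update_userpage(event);
	return 0;
}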
On 2016-03-21 05:04, Peter Zijlstra wrote:
> On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote:
>> This adds a new dynamic PMU to the Perf Events framework to program
>> and control the L3 cache PMUs in some Qualcomm Technologies SOCs.
>>
>> The driver supports a distributed cache architecture where the overall
>> cache is comprised of multiple slices each with its own PMU. The driver
>> aggregates counts across the whole system to provide a global picture
>> of the metrics selected by the user.
>
> So is there never a situation where you want to profile just a single
> slice?

No, access to each individual slice is determined by hashing based on
the target address.

> Is userspace at all aware of these slices through other means?

Userspace is not aware of the actual topology.

> That is; typically we do not aggregate in-kernel like this but simply
> expose each slice as a separate PMU and let userspace sort things.

My decision of single vs. multiple PMUs was based on reducing the
overhead required of retrieving the system-wide counts, which would
require multiple system calls in the multiple-PMU case.
On Mon, Mar 21, 2016 at 11:56:59AM -0400, agustinv@codeaurora.org wrote:
> On 2016-03-21 05:04, Peter Zijlstra wrote:
> > On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote:
> > > This adds a new dynamic PMU to the Perf Events framework to program
> > > and control the L3 cache PMUs in some Qualcomm Technologies SOCs.
> > >
> > > The driver supports a distributed cache architecture where the overall
> > > cache is comprised of multiple slices each with its own PMU. The driver
> > > aggregates counts across the whole system to provide a global picture
> > > of the metrics selected by the user.
> >
> > So is there never a situation where you want to profile just a single
> > slice?
>
> No, access to each individual slice is determined by hashing based on the
> target address.
>
> > Is userspace at all aware of these slices through other means?
>
> Userspace is not aware of the actual topology.
>
> > That is; typically we do not aggregate in-kernel like this but simply
> > expose each slice as a separate PMU and let userspace sort things.
>
> My decision of single vs. multiple PMUs was based on reducing the overhead
> required of retrieving the system-wide counts, which would require multiple
> system calls in the multiple-PMU case.

OK. A bit weird your hardware has a PMU per slice if it's otherwise
completely hidden. In any case, put a comment somewhere describing how
access to a single slice never makes sense.
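One possible wording for the comment Peter asks for, based on the
explanation above (illustrative only, not taken from the patch):

/*
 * The L3 is a single logical cache interleaved across several slices by
 * a hash of the target address.  Software cannot steer an access to a
 * particular slice, so per-slice counts are not attributable to anything
 * userspace could reason about; the driver therefore always programs all
 * slices identically and exposes one aggregated PMU.
 */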
On 2016-03-21 06:35, Mark Rutland wrote: > On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote: >> This adds a new dynamic PMU to the Perf Events framework to program >> and control the L3 cache PMUs in some Qualcomm Technologies SOCs. >> >> The driver supports a distributed cache architecture where the overall >> cache is comprised of multiple slices each with its own PMU. The >> driver >> aggregates counts across the whole system to provide a global picture >> of the metrics selected by the user. >> >> The driver exports formatting and event information to sysfs so it can >> be used by the perf user space tools with the syntaxes: >> perf stat -a -e l3cache/read-miss/ >> perf stat -a -e l3cache/event=0x21/ >> >> Signed-off-by: Agustin Vega-Frias <agustinv@codeaurora.org> >> --- >> arch/arm64/kernel/Makefile | 4 + >> arch/arm64/kernel/perf_event_qcom_l3_cache.c | 816 >> +++++++++++++++++++++++++++ >> 2 files changed, 820 insertions(+) >> create mode 100644 arch/arm64/kernel/perf_event_qcom_l3_cache.c > > Move this to drivers/bus (where the CCI and CCN PMU drivers live), or > drivers/perf (where some common infrastructure lives). > > This isn't architectural, and isn't CPU-specific, so it has no reason > to > live in arch/arm64. > I will move it to drivers/perf. Thanks. >> +static >> +int qcom_l3_cache__event_init(struct perf_event *event) >> +{ >> + struct hw_perf_event *hwc = &event->hw; >> + >> + if (event->attr.type != l3cache_pmu.pmu.type) >> + return -ENOENT; >> + >> + /* >> + * There are no per-counter mode filters in the PMU. >> + */ >> + if (event->attr.exclude_user || event->attr.exclude_kernel || >> + event->attr.exclude_hv || event->attr.exclude_idle) >> + return -EINVAL; >> + >> + hwc->idx = -1; >> + >> + /* >> + * Sampling not supported since these events are not >> core-attributable. >> + */ >> + if (hwc->sample_period) >> + return -EINVAL; >> + >> + /* >> + * Task mode not available, we run the counters as system counters, >> + * not attributable to any CPU and therefore cannot attribute >> per-task. >> + */ >> + if (event->cpu < 0) >> + return -EINVAL; >> + >> + return 0; >> +} > > Please follow what the other system PMUs do and (forcefully) ensure > that all > events share the same CPU in event_init. > > Otherwise, events can exist in multiple percpu contexts, and management > thereof can race on things line pmu_{enable,disable}. > > You'll also want to expose a cpumask to userspace for that, to ensure > that it > only opens events on a single CPU. > > For an example, see drivers/bus/arm-ccn.c. That also handles migrating > events to a new CPU upon hotplug. Understood. I will look at the CCN implementation as reference, > >> +static >> +int qcom_l3_cache__event_add(struct perf_event *event, int flags) >> +{ >> + struct l3cache_pmu *system = to_l3cache_pmu(event->pmu); >> + struct hw_perf_event *hwc = &event->hw; >> + int idx; >> + int prev_cpu; >> + int err = 0; >> + >> + /* >> + * We need to disable the pmu while adding the event, otherwise >> + * the perf tick might kick-in and re-add this event. >> + */ >> + perf_pmu_disable(event->pmu); >> + >> + /* >> + * This ensures all events are on the same CPU context. No need to >> open >> + * these on all CPUs since they are system events. The strategy here >> is >> + * to set system->cpu when the first event is created and from that >> + * point on, only events in the same CPU context will be allowed to >> be >> + * active. 
system->cpu will get reset back to -1 when the last event >> + * is deleted, please see qcom_l3_cache__event_del below. >> + */ >> + prev_cpu = atomic_cmpxchg(&system->cpu, -1, event->cpu); >> + if ((event->cpu != prev_cpu) && (prev_cpu != -1)) { >> + err = -EAGAIN; >> + goto out; >> + } > > As above, handle this in event_init, as other system PMUs do. > > Handling it here is rather late, and unnecessarily fragile. Understood. I will look at the CCN implementation as reference, > >> +/* >> + * In some platforms interrupt resources might not come directly from >> the GIC, >> + * but from separate IRQ circuitry that signals a summary IRQ to the >> GIC and >> + * is handled by a secondary IRQ controller. This is problematic >> under ACPI boot >> + * because the ACPI core does not use the Resource Source field on >> the Extended >> + * Interrupt Descriptor, which in theory could be used to specify an >> alternative >> + * IRQ controller. >> + >> + * For this reason in these platforms we implement the secondary IRQ >> controller >> + * using the gpiolib and specify the IRQs as GpioInt resources, so >> when getting >> + * an IRQ from the device we first try platform_get_irq and if it >> fails we try >> + * devm_gpiod_get_index/gpiod_to_irq. >> + */ >> +static >> +int qcom_l3_cache_pmu_get_slice_irq(struct platform_device *pdev, >> + struct platform_device *sdev) >> +{ >> + int virq = platform_get_irq(sdev, 0); >> + struct gpio_desc *desc; >> + >> + if (virq >= 0) >> + return virq; >> + >> + desc = devm_gpiod_get_index(&sdev->dev, NULL, 0, GPIOD_ASIS); >> + if (IS_ERR(desc)) >> + return -ENOENT; >> + >> + return gpiod_to_irq(desc); >> +} >> + > > As Marc Zyngier pointed out in another thread, you should represent > your > interrupt controller as an interrupt controller rather than pretending > it is a > GPIO controller. > > Drivers should be able to remain blissfully unaware what the other end > of their > interrupt line is wired up to, and shouldn't have to jump through hoops > like > the above. Understood. We need to close the loop with Rafael J. Wysocki w.r.t. ACPI support for multiple regular IRQ controllers similar to DT. > > Thanks, > Mark. Thanks, Agustin
On 2016-03-21 08:04, Peter Zijlstra wrote:
> On Mon, Mar 21, 2016 at 10:35:08AM +0000, Mark Rutland wrote:
>> > +static
>> > +int qcom_l3_cache__event_add(struct perf_event *event, int flags)
>> > +{
>> > +	struct l3cache_pmu *system = to_l3cache_pmu(event->pmu);
>> > +	struct hw_perf_event *hwc = &event->hw;
>> > +	int idx;
>> > +	int prev_cpu;
>> > +	int err = 0;
>> > +
>> > +	/*
>> > +	 * We need to disable the pmu while adding the event, otherwise
>> > +	 * the perf tick might kick-in and re-add this event.
>> > +	 */
>> > +	perf_pmu_disable(event->pmu);
>
> Why did you write that? If you really need this you did something
> seriously wrong elsewhere, because:
>
> kernel/events/core.c:event_sched_in() is the only place calling
> pmu::add() and that explicitly already does this.

This might have been before I added the restriction that only one CPU
can open the events, but I will double-check and remove this as it is
unnecessary.

Thanks,
Agustin
On 2016-03-21 06:35, Mark Rutland wrote:
> On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote:
>> +/*
>> + * In some platforms interrupt resources might not come directly from the GIC,
>> + * but from separate IRQ circuitry that signals a summary IRQ to the GIC and
>> + * is handled by a secondary IRQ controller. This is problematic under ACPI boot
>> + * because the ACPI core does not use the Resource Source field on the Extended
>> + * Interrupt Descriptor, which in theory could be used to specify an alternative
>> + * IRQ controller.
>> +
>> + * For this reason in these platforms we implement the secondary IRQ controller
>> + * using the gpiolib and specify the IRQs as GpioInt resources, so when getting
>> + * an IRQ from the device we first try platform_get_irq and if it fails we try
>> + * devm_gpiod_get_index/gpiod_to_irq.
>> + */
>> +static
>> +int qcom_l3_cache_pmu_get_slice_irq(struct platform_device *pdev,
>> +				    struct platform_device *sdev)
>> +{
>> +	int virq = platform_get_irq(sdev, 0);
>> +	struct gpio_desc *desc;
>> +
>> +	if (virq >= 0)
>> +		return virq;
>> +
>> +	desc = devm_gpiod_get_index(&sdev->dev, NULL, 0, GPIOD_ASIS);
>> +	if (IS_ERR(desc))
>> +		return -ENOENT;
>> +
>> +	return gpiod_to_irq(desc);
>> +}
>> +
>
> As Marc Zyngier pointed out in another thread, you should represent your
> interrupt controller as an interrupt controller rather than pretending
> it is a GPIO controller.
>
> Drivers should be able to remain blissfully unaware what the other end
> of their interrupt line is wired up to, and shouldn't have to jump
> through hoops like the above.
>
> Thanks,
> Mark.

Given that this driver is ACPI-only we are leaning toward implementing
overflow signalling as ACPI events. This will hide these details from
the driver and use standard ACPI APIs.

Thoughts?

Thanks,
Agustin
On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote:
> +static int qcom_l3_cache_pmu_probe(struct platform_device *pdev)
> +{
> +	int result;
> +
> +	INIT_LIST_HEAD(&l3cache_pmu.pmus);
> +
> +	atomic_set(&l3cache_pmu.cpu, -1);
> +	l3cache_pmu.pmu = (struct pmu) {
> +		.task_ctx_nr	= perf_hw_context,

This cannot be right. There should only be a single perf_hw_context
driver in the system and that is typically the core pmu.

Also, since this is an L3 PMU, it seems very unlikely that these events
are actually per logical CPU, which is required for per-task events.

If these events are indeed not per logical CPU but per L3 cluster, you
should designate a single logical CPU per cluster to handle these
events.

See for example arch/x86/events/intel/rapl.c for a relatively simple
PMU driver that has similar constraints.

> +		.name		= "l3cache",
> +		.pmu_enable	= qcom_l3_cache__pmu_enable,
> +		.pmu_disable	= qcom_l3_cache__pmu_disable,
> +		.event_init	= qcom_l3_cache__event_init,
> +		.add		= qcom_l3_cache__event_add,
> +		.del		= qcom_l3_cache__event_del,
> +		.start		= qcom_l3_cache__event_start,
> +		.stop		= qcom_l3_cache__event_stop,
> +		.read		= qcom_l3_cache__event_read,
> +
> +		.event_idx	= dummy_event_idx,

perf_event_idx_default() wasn't good enough?

> +		.attr_groups	= qcom_l3_cache_pmu_attr_grps,
> +	};
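A sketch of the registration along the lines Peter suggests: system/uncore
PMUs normally use perf_invalid_context (no per-task counting), and leaving
.event_idx unset lets perf_pmu_register() fall back to
perf_event_idx_default(). This reflects the review comments, not the
posted patch:

	l3cache_pmu.pmu = (struct pmu) {
		.task_ctx_nr	= perf_invalid_context,

		.name		= "l3cache",
		.pmu_enable	= qcom_l3_cache__pmu_enable,
		.pmu_disable	= qcom_l3_cache__pmu_disable,
		.event_init	= qcom_l3_cache__event_init,
		.add		= qcom_l3_cache__event_add,
		.del		= qcom_l3_cache__event_del,
		.start		= qcom_l3_cache__event_start,
		.stop		= qcom_l3_cache__event_stop,
		.read		= qcom_l3_cache__event_read,

		/* .event_idx left NULL: the core installs perf_event_idx_default */

		.attr_groups	= qcom_l3_cache_pmu_attr_grps,
	};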
On 2016-03-23 06:29, Marc Zyngier wrote: > On 22/03/16 18:33, agustinv@codeaurora.org wrote: >> On 2016-03-21 06:35, Mark Rutland wrote: >>> On Fri, Mar 18, 2016 at 04:37:02PM -0400, Agustin Vega-Frias wrote: >>>> +/* >>>> + * In some platforms interrupt resources might not come directly >>>> from >>>> the GIC, >>>> + * but from separate IRQ circuitry that signals a summary IRQ to >>>> the >>>> GIC and >>>> + * is handled by a secondary IRQ controller. This is problematic >>>> under ACPI boot >>>> + * because the ACPI core does not use the Resource Source field on >>>> the Extended >>>> + * Interrupt Descriptor, which in theory could be used to specify >>>> an >>>> alternative >>>> + * IRQ controller. >>>> + >>>> + * For this reason in these platforms we implement the secondary >>>> IRQ >>>> controller >>>> + * using the gpiolib and specify the IRQs as GpioInt resources, so >>>> when getting >>>> + * an IRQ from the device we first try platform_get_irq and if it >>>> fails we try >>>> + * devm_gpiod_get_index/gpiod_to_irq. >>>> + */ >>>> +static >>>> +int qcom_l3_cache_pmu_get_slice_irq(struct platform_device *pdev, >>>> + struct platform_device *sdev) >>>> +{ >>>> + int virq = platform_get_irq(sdev, 0); >>>> + struct gpio_desc *desc; >>>> + >>>> + if (virq >= 0) >>>> + return virq; >>>> + >>>> + desc = devm_gpiod_get_index(&sdev->dev, NULL, 0, GPIOD_ASIS); >>>> + if (IS_ERR(desc)) >>>> + return -ENOENT; >>>> + >>>> + return gpiod_to_irq(desc); >>>> +} >>>> + >>> >>> As Marc Zyngier pointed out in another thread, you should represent >>> your >>> interrupt controller as an interrupt controller rather than >>> pretending >>> it is a >>> GPIO controller. >>> >>> Drivers should be able to remain blissfully unaware what the other >>> end >>> of their >>> interrupt line is wired up to, and shouldn't have to jump through >>> hoops >>> like >>> the above. >>> >>> Thanks, >>> Mark. >> >> Given that this driver is ACPI-only we are leaning toward implementing >> overflow signalling as ACPI events. >> This will hide these details from the driver and use standard ACPI >> APIs. >> >> Thoughts? > > Please don't do that. The HW (whatever that is, and whatever the > firmware is) provides you with an interrupt line, not an ACPI event. > "Hiding the details" is always the wrong thing to do. How do we cope > with CPU affinity, for example? How do we selectively disable this > interrupt at the controller level? > > Also, ACPI events will be signalled by (guess what?) an interrupt. So > what are we actually gaining here? > > I'd *really* advise you to stick to existing abstractions, as they: > - accurately describe the way the HW works > - are already supported in the kernel > > The existing shortcomings of the ACPI layer should be addressed pretty > easily (if the feature is not supported on x86, let's find out why - we > can make it an arm64 feature if we really have to). > > Thanks, > > M. > -- > Jazz is not dead. It just smells funny... ACPI events *are* existing abstractions within ACPI-based systems and the mechanism works across OSes. If there was a one to one relationship between IRQs and events this would be hiding for the sake of hiding. In this case I am proposing to demultiplex the IRQ in the ACPI layer just as a secondary IRQ driver/domain does in DT. I have reached out to Rafael and Jeremy Pieralisi w.r.t. support of multiple IRQ domains via the Resource Source field. 
On the other hand we had implemented the IRQ combiner driver as a GPIO chip because the hardware is not an interrupt controller, it is an "interrupt combiner", which is very similar to a GPIO controller. Consider how a GPIO controller works: it sends a summary interrupt when a GPIO is toggled, software reads a status register to determine which pin was toggled. These combiners do the *exact* same thing except it's based on internal chip wires instead of external pins: it sends a summary interrupt when any wire is toggled and software checks a status register to determine which wire was toggled. That's pretty similar if you ask me. Given that we need to support other operating systems representing it as a GPIO controller was the most OS agnostic way to implement something like this on ACPI based systems. Thanks, Agustin.
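A rough sketch of what the irqchip representation Marc and Mark are
asking for could look like: a chained handler on the summary GIC line
demultiplexes a status register into per-wire Linux IRQs through an
irq_domain. The struct, field and register names here are placeholders,
not the real combiner layout:

#include <linux/bitops.h>
#include <linux/io.h>
#include <linux/irq.h>
#include <linux/irqchip/chained_irq.h>
#include <linux/irqdomain.h>

#define COMBINER_STATUS		0x0	/* placeholder register offset */

struct irq_combiner {
	void __iomem		*base;
	struct irq_domain	*domain;
	unsigned int		nr_wires;
};

/* Chained handler installed on the summary line with
 * irq_set_chained_handler_and_data(summary_irq, combiner_handle_irq, comb).
 */
static void combiner_handle_irq(struct irq_desc *desc)
{
	struct irq_combiner *comb = irq_desc_get_handler_data(desc);
	struct irq_chip *chip = irq_desc_get_chip(desc);
	unsigned long status;
	unsigned int hwirq;

	chained_irq_enter(chip, desc);

	/* One status bit per combined wire; fan out to the mapped virqs. */
	status = readl_relaxed(comb->base + COMBINER_STATUS);
	for_each_set_bit(hwirq, &status, comb->nr_wires)
		generic_handle_irq(irq_find_mapping(comb->domain, hwirq));

	chained_irq_exit(chip, desc);
}

With this in place the PMU driver would simply call platform_get_irq()
and remain unaware of the combiner, which is the point of the review
comment.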
On Wed, Mar 23, 2016 at 08:36:14AM -0400, agustinv@codeaurora.org wrote:
> ACPI events *are* existing abstractions within ACPI-based systems and the
> mechanism works across OSes.

Correction; it never works anywhere. It's a nice idea on paper, but when
you're stuck with firmware that hands you random values pulled from a
crack monkey's behind, which you then have to hack around, you wish you
had done the sane thing and stuffed it in code you can actually fix.
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 83cd7e6..eff5dea 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -43,6 +43,10 @@ arm64-obj-$(CONFIG_ARMV8_DEPRECATED) += armv8_deprecated.o arm64-obj-$(CONFIG_ACPI) += acpi.o arm64-obj-$(CONFIG_PARAVIRT) += paravirt.o +ifeq ($(CONFIG_ARCH_QCOM), y) +arm64-obj-$(CONFIG_HW_PERF_EVENTS) += perf_event_qcom_l3_cache.o +endif + obj-y += $(arm64-obj-y) vdso/ obj-m += $(arm64-obj-m) head-y := head.o diff --git a/arch/arm64/kernel/perf_event_qcom_l3_cache.c b/arch/arm64/kernel/perf_event_qcom_l3_cache.c new file mode 100644 index 0000000..89b5ceb --- /dev/null +++ b/arch/arm64/kernel/perf_event_qcom_l3_cache.c @@ -0,0 +1,816 @@ +/* Copyright (c) 2015-2016, The Linux Foundation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 and + * only version 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/module.h> +#include <linux/bitops.h> +#include <linux/gpio/consumer.h> +#include <linux/interrupt.h> +#include <linux/io.h> +#include <linux/irq.h> +#include <linux/irqdomain.h> +#include <linux/list.h> +#include <linux/acpi.h> +#include <linux/perf_event.h> +#include <linux/platform_device.h> + +/* + * Driver for the L3 cache PMUs in Qualcomm Technologies chips. + * + * The driver supports a distributed cache architecture where the overall + * cache is comprised of multiple slices each with its own PMU. The driver + * aggregates counts across the whole system to provide a global picture + * of the metrics selected by the user. 
+ */ + +/* + * General constants + */ + +#define L3_NUM_COUNTERS (8) +#define L3_MAX_EVTYPE (0xFF) +#define L3_MAX_PERIOD U32_MAX +#define L3_CNT_PERIOD (U32_MAX - 0xFFFF) + +/* + * Register offsets + */ + +/* Perfmon registers */ +#define L3_HML3_PM_CR 0x000 +#define L3_HML3_PM_EVCNTR(__cntr) (0x040 + ((__cntr) & 0x7) * 8) +#define L3_HML3_PM_CNTCTL(__cntr) (0x200 + ((__cntr) & 0x7) * 8) +#define L3_HML3_PM_EVTYPE(__cntr) (0x240 + ((__cntr) & 0x7) * 8) +#define L3_HML3_PM_FILTRA 0x460 +#define L3_HML3_PM_FILTRB 0x464 +#define L3_HML3_PM_FILTRC 0x468 +#define L3_HML3_PM_FILTRAM 0x470 +#define L3_HML3_PM_FILTRBM 0x474 +#define L3_HML3_PM_FILTRCM 0x478 + +/* Basic counter registers */ +#define L3_M_BC_CR 0x500 +#define L3_M_BC_SATROLL_CR 0x504 +#define L3_M_BC_CNTENSET 0x508 +#define L3_M_BC_CNTENCLR 0x50C +#define L3_M_BC_INTENSET 0x510 +#define L3_M_BC_INTENCLR 0x514 +#define L3_M_BC_GANG 0x718 +#define L3_M_BC_OVSR 0x740 + +/* + * Bit field manipulators + */ + +/* L3_HML3_PM_CR */ +#define PM_CR_RESET (0) + +/* L3_HML3_PM_XCNTCTL/L3_HML3_PM_CNTCTLx */ +#define PMCNT_RESET (0) + +/* L3_HML3_PM_EVTYPEx */ +#define EVSEL(__val) ((u32)((__val) & 0xFF)) + +/* Reset value for all the filter registers */ +#define PM_FLTR_RESET (0) + +/* L3_M_BC_CR */ +#define BC_RETRIEVAL_MODE (((u32)1) << 2) +#define BC_RESET (((u32)1) << 1) +#define BC_ENABLE ((u32)1) + +/* L3_M_BC_SATROLL_CR */ +#define BC_SATROLL_CR_RESET (0) + +/* L3_M_BC_CNTENSET */ +#define PMCNTENSET(__cntr) (((u32)1) << ((__cntr) & 0x7)) + +/* L3_M_BC_CNTENCLR */ +#define PMCNTENCLR(__cntr) (((u32)1) << ((__cntr) & 0x7)) +#define BC_CNTENCLR_RESET (0xFF) + +/* L3_M_BC_INTENSET */ +#define PMINTENSET(__cntr) (((u32)1) << ((__cntr) & 0x7)) + +/* L3_M_BC_INTENCLR */ +#define PMINTENCLR(__cntr) (((u32)1) << ((__cntr) & 0x7)) +#define BC_INTENCLR_RESET (0xFF) + +/* L3_M_BC_GANG */ +#define BC_GANG_RESET (0) + +/* L3_M_BC_OVSR */ +#define PMOVSRCLR(__cntr) (((u32)1) << ((__cntr) & 0x7)) +#define PMOVSRCLR_RESET (0xFF) + +/* + * Events + */ + +#define L3_CYCLES 0x01 +#define L3_READ_HIT 0x20 +#define L3_READ_MISS 0x21 +#define L3_READ_HIT_D 0x22 +#define L3_READ_MISS_D 0x23 +#define L3_WRITE_HIT 0x24 +#define L3_WRITE_MISS 0x25 + +/* + * The cache is made-up of one or more slices, each slice has its own PMU. + * This structure represents one of the hardware PMUs. 
+ */ +struct hml3_pmu { + struct list_head entry; + void __iomem *regs; + u32 inten; + atomic_t prev_count[L3_NUM_COUNTERS]; +}; + +static +void hml3_pmu__reset(struct hml3_pmu *pmu) +{ + int i; + + writel_relaxed(BC_RESET, pmu->regs + L3_M_BC_CR); + + /* + * Use writel for the first programming command to ensure the basic + * counter unit is stopped before proceeding + */ + writel(BC_SATROLL_CR_RESET, pmu->regs + L3_M_BC_SATROLL_CR); + writel_relaxed(BC_CNTENCLR_RESET, pmu->regs + L3_M_BC_CNTENCLR); + writel_relaxed(BC_INTENCLR_RESET, pmu->regs + L3_M_BC_INTENCLR); + writel_relaxed(BC_GANG_RESET, pmu->regs + L3_M_BC_GANG); + writel_relaxed(PMOVSRCLR_RESET, pmu->regs + L3_M_BC_OVSR); + + writel_relaxed(PM_CR_RESET, pmu->regs + L3_HML3_PM_CR); + for (i = 0; i < L3_NUM_COUNTERS; ++i) { + writel_relaxed(PMCNT_RESET, pmu->regs + L3_HML3_PM_CNTCTL(i)); + writel_relaxed(EVSEL(0), pmu->regs + L3_HML3_PM_EVTYPE(i)); + } + writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRA); + writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRB); + writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRC); + writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRAM); + writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRBM); + writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRCM); + pmu->inten = 0; +} + +static inline +void hml3_pmu__init(struct hml3_pmu *pmu, void __iomem *regs) +{ + pmu->regs = regs; + hml3_pmu__reset(pmu); + + /* + * Use writel here to ensure all programming commands are done + * before proceeding + */ + writel(BC_ENABLE, pmu->regs + L3_M_BC_CR); +} + +static inline +void hml3_pmu__deinit(struct hml3_pmu *pmu) +{ + hml3_pmu__reset(pmu); +} + +static inline +void hml3_pmu__enable(struct hml3_pmu *pmu) +{ + writel_relaxed(BC_ENABLE, pmu->regs + L3_M_BC_CR); +} + +static inline +void hml3_pmu__disable(struct hml3_pmu *pmu) +{ + writel_relaxed(0, pmu->regs + L3_M_BC_CR); +} + +static inline +void hml3_pmu__counter_set_event(struct hml3_pmu *pmu, u8 cntr, u32 event) +{ + writel_relaxed(EVSEL(event), pmu->regs + L3_HML3_PM_EVTYPE(cntr)); +} + +static inline +void hml3_pmu__counter_set_value(struct hml3_pmu *pmu, u8 cntr, u32 value) +{ + writel_relaxed(value, pmu->regs + L3_HML3_PM_EVCNTR(cntr)); +} + +static inline +u32 hml3_pmu__counter_get_value(struct hml3_pmu *pmu, u8 cntr) +{ + return readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(cntr)); +} + +static inline +void hml3_pmu__counter_enable(struct hml3_pmu *pmu, u8 cntr) +{ + writel_relaxed(PMCNTENSET(cntr), pmu->regs + L3_M_BC_CNTENSET); +} + +static inline +void hml3_pmu__counter_reset_trigger(struct hml3_pmu *pmu, u8 cntr) +{ + writel_relaxed(PMCNT_RESET, pmu->regs + L3_HML3_PM_CNTCTL(cntr)); +} + +static inline +void hml3_pmu__counter_disable(struct hml3_pmu *pmu, u8 cntr) +{ + writel_relaxed(PMCNTENCLR(cntr), pmu->regs + L3_M_BC_CNTENCLR); +} + +static inline +void hml3_pmu__counter_enable_interrupt(struct hml3_pmu *pmu, u8 cntr) +{ + writel_relaxed(PMINTENSET(cntr), pmu->regs + L3_M_BC_INTENSET); + pmu->inten |= PMINTENSET(cntr); +} + +static inline +void hml3_pmu__counter_disable_interrupt(struct hml3_pmu *pmu, u8 cntr) +{ + writel_relaxed(PMINTENCLR(cntr), pmu->regs + L3_M_BC_INTENCLR); + pmu->inten &= ~(PMINTENCLR(cntr)); +} + +static inline +u32 hml3_pmu__getreset_ovsr(struct hml3_pmu *pmu) +{ + u32 result = readl_relaxed(pmu->regs + L3_M_BC_OVSR); + + writel_relaxed(result, pmu->regs + L3_M_BC_OVSR); + return result; +} + +static inline +int hml3_pmu__has_overflowed(u32 ovsr) +{ + return (ovsr & 
PMOVSRCLR_RESET) != 0; +} + +static inline +int hml3_pmu__counter_has_overflowed(u32 ovsr, u8 cntr) +{ + return (ovsr & PMOVSRCLR(cntr)) != 0; +} + +/* + * Decoding of settings from perf_event_attr + * + * The config format for perf events is: + * - config: bits 0-7: event type + * bit 32: HW counter size requested, 0: 32 bits, 1: 64 bits + */ +static inline u32 get_event_type(struct perf_event *event) +{ + return (event->attr.config) & L3_MAX_EVTYPE; +} + +/* + * Aggregate PMU. Implements the core pmu functions and manages + * the hardware PMU, configuring each one in the same way and + * aggregating events when needed. + */ + +struct l3cache_pmu { + u32 num_pmus; + atomic_t cpu; + struct list_head pmus; + unsigned long used_mask[BITS_TO_LONGS(L3_NUM_COUNTERS)]; + struct perf_event *events[L3_NUM_COUNTERS]; + struct pmu pmu; +}; + +#define to_l3cache_pmu(p) (container_of(p, struct l3cache_pmu, pmu)) + +static struct l3cache_pmu l3cache_pmu = { 0 }; + +static +void qcom_l3_cache__event_update_from_slice(struct perf_event *event, + struct hml3_pmu *slice) +{ + struct hw_perf_event *hwc = &event->hw; + u32 delta, prev, now; + +again: + prev = atomic_read(&slice->prev_count[hwc->idx]); + now = hml3_pmu__counter_get_value(slice, hwc->idx); + + if (atomic_cmpxchg(&slice->prev_count[hwc->idx], prev, now) != prev) + goto again; + + delta = now - prev; + + local64_add(delta, &event->count); +} + +static +void qcom_l3_cache__slice_set_period(struct hml3_pmu *slice, int idx, u32 prev) +{ + u32 value = L3_MAX_PERIOD - (L3_CNT_PERIOD - 1); + + if (prev < value) { + value += prev; + atomic_set(&slice->prev_count[idx], value); + } else { + value = prev; + } + hml3_pmu__counter_set_value(slice, idx, value); +} + +static +int qcom_l3_cache__get_event_idx(struct l3cache_pmu *system) +{ + int idx; + + for (idx = 0; idx < L3_NUM_COUNTERS; ++idx) { + if (!test_and_set_bit(idx, system->used_mask)) + return idx; + } + + /* The counters are all in use. */ + return -EAGAIN; +} + +static +irqreturn_t qcom_l3_cache__handle_irq(int irq_num, void *data) +{ + struct hml3_pmu *slice = data; + u32 status; + int idx; + + status = hml3_pmu__getreset_ovsr(slice); + if (!hml3_pmu__has_overflowed(status)) + return IRQ_NONE; + + while (status) { + struct perf_event *event; + + idx = __ffs(status); + status &= ~(1 << idx); + event = l3cache_pmu.events[idx]; + if (!event) + continue; + + qcom_l3_cache__event_update_from_slice(event, slice); + qcom_l3_cache__slice_set_period(slice, idx, + atomic_read(&slice->prev_count[idx])); + } + + /* + * Handle the pending perf events. + * + * Note: this call *must* be run with interrupts disabled. For + * platforms that can have the PMU interrupts raised as an NMI, this + * will not work. + */ + irq_work_run(); + + return IRQ_HANDLED; +} + +/* + * Implementation of abstract pmu functionality required by + * the core perf events code. + */ + +static +void qcom_l3_cache__pmu_enable(struct pmu *pmu) +{ + struct l3cache_pmu *system = to_l3cache_pmu(pmu); + struct hml3_pmu *slice; + int idx; + + /* + * Re-write CNTCTL for all existing events to re-assert + * the start trigger. 
+ */ + for (idx = 0; idx < L3_NUM_COUNTERS; idx++) + if (system->events[idx]) + list_for_each_entry(slice, &system->pmus, entry) + hml3_pmu__counter_reset_trigger(slice, idx); + + /* Ensure all programming commands are done before proceeding */ + wmb(); + list_for_each_entry(slice, &system->pmus, entry) + hml3_pmu__enable(slice); +} + +static +void qcom_l3_cache__pmu_disable(struct pmu *pmu) +{ + struct l3cache_pmu *system = to_l3cache_pmu(pmu); + struct hml3_pmu *slice; + + list_for_each_entry(slice, &system->pmus, entry) + hml3_pmu__disable(slice); + + /* Ensure the basic counter unit is stopped before proceeding */ + wmb(); +} + +static +int qcom_l3_cache__event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + + if (event->attr.type != l3cache_pmu.pmu.type) + return -ENOENT; + + /* + * There are no per-counter mode filters in the PMU. + */ + if (event->attr.exclude_user || event->attr.exclude_kernel || + event->attr.exclude_hv || event->attr.exclude_idle) + return -EINVAL; + + hwc->idx = -1; + + /* + * Sampling not supported since these events are not core-attributable. + */ + if (hwc->sample_period) + return -EINVAL; + + /* + * Task mode not available, we run the counters as system counters, + * not attributable to any CPU and therefore cannot attribute per-task. + */ + if (event->cpu < 0) + return -EINVAL; + + return 0; +} + +static +void qcom_l3_cache__event_update(struct perf_event *event) +{ + struct l3cache_pmu *system = to_l3cache_pmu(event->pmu); + struct hml3_pmu *slice; + + list_for_each_entry(slice, &system->pmus, entry) + qcom_l3_cache__event_update_from_slice(event, slice); +} + +static +void qcom_l3_cache__event_start(struct perf_event *event, int flags) +{ + struct l3cache_pmu *system = to_l3cache_pmu(event->pmu); + struct hml3_pmu *slice; + struct hw_perf_event *hwc = &event->hw; + int idx = hwc->idx; + u32 event_type = get_event_type(event); + + hwc->state = 0; + + if (flags & PERF_EF_RELOAD) + WARN_ON(system->num_pmus != 1); + + list_for_each_entry(slice, &system->pmus, entry) { + qcom_l3_cache__slice_set_period(slice, hwc->idx, 0); + hml3_pmu__counter_set_event(slice, idx, event_type); + hml3_pmu__counter_enable_interrupt(slice, idx); + hml3_pmu__counter_enable(slice, idx); + } +} + +static +void qcom_l3_cache__event_stop(struct perf_event *event, int flags) +{ + struct l3cache_pmu *system = to_l3cache_pmu(event->pmu); + struct hml3_pmu *slice; + struct hw_perf_event *hwc = &event->hw; + + if (!(hwc->state & PERF_HES_STOPPED)) { + list_for_each_entry(slice, &system->pmus, entry) { + hml3_pmu__counter_disable_interrupt(slice, hwc->idx); + hml3_pmu__counter_disable(slice, hwc->idx); + } + + if (flags & PERF_EF_UPDATE) + qcom_l3_cache__event_update(event); + hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE; + } +} + +static +int qcom_l3_cache__event_add(struct perf_event *event, int flags) +{ + struct l3cache_pmu *system = to_l3cache_pmu(event->pmu); + struct hw_perf_event *hwc = &event->hw; + int idx; + int prev_cpu; + int err = 0; + + /* + * We need to disable the pmu while adding the event, otherwise + * the perf tick might kick-in and re-add this event. + */ + perf_pmu_disable(event->pmu); + + /* + * This ensures all events are on the same CPU context. No need to open + * these on all CPUs since they are system events. The strategy here is + * to set system->cpu when the first event is created and from that + * point on, only events in the same CPU context will be allowed to be + * active. 
system->cpu will get reset back to -1 when the last event + * is deleted, please see qcom_l3_cache__event_del below. + */ + prev_cpu = atomic_cmpxchg(&system->cpu, -1, event->cpu); + if ((event->cpu != prev_cpu) && (prev_cpu != -1)) { + err = -EAGAIN; + goto out; + } + + /* + * Try to allocate a counter. + */ + idx = qcom_l3_cache__get_event_idx(system); + if (idx < 0) { + err = idx; + goto out; + } + + hwc->idx = idx; + hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE; + system->events[idx] = event; + + if (flags & PERF_EF_START) + qcom_l3_cache__event_start(event, flags); + + /* Propagate changes to the userspace mapping. */ + perf_event_update_userpage(event); + +out: + perf_pmu_enable(event->pmu); + return err; +} + +static +void qcom_l3_cache__event_del(struct perf_event *event, int flags) +{ + struct l3cache_pmu *system = to_l3cache_pmu(event->pmu); + struct hw_perf_event *hwc = &event->hw; + + qcom_l3_cache__event_stop(event, flags | PERF_EF_UPDATE); + system->events[hwc->idx] = NULL; + clear_bit(hwc->idx, system->used_mask); + + /* + * If this is the last event, reset &system->cpu to allow the next + * event to be created in any CPU context. + */ + if (find_first_bit(system->used_mask, L3_NUM_COUNTERS) == + L3_NUM_COUNTERS) + atomic_set(&system->cpu, -1); + + perf_event_update_userpage(event); +} + +static +void qcom_l3_cache__event_read(struct perf_event *event) +{ + qcom_l3_cache__event_update(event); +} + +static +int dummy_event_idx(struct perf_event *event) +{ + return 0; +} + +/* + * Export nodes so perf user space can create events symbolically. E.g.: + * perf stat -a -e l3cache/read-miss/ ls + * perf stat -a -e l3cache/event=0x21/ ls + */ + +ssize_t l3cache_pmu_event_sysfs_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct perf_pmu_events_attr *pmu_attr; + + pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr); + return sprintf(page, "event=0x%02llx,name=%s\n", + pmu_attr->id, attr->attr.name); +} + +#define L3CACHE_EVENT_VAR(__id) pmu_event_attr_##__id +#define L3CACHE_EVENT_PTR(__id) (&L3CACHE_EVENT_VAR(__id).attr.attr) + +#define L3CACHE_EVENT_ATTR(__name, __id) \ + PMU_EVENT_ATTR(__name, L3CACHE_EVENT_VAR(__id), __id, \ + l3cache_pmu_event_sysfs_show) + + +L3CACHE_EVENT_ATTR(cycles, L3_CYCLES); +L3CACHE_EVENT_ATTR(read-hit, L3_READ_HIT); +L3CACHE_EVENT_ATTR(read-miss, L3_READ_MISS); +L3CACHE_EVENT_ATTR(read-hit-d-side, L3_READ_HIT_D); +L3CACHE_EVENT_ATTR(read-miss-d-side, L3_READ_MISS_D); +L3CACHE_EVENT_ATTR(write-hit, L3_WRITE_HIT); +L3CACHE_EVENT_ATTR(write-miss, L3_WRITE_MISS); + +static struct attribute *qcom_l3_cache_pmu_events[] = { + L3CACHE_EVENT_PTR(L3_CYCLES), + L3CACHE_EVENT_PTR(L3_READ_HIT), + L3CACHE_EVENT_PTR(L3_READ_MISS), + L3CACHE_EVENT_PTR(L3_READ_HIT_D), + L3CACHE_EVENT_PTR(L3_READ_MISS_D), + L3CACHE_EVENT_PTR(L3_WRITE_HIT), + L3CACHE_EVENT_PTR(L3_WRITE_MISS), + NULL +}; + +static struct attribute_group qcom_l3_cache_pmu_events_group = { + .name = "events", + .attrs = qcom_l3_cache_pmu_events, +}; + +PMU_FORMAT_ATTR(event, "config:0-7"); +static struct attribute *qcom_l3_cache_pmu_formats[] = { + &format_attr_event.attr, + NULL, +}; + +static struct attribute_group qcom_l3_cache_pmu_format_group = { + .name = "format", + .attrs = qcom_l3_cache_pmu_formats, +}; + +static const struct attribute_group *qcom_l3_cache_pmu_attr_grps[] = { + &qcom_l3_cache_pmu_format_group, + &qcom_l3_cache_pmu_events_group, + NULL, +}; + +/* + * Probing functions and data. 
+ */ + +/* + * In some platforms interrupt resources might not come directly from the GIC, + * but from separate IRQ circuitry that signals a summary IRQ to the GIC and + * is handled by a secondary IRQ controller. This is problematic under ACPI boot + * because the ACPI core does not use the Resource Source field on the Extended + * Interrupt Descriptor, which in theory could be used to specify an alternative + * IRQ controller. + + * For this reason in these platforms we implement the secondary IRQ controller + * using the gpiolib and specify the IRQs as GpioInt resources, so when getting + * an IRQ from the device we first try platform_get_irq and if it fails we try + * devm_gpiod_get_index/gpiod_to_irq. + */ +static +int qcom_l3_cache_pmu_get_slice_irq(struct platform_device *pdev, + struct platform_device *sdev) +{ + int virq = platform_get_irq(sdev, 0); + struct gpio_desc *desc; + + if (virq >= 0) + return virq; + + desc = devm_gpiod_get_index(&sdev->dev, NULL, 0, GPIOD_ASIS); + if (IS_ERR(desc)) + return -ENOENT; + + return gpiod_to_irq(desc); +} + +static int qcom_l3_cache_pmu_probe_slice(struct device *dev, void *data) +{ + struct platform_device *pdev = to_platform_device(dev->parent); + struct platform_device *sdev = to_platform_device(dev); + struct l3cache_pmu *system = data; + struct resource *slice_info; + void __iomem *slice_mem; + struct hml3_pmu *slice; + int irq, err; + + slice_info = platform_get_resource(sdev, IORESOURCE_MEM, 0); + slice = devm_kzalloc(&pdev->dev, sizeof(*slice), GFP_KERNEL); + if (!slice) + return -ENOMEM; + + slice_mem = devm_ioremap_resource(&pdev->dev, slice_info); + if (IS_ERR(slice_mem)) { + dev_err(&pdev->dev, "Can't map slice @%pa\n", + &slice_info->start); + return PTR_ERR(slice_mem); + } + + irq = qcom_l3_cache_pmu_get_slice_irq(pdev, sdev); + if (irq < 0) { + dev_err(&pdev->dev, + "Failed to get valid irq for slice @%pa\n", + &slice_info->start); + return irq; + } + + err = devm_request_irq(&pdev->dev, irq, qcom_l3_cache__handle_irq, 0, + "qcom-l3-cache-pmu", slice); + if (err) { + dev_err(&pdev->dev, "Request for IRQ failed for slice @%pa\n", + &slice_info->start); + return err; + } + + hml3_pmu__init(slice, slice_mem); + list_add(&slice->entry, &system->pmus); + l3cache_pmu.num_pmus++; + + return 0; +} + +static int qcom_l3_cache_pmu_probe(struct platform_device *pdev) +{ + int result; + + INIT_LIST_HEAD(&l3cache_pmu.pmus); + + atomic_set(&l3cache_pmu.cpu, -1); + l3cache_pmu.pmu = (struct pmu) { + .task_ctx_nr = perf_hw_context, + + .name = "l3cache", + .pmu_enable = qcom_l3_cache__pmu_enable, + .pmu_disable = qcom_l3_cache__pmu_disable, + .event_init = qcom_l3_cache__event_init, + .add = qcom_l3_cache__event_add, + .del = qcom_l3_cache__event_del, + .start = qcom_l3_cache__event_start, + .stop = qcom_l3_cache__event_stop, + .read = qcom_l3_cache__event_read, + + .event_idx = dummy_event_idx, + + .attr_groups = qcom_l3_cache_pmu_attr_grps, + }; + + result = device_for_each_child(&pdev->dev, &l3cache_pmu, + qcom_l3_cache_pmu_probe_slice); + + if (result < 0) + return -ENODEV; + + if (l3cache_pmu.num_pmus == 0) { + dev_err(&pdev->dev, "No hardware HML3 PMUs found\n"); + return -ENODEV; + } + + result = perf_pmu_register(&l3cache_pmu.pmu, + l3cache_pmu.pmu.name, -1); + + if (result < 0) + dev_err(&pdev->dev, + "Failed to register L3 cache PMU (%d)\n", + result); + else + dev_info(&pdev->dev, + "Registered L3 cache PMU, type: %d, using %d HW PMUs\n", + l3cache_pmu.pmu.type, l3cache_pmu.num_pmus); + + return result; +} + +static int 
qcom_l3_cache_pmu_remove(struct platform_device *pdev) +{ + perf_pmu_unregister(&l3cache_pmu.pmu); + return 0; +} + +static const struct acpi_device_id qcom_l3_cache_pmu_acpi_match[] = { + { "QCOM8080", }, + { } +}; +MODULE_DEVICE_TABLE(acpi, qcom_l3_cache_pmu_acpi_match); + +static struct platform_driver qcom_l3_cache_pmu_driver = { + .driver = { + .name = "qcom-l3cache-pmu", + .owner = THIS_MODULE, + .acpi_match_table = ACPI_PTR(qcom_l3_cache_pmu_acpi_match), + }, + .probe = qcom_l3_cache_pmu_probe, + .remove = qcom_l3_cache_pmu_remove, +}; + +static int __init register_qcom_l3_cache_pmu_driver(void) +{ + return platform_driver_register(&qcom_l3_cache_pmu_driver); +} +device_initcall(register_qcom_l3_cache_pmu_driver);
This adds a new dynamic PMU to the Perf Events framework to program
and control the L3 cache PMUs in some Qualcomm Technologies SOCs.

The driver supports a distributed cache architecture where the overall
cache is comprised of multiple slices each with its own PMU. The driver
aggregates counts across the whole system to provide a global picture
of the metrics selected by the user.

The driver exports formatting and event information to sysfs so it can
be used by the perf user space tools with the syntaxes:
   perf stat -a -e l3cache/read-miss/
   perf stat -a -e l3cache/event=0x21/

Signed-off-by: Agustin Vega-Frias <agustinv@codeaurora.org>
---
 arch/arm64/kernel/Makefile                   |   4 +
 arch/arm64/kernel/perf_event_qcom_l3_cache.c | 816 +++++++++++++++++++++++++++
 2 files changed, 820 insertions(+)
 create mode 100644 arch/arm64/kernel/perf_event_qcom_l3_cache.c