Message ID | 1425905192-10509-2-git-send-email-suzuki.poulose@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Mar 09, 2015 at 12:46:30PM +0000, Suzuki K. Poulose wrote: > From: "Suzuki K. Poulose" <suzuki.poulose@arm.com> > > Don't allow grouping hardware events from different PMUs > (eg. CCI + CPU). Uhm, how does this work? If we have multiple hardware PMUs we'll stop scheduling events after the first failed event schedule. This can leave one of the PMUs severely under utilized.
On 10/03/15 11:27, Peter Zijlstra wrote: > On Mon, Mar 09, 2015 at 12:46:30PM +0000, Suzuki K. Poulose wrote: >> From: "Suzuki K. Poulose" <suzuki.poulose@arm.com> >> >> Don't allow grouping hardware events from different PMUs >> (eg. CCI + CPU). > > Uhm, how does this work? If we have multiple hardware PMUs we'll stop > scheduling events after the first failed event schedule. This can leave > one of the PMUs severely under utilized. This is done from pmu->event_init(), where we haven't scheduled an event yet. Do you think we need to solve it using a different approach ? What is the best way to handle this situation ? Is it OK to allow different PMUs in the group ? Suzuki >
On Tue, Mar 10, 2015 at 11:27:23AM +0000, Peter Zijlstra wrote: > On Mon, Mar 09, 2015 at 12:46:30PM +0000, Suzuki K. Poulose wrote: > > From: "Suzuki K. Poulose" <suzuki.poulose@arm.com> > > > > Don't allow grouping hardware events from different PMUs > > (eg. CCI + CPU). > > Uhm, how does this work? If we have multiple hardware PMUs we'll stop > scheduling events after the first failed event schedule. This can leave > one of the PMUs severely under utilized. The problem is here group validation at pmu::event_init() time, not scheduling. We don't allow grouping across disparate HW PMUs because we can't provide group semantics anyway. Scheduling is not a problem in this case (unlike the big.LITTLE case I have a patch for [1]). We have a CPU PMU and an "uncore" CCI PMU. You can't create task-bound events for the CCI, but you can create CPU-bound events for the CCI on the nominal CPU the CCI is monitored from. The context check you added in c3c87e770458aa00 "perf: Tighten (and fix) the grouping condition" implicitly rejects groups that have CPU and CCI events (each event::ctx will be the relevant pmu::pmu_cpu_context and will differ), and this is sane -- you can't provide group semantics across disparate HW PMUs. Unfortunately that happens after we've done the event->pmu->event_init(event) dance on each event, and in our event_init function we try to verify the group is sane. In our verification we ignore SW events, but assume that all !SW events are for the CPU PMU. If you add a CPU event to a CCI group, that's not the case, and we use container_of on an unsuitable object, derefence garbage, invoke the eschaton and so on. It would be nicer if we could prevent this in the core so we're not reliant on every PMU driver doing the same verification. My initial thought was that seemed like unnecessary duplication of the ctx checking above, but if we're going to end up shoving it into several drivers anyway perhaps it's the lesser evil. Mark. [1] https://lkml.org/lkml/2015/2/5/520
On Tue, Mar 10, 2015 at 12:05:21PM +0000, Mark Rutland wrote: > On Tue, Mar 10, 2015 at 11:27:23AM +0000, Peter Zijlstra wrote: > > On Mon, Mar 09, 2015 at 12:46:30PM +0000, Suzuki K. Poulose wrote: > > > From: "Suzuki K. Poulose" <suzuki.poulose@arm.com> > > > > > > Don't allow grouping hardware events from different PMUs > > > (eg. CCI + CPU). > > > > Uhm, how does this work? If we have multiple hardware PMUs we'll stop > > scheduling events after the first failed event schedule. This can leave > > one of the PMUs severely under utilized. > > The problem is here group validation at pmu::event_init() time, not > scheduling. Maybe make that a little more explicit. > We don't allow grouping across disparate HW PMUs because we can't > provide group semantics anyway. Scheduling is not a problem in this case > (unlike the big.LITTLE case I have a patch for [1]). Right, I remember that; I was wondering if this was related. > We have a CPU PMU and an "uncore" CCI PMU. You can't create task-bound > events for the CCI, but you can create CPU-bound events for the CCI on > the nominal CPU the CCI is monitored from. Indeed, ok. > The context check you added in c3c87e770458aa00 "perf: Tighten (and fix) > the grouping condition" implicitly rejects groups that have CPU and CCI > events (each event::ctx will be the relevant pmu::pmu_cpu_context and > will differ), and this is sane -- you can't provide group semantics > across disparate HW PMUs. Agreed. > Unfortunately that happens after we've done the > event->pmu->event_init(event) dance on each event, and in our event_init > function we try to verify the group is sane. In our verification we > ignore SW events, but assume that all !SW events are for the CPU PMU. > If you add a CPU event to a CCI group, that's not the case, and we use > container_of on an unsuitable object, derefence garbage, invoke the > eschaton and so on. Indeed, on x86 we explicitly ignore everything not an x86_pmu event. > It would be nicer if we could prevent this in the core so we're not > reliant on every PMU driver doing the same verification. My initial > thought was that seemed like unnecessary duplication of the ctx checking > above, but if we're going to end up shoving it into several drivers > anyway perhaps it's the lesser evil. Again, agreed, that would be better and less error prone. But I'm not entirely sure how to go about doing it :/ I'll have to go think about that; and conferences are not the best place for that. Suggestions on that are welcome of course ;)
On Tue, Mar 10, 2015 at 01:53:51PM +0100, Peter Zijlstra wrote: > > It would be nicer if we could prevent this in the core so we're not > > reliant on every PMU driver doing the same verification. My initial > > thought was that seemed like unnecessary duplication of the ctx checking > > above, but if we're going to end up shoving it into several drivers > > anyway perhaps it's the lesser evil. > > Again, agreed, that would be better and less error prone. But I'm not > entirely sure how to go about doing it :/ I'll have to go think about > that; and conferences are not the best place for that. > > Suggestions on that are welcome of course ;) So the problem is that event_init() is what will return the pmu, so we cannot make decisions on it until after that returns. Maybe we can pull out the validate step into its own funciton; pmu->validate() or whatnot, to be called slightly later.
> So the problem is that event_init() is what will return the pmu, so we > cannot make decisions on it until after that returns. I took a look into hacking something into perf_try_init_event, but it ends up duplicating all of the existing tests and looks really out of place. > Maybe we can pull out the validate step into its own funciton; > pmu->validate() or whatnot, to be called slightly later. Perhaps the other way around: move the call to pmu::event_init() later (after the other checks in perf_event_open), and in its original spot add a new pmu::can_handle() that only tells us whether an event in isolation is for this PMU, without validation of the specifics of the event. That would keep the split between the two callbacks more clearly defined, though it would require updating most PMU drivers, so perhaps that's too invasive. I'll dig into this a bit. Mark.
On Tue, Mar 10, 2015 at 12:53:51PM +0000, Peter Zijlstra wrote: > On Tue, Mar 10, 2015 at 12:05:21PM +0000, Mark Rutland wrote: > > On Tue, Mar 10, 2015 at 11:27:23AM +0000, Peter Zijlstra wrote: > > > On Mon, Mar 09, 2015 at 12:46:30PM +0000, Suzuki K. Poulose wrote: > > > > From: "Suzuki K. Poulose" <suzuki.poulose@arm.com> > > > > > > > > Don't allow grouping hardware events from different PMUs > > > > (eg. CCI + CPU). > > > > > > Uhm, how does this work? If we have multiple hardware PMUs we'll stop > > > scheduling events after the first failed event schedule. This can leave > > > one of the PMUs severely under utilized. > > > > The problem is here group validation at pmu::event_init() time, not > > scheduling. > > Maybe make that a little more explicit. On the assumption that the patch is otherwise OK, how about the commit message below? ---->8---- arm/pmu: Reject groups spanning multiple hardware PMUs The perf core implicitly rejects events spanning multiple HW PMUs, as in these cases the event->ctx will differ. However this validation is performed after pmu::event_init() is called in perf_init_event(), and thus pmu::event_init() may be called with a group leader from a different HW PMU. The ARM PMU driver does not take this fact into account, and when validating groups assumes that it can call to_arm_pmu(event->pmu) for any HW event. When the event in question is from another HW PMU this is wrong, and results in dereferencing garbage. This patch updates the ARM PMU driver to first test for and reject events from other PMUs, moving the to_arm_pmu and related logic after this test. Fixes a crash triggered by perf_fuzzer on Linux-4.0-rc2, with a CCI PMU present: CPU: 0 PID: 1527 Comm: perf_fuzzer Not tainted 4.0.0-rc2 #57 Hardware name: ARM-Versatile Express task: bd8484c0 ti: be676000 task.ti: be676000 PC is at 0xbf1bbc90 LR is at validate_event+0x34/0x5c pc : [<bf1bbc90>] lr : [<80016060>] psr: 00000013 ... [<80016060>] (validate_event) from [<80016198>] (validate_group+0x28/0x90) [<80016198>] (validate_group) from [<80016398>] (armpmu_event_init+0x150/0x218) [<80016398>] (armpmu_event_init) from [<800882e4>] (perf_try_init_event+0x30/0x48) [<800882e4>] (perf_try_init_event) from [<8008f544>] (perf_init_event+0x5c/0xf4) [<8008f544>] (perf_init_event) from [<8008f8a8>] (perf_event_alloc+0x2cc/0x35c) [<8008f8a8>] (perf_event_alloc) from [<8009015c>] (SyS_perf_event_open+0x498/0xa70) [<8009015c>] (SyS_perf_event_open) from [<8000e420>] (ret_fast_syscall+0x0/0x34) Code: bf1be000 bf1bb380 802a2664 00000000 (00000002) ---[ end trace 01aff0ff00926a0a ]--- ---->8---- Mark.
On Tue, Mar 10, 2015 at 03:36:08PM +0000, Mark Rutland wrote: > On the assumption that the patch is otherwise OK, how about the commit > message below? Yeah, that'll do fine. Thanks!
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c index 557e128..4a86a01 100644 --- a/arch/arm/kernel/perf_event.c +++ b/arch/arm/kernel/perf_event.c @@ -259,20 +259,29 @@ out: } static int -validate_event(struct pmu_hw_events *hw_events, - struct perf_event *event) +validate_event(struct pmu *pmu, struct pmu_hw_events *hw_events, + struct perf_event *event) { - struct arm_pmu *armpmu = to_arm_pmu(event->pmu); + struct arm_pmu *armpmu; if (is_software_event(event)) return 1; + /* + * Reject groups spanning multiple HW PMUs (e.g. CPU + CCI). The + * core perf code won't check that the pmu->ctx == leader->ctx + * until after pmu->event_init(event). + */ + if (event->pmu != pmu) + return 0; + if (event->state < PERF_EVENT_STATE_OFF) return 1; if (event->state == PERF_EVENT_STATE_OFF && !event->attr.enable_on_exec) return 1; + armpmu = to_arm_pmu(event->pmu); return armpmu->get_event_idx(hw_events, event) >= 0; } @@ -288,15 +297,15 @@ validate_group(struct perf_event *event) */ memset(&fake_pmu.used_mask, 0, sizeof(fake_pmu.used_mask)); - if (!validate_event(&fake_pmu, leader)) + if (!validate_event(event->pmu, &fake_pmu, leader)) return -EINVAL; list_for_each_entry(sibling, &leader->sibling_list, group_entry) { - if (!validate_event(&fake_pmu, sibling)) + if (!validate_event(event->pmu, &fake_pmu, sibling)) return -EINVAL; } - if (!validate_event(&fake_pmu, event)) + if (!validate_event(event->pmu, &fake_pmu, event)) return -EINVAL; return 0;