Message ID | 1541066648-40690-2-git-send-email-wei.w.wang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Intel Virtual PMU Optimization |
On Thu, Nov 01, 2018 at 06:04:01PM +0800, Wei Wang wrote:
> Add x86_perf_mask_perf_counters to reserve counters from the host perf
> subsystem. The masked counters will not be assigned to any host perf
> events. This can be used by the hypervisor to reserve perf counters for
> a guest to use.
>
> This function is currently supported on Intel CPUs only, but put in x86
> perf core because the counter assignment is implemented here and we need
> to re-enable the pmu which is defined in the x86 perf core in the case
> that a counter to be masked happens to be used by the host.
>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/perf_event.h |  1 +
>  2 files changed, 38 insertions(+)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 106911b..e73135a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -716,6 +716,7 @@ struct perf_sched {
>  static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
>  			    int num, int wmin, int wmax, int gpmax)
>  {
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>  	int idx;
>
>  	memset(sched, 0, sizeof(*sched));
> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>  	sched->max_weight = wmax;
>  	sched->max_gp = gpmax;
>  	sched->constraints = constraints;
> +#ifdef CONFIG_CPU_SUP_INTEL
> +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
> +#endif

NAK. This completely undermines the whole purpose of event scheduling.
On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 06:04:01PM +0800, Wei Wang wrote:
>> Add x86_perf_mask_perf_counters to reserve counters from the host perf
>> subsystem. The masked counters will not be assigned to any host perf
>> events. This can be used by the hypervisor to reserve perf counters for
>> a guest to use.
>>
>> This function is currently supported on Intel CPUs only, but put in x86
>> perf core because the counter assignment is implemented here and we need
>> to re-enable the pmu which is defined in the x86 perf core in the case
>> that a counter to be masked happens to be used by the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
>>  arch/x86/include/asm/perf_event.h |  1 +
>>  2 files changed, 38 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 106911b..e73135a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -716,6 +716,7 @@ struct perf_sched {
>>  static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
>>  			    int num, int wmin, int wmax, int gpmax)
>>  {
>> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>  	int idx;
>>
>>  	memset(sched, 0, sizeof(*sched));
>> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>>  	sched->max_weight = wmax;
>>  	sched->max_gp = gpmax;
>>  	sched->constraints = constraints;
>> +#ifdef CONFIG_CPU_SUP_INTEL
>> +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
>> +#endif
> NAK. This completely undermines the whole purpose of event scheduling.
>

Hi Peter,

Could you share more details how it would affect the host side event
scheduling?

We wanted to partition out the perf counters that the guest needs and
dedicate them to guest usage for that period (i.e. the guest occupation
period). The remaining counters are still schedulable for the host
events (there won't be many other perf events when the guest runs;
watchdog_hld seems to be the one that would).

Would you have any suggestions? Thanks.

Best,
Wei
On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
> On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> > > @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
> > >  	sched->max_weight = wmax;
> > >  	sched->max_gp = gpmax;
> > >  	sched->constraints = constraints;
> > > +#ifdef CONFIG_CPU_SUP_INTEL
> > > +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
> > > +#endif
> > NAK. This completely undermines the whole purpose of event scheduling.
> >
> Hi Peter,
>
> Could you share more details how it would affect the host side event
> scheduling?

Not all counters are equal; suppose you have one of those chips that can
only do PEBS on counter 0, and then hand out 0 to the guest for some
silly event. That means nobody can use PEBS anymore.

> Would you have any suggestions?

I would suggest not to use virt in the first place of course ;-)

But whatever you do; you have to keep using host events to emulate the
guest PMU. That doesn't mean you can't improve things; that code is
quite insane from what you told earlier.
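For context on what "constrained events" means here: the x86 perf driver
encodes "this event may only run on certain counters" in event constraint
tables. Below is a minimal sketch of such an entry, assuming a hypothetical
table name and an illustrative event code (neither is taken from the thread
or from any particular CPU model):

/*
 * Illustrative sketch only: the table name and the 0x01c0 event code are
 * examples.  The second argument of INTEL_UEVENT_CONSTRAINT() is the
 * counter bitmask, so 0x1 means "general-purpose counter 0 only"; if
 * counter 0 is handed out elsewhere, an event with this constraint can
 * no longer be scheduled at all.
 */
#include "../perf_event.h"	/* as used by arch/x86/events/intel/core.c */

static struct event_constraint example_constraints[] __read_mostly = {
	INTEL_UEVENT_CONSTRAINT(0x01c0, 0x1),	/* event 0x01c0: counter 0 only */
	EVENT_CONSTRAINT_END
};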
On 11/05/2018 05:34 PM, Peter Zijlstra wrote:
> On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
>> On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
>>>> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>>>>  	sched->max_weight = wmax;
>>>>  	sched->max_gp = gpmax;
>>>>  	sched->constraints = constraints;
>>>> +#ifdef CONFIG_CPU_SUP_INTEL
>>>> +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
>>>> +#endif
>>> NAK. This completely undermines the whole purpose of event scheduling.
>>>
>> Hi Peter,
>>
>> Could you share more details how it would affect the host side event
>> scheduling?
> Not all counters are equal; suppose you have one of those chips that can
> only do PEBS on counter 0, and then hand out 0 to the guest for some
> silly event. That means nobody can use PEBS anymore.

Thanks for sharing your point.

In this example (assume PEBS can only work with counter 0), how would
the existing approach (i.e. using a host event to emulate) work? For
example, if the guest wants to use PEBS and the host also wants to use
PEBS (or another feature that only counter 0 fits), I think either the
guest or the host will not work then.

With the register level virtualization approach, we could further
support that case: if the guest requests to use a counter which the
host happens to be using, we can let host and guest both be satisfied
by supporting counter context switching on guest/host switching. In
this case, both guest and host can use counter 0. (I think this is
actually a policy selection; the current series chooses to be guest
first, and we can further change it if necessary.)

>> Would you have any suggestions?
> I would suggest not to use virt in the first place of course ;-)
>
> But whatever you do; you have to keep using host events to emulate the
> guest PMU. That doesn't mean you can't improve things; that code is
> quite insane from what you told earlier.

I agree that the host event emulation is a functional approach, but it
may not be an efficient one (we have also got complaints from people
about today's perf in the guest). We actually have a similar problem
when doing network virtualization: the more efficient approach tends to
be the one that bypasses the host network stack. Both the network stack
and the perf stack seem too heavy to be used as part of the emulation.

Hope we could have a thorough discussion of the two approaches from the
virtualization point of view, and get to know why host event emulation
has to be the one, or whether it could be better to use the register
level virtualization model.

Best,
Wei
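A minimal sketch of the "counter context switching on guest/host switching"
idea described above, assuming per-counter save/restore around VM entry and
exit; the struct and function names are illustrative and not part of the
posted series:

/*
 * Sketch only: save and stop one host general-purpose counter before
 * entering the guest, and restore it after the guest exits, so host and
 * guest can time-share the same physical counter.  Names are made up
 * for illustration.
 */
#include <linux/types.h>
#include <asm/msr.h>

struct gp_counter_state {
	u64 eventsel;	/* IA32_PERFEVTSELx */
	u64 count;	/* IA32_PMCx */
};

static void host_counter_save(int idx, struct gp_counter_state *s)
{
	rdmsrl(MSR_P6_EVNTSEL0 + idx, s->eventsel);
	wrmsrl(MSR_P6_EVNTSEL0 + idx, 0);	/* stop it while the guest owns it */
	rdmsrl(MSR_IA32_PERFCTR0 + idx, s->count);
}

static void host_counter_restore(int idx, struct gp_counter_state *s)
{
	/* legacy 32-bit write; a full-width restore would use MSR_IA32_PMC0 */
	wrmsrl(MSR_IA32_PERFCTR0 + idx, s->count);
	wrmsrl(MSR_P6_EVNTSEL0 + idx, s->eventsel);
}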
On Mon, Nov 05, 2018 at 07:19:01PM +0800, Wei Wang wrote:
> On 11/05/2018 05:34 PM, Peter Zijlstra wrote:
> > On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
> > > On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> > > > > @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
> > > > >  	sched->max_weight = wmax;
> > > > >  	sched->max_gp = gpmax;
> > > > >  	sched->constraints = constraints;
> > > > > +#ifdef CONFIG_CPU_SUP_INTEL
> > > > > +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
> > > > > +#endif
> > > > NAK. This completely undermines the whole purpose of event scheduling.
> > > >
> > > Hi Peter,
> > >
> > > Could you share more details how it would affect the host side event
> > > scheduling?
> > Not all counters are equal; suppose you have one of those chips that can
> > only do PEBS on counter 0, and then hand out 0 to the guest for some
> > silly event. That means nobody can use PEBS anymore.
>
> Thanks for sharing your point.
>
> In this example (assume PEBS can only work with counter 0), how would
> the existing approach (i.e. using a host event to emulate) work? For
> example, if the guest wants to use PEBS and the host also wants to use
> PEBS (or another feature that only counter 0 fits), I think either the
> guest or the host will not work then.

The answer for PEBS is really simple; PEBS does not virtualize (Andi
tried and can tell you why; IIRC it has something to do with how the
hardware asks for a Linear Address instead of a Physical Address). So
the problem will not arise.

But there are certainly constrained events that will result in the same
problem.

The traditional approach of perf on resource contention is to share it;
you get only partial runtime and can scale up the events given the
runtime metrics provided.

We also have perf_event_attr::pinned, which is normally only available
to root, in which case we'll end up marking any contending event to an
error state.

Neither is ideal for MSR level emulation.

> With the register level virtualization approach, we could further
> support that case: if the guest requests to use a counter which the
> host happens to be using, we can let host and guest both be satisfied
> by supporting counter context switching on guest/host switching. In
> this case, both guest and host can use counter 0. (I think this is
> actually a policy selection; the current series chooses to be guest
> first, and we can further change it if necessary.)

That can only work if the host counter has
perf_event_attr::exclude_guest=1; any counter without that must also
count when the guest is running.

(and, IIRC, normal perf tool events do not have that set by default)

> > > Would you have any suggestions?
> > I would suggest not to use virt in the first place of course ;-)
> >
> > But whatever you do; you have to keep using host events to emulate the
> > guest PMU. That doesn't mean you can't improve things; that code is
> > quite insane from what you told earlier.
>
> I agree that the host event emulation is a functional approach, but it
> may not be an efficient one (we have also got complaints from people
> about today's perf in the guest). We actually have a similar problem
> when doing network virtualization: the more efficient approach tends to
> be the one that bypasses the host network stack. Both the network stack
> and the perf stack seem too heavy to be used as part of the emulation.

The thing is; you cannot do blind pass-through of the PMU, some of its
features simply do not work in a guest. Also, the host perf driver
expects certain functionality that must be respected.

Those are the constraints you have to work with.

Back when we all started down this virt rathole, I proposed people do
paravirt perf, where events would be handed to the host kernel and let
the host kernel do its normal thing. But people wanted to do the MSR
based thing because of !linux guests.

Now I don't care about virt much, but I care about !linux guests even
less.
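For reference, the exclude_guest and pinned attributes Peter mentions above
are plain bits in struct perf_event_attr. A minimal user-space sketch of
opening such a host-only event follows; the event choice and helper name are
arbitrary, chosen only to illustrate the two bits:

/*
 * User-space sketch: open a cycles event on the current task with
 * exclude_guest (do not count while a guest runs) and pinned (go to an
 * error state on contention instead of multiplexing).  Illustration only.
 */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_host_only_cycles(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_guest = 1;		/* stop counting while a guest runs */
	attr.pinned = 1;		/* contention -> error state, no sharing */
	attr.disabled = 1;

	return syscall(SYS_perf_event_open, &attr, 0 /* this task */,
		       -1 /* any cpu */, -1 /* no group */, 0 /* flags */);
}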
On Monday, November 5, 2018 8:14 PM, Peter Zijlstra wrote:
> The answer for PEBS is really simple; PEBS does not virtualize (Andi
> tried and can tell you why; IIRC it has something to do with how the
> hardware asks for a Linear Address instead of a Physical Address). So
> the problem will not arise.
>
> But there are certainly constrained events that will result in the same
> problem.

Yes. I just followed the PEBS assumption to discuss the counter
contention that you mentioned.

> The traditional approach of perf on resource contention is to share it;
> you get only partial runtime and can scale up the events given the
> runtime metrics provided.
>
> We also have perf_event_attr::pinned, which is normally only available
> to root, in which case we'll end up marking any contending event to an
> error state.

Yes, that's one of the limitations of the existing host event emulation
approach - in that case (i.e. both require counter 0 to work) the
existing approach fails to make guest perf work, even if the pinned
event has ".exclude_guest" set.

> Neither is ideal for MSR level emulation.
>
> That can only work if the host counter has
> perf_event_attr::exclude_guest=1; any counter without that must also
> count when the guest is running.
>
> (and, IIRC, normal perf tool events do not have that set by default)

Probably not. Please see line 81 at
https://github.com/torvalds/linux/blob/master/tools/perf/util/util.c
perf_guest is false by default, which makes "attr->exclude_guest = 1".

> The thing is; you cannot do blind pass-through of the PMU, some of its
> features simply do not work in a guest. Also, the host perf driver
> expects certain functionality that must be respected.

Actually we are not blindly assigning the perf counters. The guest works
with its own complete perf stack (like the one on the host), which also
has its own constraints. The perf counter that the guest requests from
the hypervisor is the one that comes out of its own event constraints
(i.e. the one that will work for that feature on the guest).

The counter is also not passed through to the guest: guest accesses to
the assigned counter will still exit to the hypervisor, and the
hypervisor helps update the counter. Please correct me if I
misunderstood your point.

> Those are the constraints you have to work with.
>
> Back when we all started down this virt rathole, I proposed people do
> paravirt perf, where events would be handed to the host kernel and let
> the host kernel do its normal thing. But people wanted to do the MSR
> based thing because of !linux guests.

IMHO, it is worthwhile to care more about the real use case. When a user
gets a virtual machine from a vendor, all he can do is run perf inside
the guest. The above contention concerns would not happen, because the
user wouldn't be able to come to the host to run perf on the
virtualization software (e.g. ./perf qemu..) and at the same time run
perf in the guest to cause the contention.

On the other hand, when we improve the user experience of running perf
inside the guest by reducing the virtualization overhead, that would
bring real benefits to the real use case.

Best,
Wei
Please; don't send malformed emails like this. Lines wrap at 78 chars.

On Mon, Nov 05, 2018 at 03:37:24PM +0000, Wang, Wei W wrote:
> On Monday, November 5, 2018 8:14 PM, Peter Zijlstra wrote:
> > That can only work if the host counter has
> > perf_event_attr::exclude_guest=1; any counter without that must also
> > count when the guest is running.
> >
> > (and, IIRC, normal perf tool events do not have that set by default)
>
> Probably not. Please see line 81 at
> https://github.com/torvalds/linux/blob/master/tools/perf/util/util.c
> perf_guest is false by default, which makes "attr->exclude_guest = 1".

Then you're in luck. But if the host creates an event that has
exclude_guest=0 set, it should still work.

> > The thing is; you cannot do blind pass-through of the PMU, some of its
> > features simply do not work in a guest. Also, the host perf driver
> > expects certain functionality that must be respected.
>
> Actually we are not blindly assigning the perf counters. The guest works
> with its own complete perf stack (like the one on the host), which also
> has its own constraints.

But it knows nothing of the host state.

> The counter is also not passed through to the guest: guest accesses to
> the assigned counter will still exit to the hypervisor, and the
> hypervisor helps update the counter.

Yes, you have to; because the PMU doesn't properly virtualize, also
because the HV -- linux in our case -- already claimed the PMU.

So the network passthrough case you mentioned simply doesn't apply at
all. Don't bother looking at it for inspiration.

> > Those are the constraints you have to work with.
> >
> > Back when we all started down this virt rathole, I proposed people do
> > paravirt perf, where events would be handed to the host kernel and let
> > the host kernel do its normal thing. But people wanted to do the MSR
> > based thing because of !linux guests.
>
> IMHO, it is worthwhile to care more about the real use case. When a
> user gets a virtual machine from a vendor, all he can do is run perf
> inside the guest. The above contention concerns would not happen,
> because the user wouldn't be able to come to the host to run perf on
> the virtualization software (e.g. ./perf qemu..) and at the same time
> run perf in the guest to cause the contention.

That's your job. Mine is to make sure that whatever you propose fits in
the existing model and doesn't make a giant mess of things.

And for Linux guests on Linux hosts, paravirt perf still makes the most
sense to me; then you get the host scheduling all the events and
providing the guest with the proper counts/runtimes/state.

> On the other hand, when we improve the user experience of running perf
> inside the guest by reducing the virtualization overhead, that would
> bring real benefits to the real use case.

You can start to improve things by doing a less stupid implementation of
the existing code.
I ran into a similar problem with my PEBS virtualization patchkit. My
solution was basically: schedule as normal, but tell the scheduler to
force-allocate a counter on a specific index. It can be done with only a
few lines of change in the scheduler code.

-Andi
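One possible shape of what Andi describes, purely as a sketch; the "forced
counter" bookkeeping below is invented for illustration and is not claimed
to be what the patchkit actually did:

/*
 * Sketch only: pin one event to one specific counter index while every
 * other event still goes through normal constraint scheduling.  The
 * struct below is made up for illustration.
 */
#include <linux/bitops.h>
#include <linux/errno.h>

struct forced_counter_request {
	bool forced;	/* this event asked for one specific counter */
	int idx;	/* the requested counter index */
};

/*
 * Called from the counter-assignment path: if the event requests a fixed
 * index, either grant exactly that index or fail scheduling; otherwise
 * fall back to the normal counter search.
 */
static int sched_try_forced_counter(unsigned long *used_mask,
				    const struct forced_counter_request *req)
{
	if (!req->forced)
		return -1;		/* not forced: normal counter search */

	if (__test_and_set_bit(req->idx, used_mask))
		return -ENOSPC;		/* requested counter already taken */

	return req->idx;		/* use exactly this index */
}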
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 106911b..e73135a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -716,6 +716,7 @@ struct perf_sched {
 static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
 			    int num, int wmin, int wmax, int gpmax)
 {
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int idx;
 
 	memset(sched, 0, sizeof(*sched));
@@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
 	sched->max_weight = wmax;
 	sched->max_gp = gpmax;
 	sched->constraints = constraints;
+#ifdef CONFIG_CPU_SUP_INTEL
+	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
+#endif
 
 	for (idx = 0; idx < num; idx++) {
 		if (constraints[idx]->weight == wmin)
@@ -2386,6 +2390,39 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 	}
 }
 
+#ifdef CONFIG_CPU_SUP_INTEL
+/**
+ * x86_perf_mask_perf_counters - mask perf counters
+ * @mask: the bitmask of counters
+ *
+ * Mask the perf counters that are not available to be used by the perf core.
+ * If the counter to be masked has been assigned, it will be taken back and
+ * then the perf core will re-assign usable counters to its events.
+ *
+ * This can be used by a component outside the perf core to reserve counters.
+ * For example, a hypervisor uses it to reserve counters for a guest to use,
+ * and later return the counters by another call with the related bits cleared.
+ */
+void x86_perf_mask_perf_counters(u64 mask)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/*
+	 * If the counter happens to be used by a host event, take it back
+	 * first, and then restart the pmu after mask that counter as being
+	 * reserved.
+	 */
+	if (mask & cpuc->intel_ctrl_host_mask) {
+		perf_pmu_disable(&pmu);
+		cpuc->intel_ctrl_guest_mask = mask;
+		perf_pmu_enable(&pmu);
+	} else {
+		cpuc->intel_ctrl_guest_mask = mask;
+	}
+}
+EXPORT_SYMBOL_GPL(x86_perf_mask_perf_counters);
+#endif
+
 static inline int valid_user_frame(const void __user *fp, unsigned long size)
 {
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8bdf749..5b4463e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -297,6 +297,7 @@ static inline void perf_check_microcode(void) { }
 
 #ifdef CONFIG_CPU_SUP_INTEL
 extern void intel_pt_handle_vmx(int on);
+extern void x86_perf_mask_perf_counters(u64 mask);
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
Add x86_perf_mask_perf_counters to reserve counters from the host perf
subsystem. The masked counters will not be assigned to any host perf
events. This can be used by the hypervisor to reserve perf counters for
a guest to use.

This function is currently supported on Intel CPUs only, but put in x86
perf core because the counter assignment is implemented here and we need
to re-enable the pmu which is defined in the x86 perf core in the case
that a counter to be masked happens to be used by the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h |  1 +
 2 files changed, 38 insertions(+)
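A minimal sketch of how a hypervisor-side caller (for example, KVM's vPMU
code) might use the new export as described above; only
x86_perf_mask_perf_counters() comes from this patch, while the function
names and the mask bookkeeping are assumptions for illustration:

/*
 * Hypothetical caller of the new export.  Only x86_perf_mask_perf_counters()
 * is from this patch; everything else is illustrative.
 */
#include <linux/types.h>
#include <asm/perf_event.h>

static u64 vpmu_guest_counters;	/* GP counter bits handed to the guest */

static void vpmu_reserve_counters(u64 mask)
{
	vpmu_guest_counters = mask;
	/* take these counters away from host perf scheduling */
	x86_perf_mask_perf_counters(vpmu_guest_counters);
}

static void vpmu_release_counters(void)
{
	vpmu_guest_counters = 0;
	/* clearing the bits gives the counters back to the host */
	x86_perf_mask_perf_counters(0);
}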