Message ID | 1541066648-40690-2-git-send-email-wei.w.wang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Intel Virtual PMU Optimization |
On Thu, Nov 01, 2018 at 06:04:01PM +0800, Wei Wang wrote:
> Add x86_perf_mask_perf_counters to reserve counters from the host perf
> subsystem. The masked counters will not be assigned to any host perf
> events. This can be used by the hypervisor to reserve perf counters for
> a guest to use.
>
> This function is currently supported on Intel CPUs only, but put in x86
> perf core because the counter assignment is implemented here and we need
> to re-enable the pmu which is defined in the x86 perf core in the case
> that a counter to be masked happens to be used by the host.
>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/perf_event.h |  1 +
>  2 files changed, 38 insertions(+)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 106911b..e73135a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -716,6 +716,7 @@ struct perf_sched {
>  static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
>  			    int num, int wmin, int wmax, int gpmax)
>  {
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>  	int idx;
>
>  	memset(sched, 0, sizeof(*sched));
> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>  	sched->max_weight = wmax;
>  	sched->max_gp = gpmax;
>  	sched->constraints = constraints;
> +#ifdef CONFIG_CPU_SUP_INTEL
> +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
> +#endif

NAK. This completely undermines the whole purpose of event scheduling.
On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 06:04:01PM +0800, Wei Wang wrote:
>> Add x86_perf_mask_perf_counters to reserve counters from the host perf
>> subsystem. The masked counters will not be assigned to any host perf
>> events. This can be used by the hypervisor to reserve perf counters for
>> a guest to use.
>>
>> This function is currently supported on Intel CPUs only, but put in x86
>> perf core because the counter assignment is implemented here and we need
>> to re-enable the pmu which is defined in the x86 perf core in the case
>> that a counter to be masked happens to be used by the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
>>  arch/x86/include/asm/perf_event.h |  1 +
>>  2 files changed, 38 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 106911b..e73135a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -716,6 +716,7 @@ struct perf_sched {
>>  static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
>>  			    int num, int wmin, int wmax, int gpmax)
>>  {
>> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>  	int idx;
>>
>>  	memset(sched, 0, sizeof(*sched));
>> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>>  	sched->max_weight = wmax;
>>  	sched->max_gp = gpmax;
>>  	sched->constraints = constraints;
>> +#ifdef CONFIG_CPU_SUP_INTEL
>> +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
>> +#endif
> NAK. This completely undermines the whole purpose of event scheduling.
>

Hi Peter,

Could you share more details how it would affect the host side event
scheduling?

We wanted to partition out the perf counters that the guest needs and
dedicate them to guest usage for that period (i.e. the guest occupation
period). The remaining counters are still schedulable for the host
events (there won't be many other perf events when the guest runs;
watchdog_hld seems to be the one that would).

Would you have any suggestions? Thanks.

Best,
Wei
On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
> On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> > > @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
> > >  	sched->max_weight = wmax;
> > >  	sched->max_gp = gpmax;
> > >  	sched->constraints = constraints;
> > > +#ifdef CONFIG_CPU_SUP_INTEL
> > > +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
> > > +#endif
> > NAK. This completely undermines the whole purpose of event scheduling.
> >
> Hi Peter,
>
> Could you share more details how it would affect the host side event
> scheduling?

Not all counters are equal; suppose you have one of those chips that can
only do PEBS on counter 0, and then hand out 0 to the guest for some
silly event. That means nobody can use PEBS anymore.

> Would you have any suggestions?

I would suggest not to use virt in the first place of course ;-)

But whatever you do; you have to keep using host events to emulate the
guest PMU. That doesn't mean you can't improve things; that code is
quite insane from what you told earlier.
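For context on what "constrained events" means here: the x86 perf driver
encodes "this event may only run on certain counters" in event constraint
tables. Below is a minimal sketch of such an entry, assuming a hypothetical
table name and an illustrative event code (neither is taken from the thread
or from any particular CPU model):

/*
 * Illustrative sketch only: the table name and the 0x01c0 event code are
 * examples.  The second argument of INTEL_UEVENT_CONSTRAINT() is the
 * counter bitmask, so 0x1 means "general-purpose counter 0 only"; if
 * counter 0 is handed out elsewhere, an event with this constraint can
 * no longer be scheduled at all.
 */
#include "../perf_event.h"	/* as used by arch/x86/events/intel/core.c */

static struct event_constraint example_constraints[] __read_mostly = {
	INTEL_UEVENT_CONSTRAINT(0x01c0, 0x1),	/* event 0x01c0: counter 0 only */
	EVENT_CONSTRAINT_END
};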
On 11/05/2018 05:34 PM, Peter Zijlstra wrote:
> On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
>> On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
>>>> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>>>>  	sched->max_weight = wmax;
>>>>  	sched->max_gp = gpmax;
>>>>  	sched->constraints = constraints;
>>>> +#ifdef CONFIG_CPU_SUP_INTEL
>>>> +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
>>>> +#endif
>>> NAK. This completely undermines the whole purpose of event scheduling.
>>>
>> Hi Peter,
>>
>> Could you share more details how it would affect the host side event
>> scheduling?
> Not all counters are equal; suppose you have one of those chips that can
> only do PEBS on counter 0, and then hand out 0 to the guest for some
> silly event. That means nobody can use PEBS anymore.

Thanks for sharing your point.

In this example (assume PEBS can only work with counter 0), how would
the existing approach (i.e. using a host event to emulate) work? For
example, if the guest wants to use PEBS and the host also wants to use
PEBS (or another feature that only counter 0 fits), I think either the
guest or the host will not work then.

With the register level virtualization approach, we could further
support that case: if the guest requests to use a counter which the
host happens to be using, we can let host and guest both be satisfied
by supporting counter context switching on guest/host switching. In
this case, both guest and host can use counter 0. (I think this is
actually a policy selection; the current series chooses to be guest
first, and we can further change it if necessary.)

>> Would you have any suggestions?
> I would suggest not to use virt in the first place of course ;-)
>
> But whatever you do; you have to keep using host events to emulate the
> guest PMU. That doesn't mean you can't improve things; that code is
> quite insane from what you told earlier.

I agree that the host event emulation is a functional approach, but it
may not be an efficient one (we have also got complaints from people
about today's perf in the guest). We actually have a similar problem
when doing network virtualization: the more efficient approach tends to
be the one that bypasses the host network stack. Both the network stack
and the perf stack seem too heavy to be used as part of the emulation.

Hope we could have a thorough discussion of the two approaches from the
virtualization point of view, and get to know why host event emulation
has to be the one, or whether it could be better to use the register
level virtualization model.

Best,
Wei
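A minimal sketch of the "counter context switching on guest/host switching"
idea described above, assuming per-counter save/restore around VM entry and
exit; the struct and function names are illustrative and not part of the
posted series:

/*
 * Sketch only: save and stop one host general-purpose counter before
 * entering the guest, and restore it after the guest exits, so host and
 * guest can time-share the same physical counter.  Names are made up
 * for illustration.
 */
#include <linux/types.h>
#include <asm/msr.h>

struct gp_counter_state {
	u64 eventsel;	/* IA32_PERFEVTSELx */
	u64 count;	/* IA32_PMCx */
};

static void host_counter_save(int idx, struct gp_counter_state *s)
{
	rdmsrl(MSR_P6_EVNTSEL0 + idx, s->eventsel);
	wrmsrl(MSR_P6_EVNTSEL0 + idx, 0);	/* stop it while the guest owns it */
	rdmsrl(MSR_IA32_PERFCTR0 + idx, s->count);
}

static void host_counter_restore(int idx, struct gp_counter_state *s)
{
	/* legacy 32-bit write; a full-width restore would use MSR_IA32_PMC0 */
	wrmsrl(MSR_IA32_PERFCTR0 + idx, s->count);
	wrmsrl(MSR_P6_EVNTSEL0 + idx, s->eventsel);
}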
On Mon, Nov 05, 2018 at 07:19:01PM +0800, Wei Wang wrote:
> On 11/05/2018 05:34 PM, Peter Zijlstra wrote:
> > On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
> > > On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> > > > > @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
> > > > >  	sched->max_weight = wmax;
> > > > >  	sched->max_gp = gpmax;
> > > > >  	sched->constraints = constraints;
> > > > > +#ifdef CONFIG_CPU_SUP_INTEL
> > > > > +	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
> > > > > +#endif
> > > > NAK. This completely undermines the whole purpose of event scheduling.
> > > >
> > > Hi Peter,
> > >
> > > Could you share more details how it would affect the host side event
> > > scheduling?
> > Not all counters are equal; suppose you have one of those chips that can
> > only do PEBS on counter 0, and then hand out 0 to the guest for some
> > silly event. That means nobody can use PEBS anymore.
>
> Thanks for sharing your point.
>
> In this example (assume PEBS can only work with counter 0), how would
> the existing approach (i.e. using a host event to emulate) work? For
> example, if the guest wants to use PEBS and the host also wants to use
> PEBS (or another feature that only counter 0 fits), I think either the
> guest or the host will not work then.

The answer for PEBS is really simple; PEBS does not virtualize (Andi
tried and can tell you why; IIRC it has something to do with how the
hardware asks for a Linear Address instead of a Physical Address). So
the problem will not arise.

But there are certainly constrained events that will result in the same
problem.

The traditional approach of perf on resource contention is to share it;
you get only partial runtime and can scale up the events given the
runtime metrics provided.

We also have perf_event_attr::pinned, which is normally only available
to root, in which case we'll end up marking any contending event to an
error state.

Neither is ideal for MSR level emulation.

> With the register level virtualization approach, we could further
> support that case: if the guest requests to use a counter which the
> host happens to be using, we can let host and guest both be satisfied
> by supporting counter context switching on guest/host switching. In
> this case, both guest and host can use counter 0. (I think this is
> actually a policy selection; the current series chooses to be guest
> first, and we can further change it if necessary.)

That can only work if the host counter has
perf_event_attr::exclude_guest=1; any counter without that must also
count when the guest is running.

(and, IIRC, normal perf tool events do not have that set by default)

> > > Would you have any suggestions?
> > I would suggest not to use virt in the first place of course ;-)
> >
> > But whatever you do; you have to keep using host events to emulate the
> > guest PMU. That doesn't mean you can't improve things; that code is
> > quite insane from what you told earlier.
>
> I agree that the host event emulation is a functional approach, but it
> may not be an efficient one (we have also got complaints from people
> about today's perf in the guest). We actually have a similar problem
> when doing network virtualization: the more efficient approach tends to
> be the one that bypasses the host network stack. Both the network stack
> and the perf stack seem too heavy to be used as part of the emulation.

The thing is; you cannot do blind pass-through of the PMU, some of its
features simply do not work in a guest. Also, the host perf driver
expects certain functionality that must be respected.

Those are the constraints you have to work with.

Back when we all started down this virt rathole, I proposed people do
paravirt perf, where events would be handed to the host kernel and let
the host kernel do its normal thing. But people wanted to do the MSR
based thing because of !linux guests.

Now I don't care about virt much, but I care about !linux guests even
less.
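For reference, the exclude_guest and pinned attributes Peter mentions above
are plain bits in struct perf_event_attr. A minimal user-space sketch of
opening such a host-only event follows; the event choice and helper name are
arbitrary, chosen only to illustrate the two bits:

/*
 * User-space sketch: open a cycles event on the current task with
 * exclude_guest (do not count while a guest runs) and pinned (go to an
 * error state on contention instead of multiplexing).  Illustration only.
 */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_host_only_cycles(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_guest = 1;		/* stop counting while a guest runs */
	attr.pinned = 1;		/* contention -> error state, no sharing */
	attr.disabled = 1;

	return syscall(SYS_perf_event_open, &attr, 0 /* this task */,
		       -1 /* any cpu */, -1 /* no group */, 0 /* flags */);
}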
On Monday, November 5, 2018 8:14 PM, Peter Zijlstra wrote:
> The answer for PEBS is really simple; PEBS does not virtualize (Andi
> tried and can tell you why; IIRC it has something to do with how the
> hardware asks for a Linear Address instead of a Physical Address). So
> the problem will not arise.
>
> But there are certainly constrained events that will result in the same
> problem.

Yes. I just followed the PEBS assumption to discuss the counter
contention that you mentioned.

> The traditional approach of perf on resource contention is to share it;
> you get only partial runtime and can scale up the events given the
> runtime metrics provided.
>
> We also have perf_event_attr::pinned, which is normally only available
> to root, in which case we'll end up marking any contending event to an
> error state.

Yes, that's one of the limitations of the existing host event emulation
approach - in that case (i.e. both require counter 0 to work) the
existing approach fails to make guest perf work, even if the pinned
event has ".exclude_guest" set.

> Neither is ideal for MSR level emulation.
>
> That can only work if the host counter has
> perf_event_attr::exclude_guest=1; any counter without that must also
> count when the guest is running.
>
> (and, IIRC, normal perf tool events do not have that set by default)

Probably not. Please see line 81 at
https://github.com/torvalds/linux/blob/master/tools/perf/util/util.c
perf_guest is false by default, which makes "attr->exclude_guest = 1".

> The thing is; you cannot do blind pass-through of the PMU, some of its
> features simply do not work in a guest. Also, the host perf driver
> expects certain functionality that must be respected.

Actually we are not blindly assigning the perf counters. The guest works
with its own complete perf stack (like the one on the host), which also
has its own constraints. The perf counter that the guest requests from
the hypervisor is the one that comes out of its own event constraints
(i.e. the one that will work for that feature on the guest).

The counter is also not passed through to the guest: guest accesses to
the assigned counter will still exit to the hypervisor, and the
hypervisor helps update the counter. Please correct me if I
misunderstood your point.

> Those are the constraints you have to work with.
>
> Back when we all started down this virt rathole, I proposed people do
> paravirt perf, where events would be handed to the host kernel and let
> the host kernel do its normal thing. But people wanted to do the MSR
> based thing because of !linux guests.

IMHO, it is worthwhile to care more about the real use case. When a user
gets a virtual machine from a vendor, all he can do is run perf inside
the guest. The above contention concerns would not happen, because the
user wouldn't be able to come to the host to run perf on the
virtualization software (e.g. ./perf qemu..) and at the same time run
perf in the guest to cause the contention.

On the other hand, when we improve the user experience of running perf
inside the guest by reducing the virtualization overhead, that would
bring real benefits to the real use case.

Best,
Wei
Please; don't send malformed emails like this. Lines wrap at 78 chars.

On Mon, Nov 05, 2018 at 03:37:24PM +0000, Wang, Wei W wrote:
> On Monday, November 5, 2018 8:14 PM, Peter Zijlstra wrote:
> > That can only work if the host counter has
> > perf_event_attr::exclude_guest=1; any counter without that must also
> > count when the guest is running.
> >
> > (and, IIRC, normal perf tool events do not have that set by default)
>
> Probably not. Please see line 81 at
> https://github.com/torvalds/linux/blob/master/tools/perf/util/util.c
> perf_guest is false by default, which makes "attr->exclude_guest = 1".

Then you're in luck. But if the host creates an event that has
exclude_guest=0 set, it should still work.

> > The thing is; you cannot do blind pass-through of the PMU, some of its
> > features simply do not work in a guest. Also, the host perf driver
> > expects certain functionality that must be respected.
>
> Actually we are not blindly assigning the perf counters. The guest works
> with its own complete perf stack (like the one on the host), which also
> has its own constraints.

But it knows nothing of the host state.

> The counter is also not passed through to the guest: guest accesses to
> the assigned counter will still exit to the hypervisor, and the
> hypervisor helps update the counter.

Yes, you have to; because the PMU doesn't properly virtualize, also
because the HV -- linux in our case -- already claimed the PMU.

So the network passthrough case you mentioned simply doesn't apply at
all. Don't bother looking at it for inspiration.

> > Those are the constraints you have to work with.
> >
> > Back when we all started down this virt rathole, I proposed people do
> > paravirt perf, where events would be handed to the host kernel and let
> > the host kernel do its normal thing. But people wanted to do the MSR
> > based thing because of !linux guests.
>
> IMHO, it is worthwhile to care more about the real use case. When a
> user gets a virtual machine from a vendor, all he can do is run perf
> inside the guest. The above contention concerns would not happen,
> because the user wouldn't be able to come to the host to run perf on
> the virtualization software (e.g. ./perf qemu..) and at the same time
> run perf in the guest to cause the contention.

That's your job. Mine is to make sure that whatever you propose fits in
the existing model and doesn't make a giant mess of things.

And for Linux guests on Linux hosts, paravirt perf still makes the most
sense to me; then you get the host scheduling all the events and
providing the guest with the proper counts/runtimes/state.

> On the other hand, when we improve the user experience of running perf
> inside the guest by reducing the virtualization overhead, that would
> bring real benefits to the real use case.

You can start to improve things by doing a less stupid implementation of
the existing code.
I ran into a similar problem with my PEBS virtualization patchkit. My
solution was basically: schedule as normal, but tell the scheduler to
force-allocate a counter on a specific index. It can be done with only a
few lines of change in the scheduler code.

-Andi
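One possible shape of what Andi describes, purely as a sketch; the "forced
counter" bookkeeping below is invented for illustration and is not claimed
to be what the patchkit actually did:

/*
 * Sketch only: pin one event to one specific counter index while every
 * other event still goes through normal constraint scheduling.  The
 * struct below is made up for illustration.
 */
#include <linux/bitops.h>
#include <linux/errno.h>

struct forced_counter_request {
	bool forced;	/* this event asked for one specific counter */
	int idx;	/* the requested counter index */
};

/*
 * Called from the counter-assignment path: if the event requests a fixed
 * index, either grant exactly that index or fail scheduling; otherwise
 * fall back to the normal counter search.
 */
static int sched_try_forced_counter(unsigned long *used_mask,
				    const struct forced_counter_request *req)
{
	if (!req->forced)
		return -1;		/* not forced: normal counter search */

	if (__test_and_set_bit(req->idx, used_mask))
		return -ENOSPC;		/* requested counter already taken */

	return req->idx;		/* use exactly this index */
}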
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 106911b..e73135a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -716,6 +716,7 @@ struct perf_sched {
 static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
 			    int num, int wmin, int wmax, int gpmax)
 {
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int idx;
 
 	memset(sched, 0, sizeof(*sched));
@@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
 	sched->max_weight = wmax;
 	sched->max_gp = gpmax;
 	sched->constraints = constraints;
+#ifdef CONFIG_CPU_SUP_INTEL
+	sched->state.used[0] = cpuc->intel_ctrl_guest_mask;
+#endif
 
 	for (idx = 0; idx < num; idx++) {
 		if (constraints[idx]->weight == wmin)
@@ -2386,6 +2390,39 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 	}
 }
 
+#ifdef CONFIG_CPU_SUP_INTEL
+/**
+ * x86_perf_mask_perf_counters - mask perf counters
+ * @mask: the bitmask of counters
+ *
+ * Mask the perf counters that are not available to be used by the perf core.
+ * If the counter to be masked has been assigned, it will be taken back and
+ * then the perf core will re-assign usable counters to its events.
+ *
+ * This can be used by a component outside the perf core to reserve counters.
+ * For example, a hypervisor uses it to reserve counters for a guest to use,
+ * and later return the counters by another call with the related bits cleared.
+ */
+void x86_perf_mask_perf_counters(u64 mask)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/*
+	 * If the counter happens to be used by a host event, take it back
+	 * first, and then restart the pmu after mask that counter as being
+	 * reserved.
+	 */
+	if (mask & cpuc->intel_ctrl_host_mask) {
+		perf_pmu_disable(&pmu);
+		cpuc->intel_ctrl_guest_mask = mask;
+		perf_pmu_enable(&pmu);
+	} else {
+		cpuc->intel_ctrl_guest_mask = mask;
+	}
+}
+EXPORT_SYMBOL_GPL(x86_perf_mask_perf_counters);
+#endif
+
 static inline int valid_user_frame(const void __user *fp, unsigned long size)
 {
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8bdf749..5b4463e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -297,6 +297,7 @@ static inline void perf_check_microcode(void) { }
 
 #ifdef CONFIG_CPU_SUP_INTEL
 extern void intel_pt_handle_vmx(int on);
+extern void x86_perf_mask_perf_counters(u64 mask);
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
Add x86_perf_mask_perf_counters to reserve counters from the host perf
subsystem. The masked counters will not be assigned to any host perf
events. This can be used by the hypervisor to reserve perf counters for
a guest to use.

This function is currently supported on Intel CPUs only, but put in x86
perf core because the counter assignment is implemented here and we need
to re-enable the pmu which is defined in the x86 perf core in the case
that a counter to be masked happens to be used by the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h |  1 +
 2 files changed, 38 insertions(+)
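A minimal sketch of how a hypervisor-side caller (for example, KVM's vPMU
code) might use the new export as described above; only
x86_perf_mask_perf_counters() comes from this patch, while the function
names and the mask bookkeeping are assumptions for illustration:

/*
 * Hypothetical caller of the new export.  Only x86_perf_mask_perf_counters()
 * is from this patch; everything else is illustrative.
 */
#include <linux/types.h>
#include <asm/perf_event.h>

static u64 vpmu_guest_counters;	/* GP counter bits handed to the guest */

static void vpmu_reserve_counters(u64 mask)
{
	vpmu_guest_counters = mask;
	/* take these counters away from host perf scheduling */
	x86_perf_mask_perf_counters(vpmu_guest_counters);
}

static void vpmu_release_counters(void)
{
	vpmu_guest_counters = 0;
	/* clearing the bits gives the counters back to the host */
	x86_perf_mask_perf_counters(0);
}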