[v3,4/5] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

Message ID 1537437959-8751-5-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived
Series: Guest LBR Enabling

Commit Message

Wang, Wei W Sept. 20, 2018, 10:05 a.m. UTC
From: Like Xu <like.xu@intel.com>

This patch adds support for enabling and disabling host-side saving and
restoring of the guest LBR stack on vCPU switching. To enable that, the
host creates a perf event for the vCPU, with the event attributes set
to user callstack mode LBR, so that all the conditions are met in the
host perf subsystem to save the LBR stack on thread switching.

The host-side LBR perf event is created only for the purpose of saving
and restoring the LBR stack. There is no need to enable the LBR
functionality for this perf event, because the feature is actually used
by the vCPU.

So a guest_lbr boolean control is added to cpuc to indicate whether the
LBR perf event was created for the guest. When the perf subsystem handles
this event (e.g. LBR enable or reading the LBR stack on a PMI) and finds
it is for the guest, it simply returns, because all we need from this perf
event is context-switch support for the LBR stack.

Signed-off-by: Like Xu <like.xu@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/lbr.c     | 10 +++++++---
 arch/x86/events/perf_event.h    |  1 +
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.h              |  3 +++
 arch/x86/kvm/pmu_intel.c        | 41 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 53 insertions(+), 3 deletions(-)
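
To illustrate how the two new APIs are meant to be used, here is a minimal
sketch of a possible caller; the actual wiring is done elsewhere in this
series, and the caller names below are purely illustrative:

	/*
	 * Illustrative sketch only: create the host-side LBR event when the
	 * guest starts using the LBR, and release it when it stops. The real
	 * call sites are added by other patches in this series.
	 */
	static void vcpu_guest_lbr_start(struct kvm_vcpu *vcpu)
	{
		if (intel_pmu_enable_save_guest_lbr(vcpu))
			pr_warn("guest LBR stack will not be saved across vCPU switches\n");
	}

	static void vcpu_guest_lbr_stop(struct kvm_vcpu *vcpu)
	{
		intel_pmu_disable_save_guest_lbr(vcpu);
	}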

Comments

Peter Zijlstra Sept. 20, 2018, 12:17 p.m. UTC | #1
On Thu, Sep 20, 2018 at 06:05:58PM +0800, Wei Wang wrote:
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 1562863..a91fdef 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -223,6 +223,7 @@ struct cpu_hw_events {
>  	 */
>  	u64				intel_ctrl_guest_mask;
>  	u64				intel_ctrl_host_mask;
> +	bool				guest_lbr;
>  	struct perf_guest_switch_msr	guest_switch_msrs[X86_PMC_IDX_MAX];
>  
>  	/*

Please, no bools in aggregate types.

Either use a bitfield or u8; there are more LBR-related bools, so convert
and group the lot to avoid holes.
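
A rough sketch of the suggested layout (only guest_lbr is from this patch;
which other flags to convert and group alongside it is left open here):

	struct cpu_hw_events {
		...
		u64				intel_ctrl_guest_mask;
		u64				intel_ctrl_host_mask;
		/*
		 * Suggested: a u8/bitfield instead of a bare bool, with the
		 * other LBR-related flags converted and grouped next to it
		 * to avoid padding holes.
		 */
		u8				guest_lbr : 1;
		struct perf_guest_switch_msr	guest_switch_msrs[X86_PMC_IDX_MAX];
		...
	};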
Peter Zijlstra Sept. 20, 2018, 12:32 p.m. UTC | #2
On Thu, Sep 20, 2018 at 06:05:58PM +0800, Wei Wang wrote:

> diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
> index 5ab4a36..97a29d7 100644
> --- a/arch/x86/kvm/pmu_intel.c
> +++ b/arch/x86/kvm/pmu_intel.c
> @@ -342,6 +342,47 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>  		pmu->global_ovf_ctrl = 0;
>  }
>  
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct perf_event *event;
> +	struct perf_event_attr attr = {
> +		.type = PERF_TYPE_RAW,
> +		.size = sizeof(attr),

Bit yuck to imply .config = 0, like that.

> +		.pinned = true,

Note that this can still result in failing to schedule the event: you
create a pinned task event, but a pinned CPU event takes precedence and
can claim the LBR.

How do you deal with that situation, where the LBR is in use by a host
event?

> +		.exclude_host = true,

Didn't you mean to use .exclude_guest? You don't want this thing to run
when the guest is running, right?

But I still don't like this hack much; what happens if userspace sets
that bit? You seem to have forgotten to combine it with PF_VCPU or
whatever is the appropriate bit to check.

Similarly, how does the existing exclude_{guest,host} crud work?

> +		.sample_type = PERF_SAMPLE_BRANCH_STACK,

What's the point of setting .sample_type if !.sample_period?

> +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> +				      PERF_SAMPLE_BRANCH_USER |
> +				      PERF_SAMPLE_BRANCH_KERNEL,
> +	};

I think this function wants a comment to explain what it's doing and why,
because if someone stumbles over this code in a few months nobody will
remember those things.

> +
> +	if (pmu->guest_lbr_event)
> +		return 0;
> +
> +	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
> +						 NULL);
> +	if (IS_ERR(event)) {
> +		pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
> +		return -ENOENT;
> +	}
> +	pmu->guest_lbr_event = event;
> +
> +	return 0;
> +}
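
For illustration only (not part of the posted patch), the kind of comment
being asked for could restate the rationale from the commit message, e.g.:

	/*
	 * Create a dummy, pinned LBR event for the vCPU thread. The event is
	 * not meant to count or sample anything on the host; its only purpose
	 * is to satisfy the perf core's conditions (user callstack mode LBR)
	 * so that the guest's LBR stack is saved and restored across vCPU
	 * (thread) switches.
	 */
	int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)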
Andi Kleen Sept. 20, 2018, 3:30 p.m. UTC | #3
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct perf_event *event;
> +	struct perf_event_attr attr = {
> +		.type = PERF_TYPE_RAW,
> +		.size = sizeof(attr),
> +		.pinned = true,
> +		.exclude_host = true,
> +		.sample_type = PERF_SAMPLE_BRANCH_STACK,
> +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> +				      PERF_SAMPLE_BRANCH_USER |
> +				      PERF_SAMPLE_BRANCH_KERNEL,

I think that will allocate an extra perfmon counter, right? 

So if the guest wants to use all perfmon counters, we would start to
multiplex, which would be bad.

Would need to fix perf core to not allocate a perfmon counter in
this case, if no period and no event count is requested.

-Andi
Peter Zijlstra Sept. 20, 2018, 4:24 p.m. UTC | #4
On Thu, Sep 20, 2018 at 08:30:35AM -0700, Andi Kleen wrote:
> > +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +	struct perf_event *event;
> > +	struct perf_event_attr attr = {
> > +		.type = PERF_TYPE_RAW,
> > +		.size = sizeof(attr),
> > +		.pinned = true,
> > +		.exclude_host = true,
> > +		.sample_type = PERF_SAMPLE_BRANCH_STACK,
> > +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> > +				      PERF_SAMPLE_BRANCH_USER |
> > +				      PERF_SAMPLE_BRANCH_KERNEL,
> 
> I think that will allocate an extra perfmon counter, right? 

I thought the same too, but I think the exclude_host/guest (whichever
is the right one) makes that work for regular counters too.

That code is a wee bit magical and I didn't take the time to reverse
engineer that. It most certainly needs a comment.
Wang, Wei W Sept. 27, 2018, 10:05 a.m. UTC | #5
On Thursday, September 20, 2018 11:31 PM, Andi Kleen wrote:
> > +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu) {
> > +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +	struct perf_event *event;
> > +	struct perf_event_attr attr = {
> > +		.type = PERF_TYPE_RAW,
> > +		.size = sizeof(attr),
> > +		.pinned = true,
> > +		.exclude_host = true,
> > +		.sample_type = PERF_SAMPLE_BRANCH_STACK,
> > +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> > +				      PERF_SAMPLE_BRANCH_USER |
> > +				      PERF_SAMPLE_BRANCH_KERNEL,


Sorry for my late response (I was digging into some of the details).


> I think that will allocate an extra perfmon counter, right?
> 
> So if the guest wants to use all perfmon counters we would start to multiplex,
> which would be bad

Right. The host side counter allocation is not necessary.


> Would need to fix perf core to not allocate a perfmon counter in this case, if
> no period and no event count is requested.
> 

If no period (i.e. .sample_period = 0), the current code still uses one counter with the period set to the max period.
If we change that to do no counter allocation, would that break any of the existing usages? (I didn't find the reason why the existing code doesn't avoid that allocation for a non-sampling event.)

Best,
Wei
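
For reference, the behaviour Wei describes comes from x86_setup_perfctr()
in arch/x86/events/core.c, which (roughly, as of this series) falls back to
the maximum period for non-sampling events instead of skipping the counter:

	if (!is_sampling_event(event)) {
		hwc->sample_period = x86_pmu.max_period;
		hwc->last_period = hwc->sample_period;
		local64_set(&hwc->period_left, hwc->sample_period);
	}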
Wang, Wei W Sept. 27, 2018, 10:10 a.m. UTC | #6
On Friday, September 21, 2018 12:24 AM, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 08:30:35AM -0700, Andi Kleen wrote:
> > > +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu) {
> > > +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > > +	struct perf_event *event;
> > > +	struct perf_event_attr attr = {
> > > +		.type = PERF_TYPE_RAW,
> > > +		.size = sizeof(attr),
> > > +		.pinned = true,
> > > +		.exclude_host = true,
> > > +		.sample_type = PERF_SAMPLE_BRANCH_STACK,
> > > +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> > > +				      PERF_SAMPLE_BRANCH_USER |
> > > +				      PERF_SAMPLE_BRANCH_KERNEL,
> >
> > I think that will allocate an extra perfmon counter, right?
> 
> I thought the same too, but I think the exclude_host/guest, whichever is the
> right one makes that work for regular counters too.

Sorry for being late.

I'm not sure exclude_host/guest would be suitable. For example, if the guest wants to use a perf counter, the host creates a perf event with "exclude_host=true" so that the counter does not count on the host; and "exclude_guest=true" is a flag telling the perf core that the counter should not count while the guest runs.

What would you think if we added a new flag (e.g. .force_no_counters) to the perf core to indicate that no perf counter should be allocated?

> That code is a wee bit magical and I didn't take the time to reverse engineer
> that. It most certainly needs a comment.

No problem. I will add more comments in the next version.

Best,
Wei
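
A sketch of what the proposed flag could look like from the KVM side; note
that .force_no_counters is only the hypothetical name suggested above, does
not exist in the perf ABI, and would still need perf-core support:

	struct perf_event_attr attr = {
		.type = PERF_TYPE_RAW,
		.size = sizeof(attr),
		.pinned = true,
		.exclude_host = true,
		/* hypothetical flag: do not allocate a hardware counter */
		.force_no_counters = true,
		.sample_type = PERF_SAMPLE_BRANCH_STACK,
		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
				      PERF_SAMPLE_BRANCH_USER |
				      PERF_SAMPLE_BRANCH_KERNEL,
	};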
Wang, Wei W Sept. 27, 2018, 1:45 p.m. UTC | #7
On Thursday, September 20, 2018 8:32 PM, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 06:05:58PM +0800, Wei Wang wrote:
> How do you deal with that situation, where the LBR is in use by a host event?

I guess you were referring to this use case: "perf record -b qemu-system-x86_64..", where the qemu process also needs lbr.

I also discussed this with Andi in v1, where we switched the LBR state on VMX transitions. That could achieve the above, but it adds too much overhead to VMX transitions, which are frequent. Considering the above usage is not common (it's more common to just have a task in the guest use the LBR feature), we decided not to take that usage into account and to switch the LBR state only on vCPU switches.

Best,
Wei

Patch

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index c81f160..915fcc3 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,9 @@  void intel_pmu_lbr_add(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	if (event->attr.exclude_host)
+		cpuc->guest_lbr = true;
+
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -507,6 +510,7 @@  void intel_pmu_lbr_del(struct perf_event *event)
 		task_ctx->lbr_callstack_users--;
 	}
 
+	cpuc->guest_lbr = false;
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 	perf_sched_cb_dec(event->ctx->pmu);
@@ -516,7 +520,7 @@  void intel_pmu_lbr_enable_all(bool pmi)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	if (cpuc->lbr_users && !cpuc->guest_lbr)
 		__intel_pmu_lbr_enable(pmi);
 }
 
@@ -524,7 +528,7 @@  void intel_pmu_lbr_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	if (cpuc->lbr_users && !cpuc->guest_lbr)
 		__intel_pmu_lbr_disable();
 }
 
@@ -658,7 +662,7 @@  void intel_pmu_lbr_read(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (!cpuc->lbr_users)
+	if (!cpuc->lbr_users || cpuc->guest_lbr)
 		return;
 
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1562863..a91fdef 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -223,6 +223,7 @@  struct cpu_hw_events {
 	 */
 	u64				intel_ctrl_guest_mask;
 	u64				intel_ctrl_host_mask;
+	bool				guest_lbr;
 	struct perf_guest_switch_msr	guest_switch_msrs[X86_PMC_IDX_MAX];
 
 	/*
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a46e31..fdcac01 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -432,6 +432,7 @@  struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+	struct perf_event *guest_lbr_event;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ba8898e..e872aed 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -120,6 +120,9 @@  void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
+extern int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu);
+extern void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu);
+
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 5ab4a36..97a29d7 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -342,6 +342,47 @@  static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 		pmu->global_ovf_ctrl = 0;
 }
 
+int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event;
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_RAW,
+		.size = sizeof(attr),
+		.pinned = true,
+		.exclude_host = true,
+		.sample_type = PERF_SAMPLE_BRANCH_STACK,
+		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+				      PERF_SAMPLE_BRANCH_USER |
+				      PERF_SAMPLE_BRANCH_KERNEL,
+	};
+
+	if (pmu->guest_lbr_event)
+		return 0;
+
+	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
+						 NULL);
+	if (IS_ERR(event)) {
+		pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
+		return -ENOENT;
+	}
+	pmu->guest_lbr_event = event;
+
+	return 0;
+}
+
+void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event = pmu->guest_lbr_event;
+
+	if (!event)
+		return;
+
+	perf_event_release_kernel(event);
+	pmu->guest_lbr_event = NULL;
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
 	.find_arch_event = intel_find_arch_event,
 	.find_fixed_event = intel_find_fixed_event,