diff mbox series

[v7,08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

Message ID 1562548999-37095-9-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived
Headers show
Series Guest LBR Enabling | expand

Commit Message

Wang, Wei W July 8, 2019, 1:23 a.m. UTC
From: Like Xu <like.xu@intel.com>

This patch adds support to enable/disable the host side save/restore
for the guest lbr stack on vCPU switching. To enable that, the host
creates a perf event for the vCPU, and the event attributes are set
to the user callstack mode lbr so that all the conditions are meet in
the host perf subsystem to save the lbr stack on task switching.

The host side lbr perf event are created only for the purpose of saving
and restoring the lbr stack. There is no need to enable the lbr
functionality for this perf event, because the feature is essentially
used in the vCPU. So perf_event_create is invoked with need_counter=false
to get no counter assigned for the perf event.

The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is
used by the vCPU only for context switching. When the perf subsystem
handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
it's non-zero, it simply returns.

Signed-off-by: Like Xu <like.xu@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/events/intel/lbr.c     | 13 +++++++--
 arch/x86/events/perf_event.h    |  1 +
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.h              |  3 ++
 arch/x86/kvm/vmx/pmu_intel.c    | 61 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 76 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra July 8, 2019, 2:48 p.m. UTC | #1
On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:
> From: Like Xu <like.xu@intel.com>
> 
> This patch adds support to enable/disable the host side save/restore

This patch should be disqualified on Changelog alone...

  Documentation/process/submitting-patches.rst:instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy

> for the guest lbr stack on vCPU switching. To enable that, the host
> creates a perf event for the vCPU, and the event attributes are set
> to the user callstack mode lbr so that all the conditions are meet in
> the host perf subsystem to save the lbr stack on task switching.
> 
> The host side lbr perf event are created only for the purpose of saving
> and restoring the lbr stack. There is no need to enable the lbr
> functionality for this perf event, because the feature is essentially
> used in the vCPU. So perf_event_create is invoked with need_counter=false
> to get no counter assigned for the perf event.
> 
> The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is
> used by the vCPU only for context switching. When the perf subsystem
> handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
> it's non-zero, it simply returns.

*WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
this?

Many of these patches don't explain why things are done; that's a
problem.
Wang, Wei W July 9, 2019, 3:04 a.m. UTC | #2
On 07/08/2019 10:48 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:
>> From: Like Xu <like.xu@intel.com>
>>
>> This patch adds support to enable/disable the host side save/restore
> This patch should be disqualified on Changelog alone...
>
>    Documentation/process/submitting-patches.rst:instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy

OK, we will discard "This patch" in the description:

To enable/disable the host side save/restore for the guest lbr stack on
vCPU switching, the host creates a perf event for the vCPU..

>> for the guest lbr stack on vCPU switching. To enable that, the host
>> creates a perf event for the vCPU, and the event attributes are set
>> to the user callstack mode lbr so that all the conditions are meet in
>> the host perf subsystem to save the lbr stack on task switching.
>>
>> The host side lbr perf event are created only for the purpose of saving
>> and restoring the lbr stack. There is no need to enable the lbr
>> functionality for this perf event, because the feature is essentially
>> used in the vCPU. So perf_event_create is invoked with need_counter=false
>> to get no counter assigned for the perf event.
>>
>> The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is
>> used by the vCPU only for context switching. When the perf subsystem
>> handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
>> it's non-zero, it simply returns.
> *WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
> this?

Because the VMX transition is much more frequent than the vCPU switching.
On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, this
would be too large for the frequent VMX transitions.

LBR state is saved when vCPU is scheduled out to ensure that this vCPU's
LBR data doesn't get lost (as another vCPU or host thread that is 
scheduled in
may use LBR)


> Many of these patches don't explain why things are done; that's a
> problem.

We'll improve, please help when you find anywhere isn't clear to you, 
thanks.

Best,
Wei
Peter Zijlstra July 9, 2019, 9:39 a.m. UTC | #3
On Tue, Jul 09, 2019 at 11:04:21AM +0800, Wei Wang wrote:
> On 07/08/2019 10:48 PM, Peter Zijlstra wrote:

> > *WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
> > this?
> 
> Because the VMX transition is much more frequent than the vCPU switching.
> On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, this
> would be too large for the frequent VMX transitions.
> 
> LBR state is saved when vCPU is scheduled out to ensure that this
> vCPU's LBR data doesn't get lost (as another vCPU or host thread that
> is scheduled in may use LBR)

But VMENTER/VMEXIT still have to enable/disable the LBR, right?
Otherwise the host will pollute LBR contents. And you then rely on this
'fake' event to ensure the host doesn't use LBR when the VCPU is
running.

But what about the counter scheduling rules; what happens when a CPU
event claims the LBR before the task event can claim it? CPU events have
precedence over task events.

Then your VCPU will clobber host state; which is BAD.

I'm missing all these details in the Changelogs. Please describe the
whole setup and explain why this approach.
Wang, Wei W July 9, 2019, 11:34 a.m. UTC | #4
On 07/09/2019 05:39 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2019 at 11:04:21AM +0800, Wei Wang wrote:
>> On 07/08/2019 10:48 PM, Peter Zijlstra wrote:
>>> *WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
>>> this?
>> Because the VMX transition is much more frequent than the vCPU switching.
>> On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, this
>> would be too large for the frequent VMX transitions.
>>
>> LBR state is saved when vCPU is scheduled out to ensure that this
>> vCPU's LBR data doesn't get lost (as another vCPU or host thread that
>> is scheduled in may use LBR)
> But VMENTER/VMEXIT still have to enable/disable the LBR, right?
> Otherwise the host will pollute LBR contents. And you then rely on this
> 'fake' event to ensure the host doesn't use LBR when the VCPU is
> running.

Yes, only the debugctl msr is save/restore on vmx tranisions.


>
> But what about the counter scheduling rules;

The counter is emulated independent of the lbr emulation.

Here is the background reason:

The direction we are going is the architectural emulation, where the 
features
are emulated based on the hardware behavior described in the spec. So 
the lbr
emulation path only offers the lbr feature to the guest (no counters 
associated, as
the lbr feature doesn't have a counter essentially).

If the above isn't clear, please see this example: the guest could run 
any software
to use the lbr feature (non-perf or non-linux, or even a testing kernel 
module to try
lbr for their own purpose), and it could choose to use a regular timer 
to do sampling.
If the lbr emulation takes a counter to generate a PMI to the guest to 
do sampling,
that pmi isn't expected from the guest perspective.

So the counter scheduling isn't considered by the lbr emulation here, it 
is considered
by the counter emulation. If the guest needs a counter, it configures 
the related msr,
which traps to KVM, and the counter emulation has it own emulation path
(e.g. reprogram_gp_counter which is called when the guest writes to the 
emulated
eventsel msr).


> what happens when a CPU
> event claims the LBR before the task event can claim it? CPU events have
> precedence over task events.

I think the precedence (cpu pined and task pined) is for the counter 
multiplexing,
right?

For the lbr feature, could we thought of it as first come, first served?
For example, if we have 2 host threads who want to use lbr at the same time,
I think one of them would simply fail to use.

So if guest first gets the lbr, host wouldn't take over unless some 
userspace
command (we added to QEMU) is executed to have the vCPU actively
stop using lbr.


>
> I'm missing all these details in the Changelogs. Please describe the
> whole setup and explain why this approach.

OK, just shared some important background above.
I'll see if any more important details missed.

Best,
Wei
Peter Zijlstra July 9, 2019, 11:45 a.m. UTC | #5
On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct perf_event *event;
> +
> +	/*
> +	 * The main purpose of this perf event is to have the host perf core
> +	 * help save/restore the guest lbr stack on vcpu switching. There is
> +	 * no perf counters allocated for the event.
> +	 *
> +	 * About the attr:
> +	 * exclude_guest: set to true to indicate that the event runs on the
> +	 *                host only.

That's a lie; it _never_ runs. You specifically don't want the host to
enable LBR so as not to corrupt the guest state.

> +	 * pinned:        set to false, so that the FLEXIBLE events will not
> +	 *                be rescheduled for this event which actually doesn't
> +	 *                need a perf counter.

Unparsable gibberish. Specifically by making it flexible it is
susceptible to rotation and there's no guarantee it will actually get
scheduled.

> +	 * config:        Actually this field won't be used by the perf core
> +	 *                as this event doesn't have a perf counter.
> +	 * sample_period: Same as above.

If it's unused; why do we need to set it at all?

> +	 * sample_type:   tells the perf core that it is an lbr event.
> +	 * branch_sample_type: tells the perf core that the lbr event works in
> +	 *                the user callstack mode so that the lbr stack will be
> +	 *                saved/restored on vCPU switching.

Again; doesn't make sense. What does the user part have to do with
save/restore? What happens when this vcpu thread drops to userspace for
an assist?

> +	 */
> +	struct perf_event_attr attr = {
> +		.type = PERF_TYPE_RAW,
> +		.size = sizeof(attr),
> +		.exclude_guest = true,
> +		.pinned = false,
> +		.config = 0,
> +		.sample_period = 0,
> +		.sample_type = PERF_SAMPLE_BRANCH_STACK,
> +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> +				      PERF_SAMPLE_BRANCH_USER,
> +	};
> +
> +	if (pmu->vcpu_lbr_event)
> +		return 0;
> +
> +	event = perf_event_create(&attr, -1, current, NULL, NULL, false);
> +	if (IS_ERR(event)) {
> +		pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
> +		return -ENOENT;
> +	}
> +	pmu->vcpu_lbr_event = event;
> +
> +	return 0;
> +}
Peter Zijlstra July 9, 2019, 12:19 p.m. UTC | #6
On Tue, Jul 09, 2019 at 07:34:26PM +0800, Wei Wang wrote:

> > But what about the counter scheduling rules;
> 
> The counter is emulated independent of the lbr emulation.

> > what happens when a CPU
> > event claims the LBR before the task event can claim it? CPU events have
> > precedence over task events.
> 
> I think the precedence (cpu pined and task pined) is for the counter
> multiplexing,
> right?

No; for all scheduling. The order is:

  CPU-pinned
  Task-pinned
  CPU-flexible
  Task-flexible

The way you created the event it would land in 'task-flexible', but even
if you make it task-pinned, a CPU (or CPU-pinned) event could claim the
LBR before your fake event.

> For the lbr feature, could we thought of it as first come, first served?
> For example, if we have 2 host threads who want to use lbr at the same time,
> I think one of them would simply fail to use.
>
> So if guest first gets the lbr, host wouldn't take over unless some
> userspace command (we added to QEMU) is executed to have the vCPU
> actively stop using lbr.

Doesn't work that way.

Say you start KVM with LBR emulation, it creates this task event, it
gets the LBR (nobody else wants it) and the guest works and starts using
the LBR.

Then the host creates a CPU LBR event and the vCPU suddenly gets denied
the LBR and the guest no longer functions correctly.

Or you should fail to VMENTER, in which case you starve the guest, but
at least it doesn't malfunction.
Wang, Wei W July 10, 2019, 8:19 a.m. UTC | #7
On 07/09/2019 08:19 PM, Peter Zijlstra wrote:
>
>> For the lbr feature, could we thought of it as first come, first served?
>> For example, if we have 2 host threads who want to use lbr at the same time,
>> I think one of them would simply fail to use.
>>
>> So if guest first gets the lbr, host wouldn't take over unless some
>> userspace command (we added to QEMU) is executed to have the vCPU
>> actively stop using lbr.
> Doesn't work that way.
>
> Say you start KVM with LBR emulation, it creates this task event, it
> gets the LBR (nobody else wants it) and the guest works and starts using
> the LBR.
>
> Then the host creates a CPU LBR event and the vCPU suddenly gets denied
> the LBR and the guest no longer functions correctly.
>
> Or you should fail to VMENTER, in which case you starve the guest, but
> at least it doesn't malfunction.
>

Ok, I think we could go with the first option for now, like this:

When there are other host threads starting to use lbr, host perf notifies
kvm to disable the guest use of lbr (add a notifier_block), which includes:
  - cancel passing through lbr msrs to guest
  - upon guest access to the lbr msr (will trap to KVM now), return #GP
    (or 0).


Best,
Wei
Wang, Wei W July 10, 2019, 8:21 a.m. UTC | #8
On 07/09/2019 07:45 PM, Peter Zijlstra wrote:
>
>> +	 * config:        Actually this field won't be used by the perf core
>> +	 *                as this event doesn't have a perf counter.
>> +	 * sample_period: Same as above.
> If it's unused; why do we need to set it at all?

OK, we'll remove the unused fields.

>
>> +	 * sample_type:   tells the perf core that it is an lbr event.
>> +	 * branch_sample_type: tells the perf core that the lbr event works in
>> +	 *                the user callstack mode so that the lbr stack will be
>> +	 *                saved/restored on vCPU switching.
> Again; doesn't make sense. What does the user part have to do with
> save/restore? What happens when this vcpu thread drops to userspace for
> an assist?

This is a fake event which doesn't run the lbr on the host.
Returning to userspace doesn't need save/restore lbr, because userspace
wouldn't change those lbr msrs.

The event is created to save/restore lbr msrs on vcpu switching.
Host perf only do this save/restore for "user callstack mode" lbr event, 
so we
construct the event to be "user callstack mode" to reuse the host lbr 
save/restore.

An alternative method is to have KVM vcpu sched_in/out callback
save/restore those msrs, which don't need to create this fake event.

Please let me know if you prefer the second method.

Best,
Wei
diff mbox series

Patch

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 784642a..118764b 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,9 @@  void intel_pmu_lbr_add(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	if (event->attr.exclude_guest && is_no_counter_event(event))
+		cpuc->vcpu_lbr = 1;
+
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -509,6 +512,9 @@  void intel_pmu_lbr_del(struct perf_event *event)
 		task_ctx->lbr_callstack_users--;
 	}
 
+	if (event->attr.exclude_guest && is_no_counter_event(event))
+		cpuc->vcpu_lbr = 0;
+
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
 		cpuc->lbr_pebs_users--;
 	cpuc->lbr_users--;
@@ -521,7 +527,7 @@  void intel_pmu_lbr_enable_all(bool pmi)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	if (cpuc->lbr_users && !cpuc->vcpu_lbr)
 		__intel_pmu_lbr_enable(pmi);
 }
 
@@ -529,7 +535,7 @@  void intel_pmu_lbr_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	if (cpuc->lbr_users && !cpuc->vcpu_lbr)
 		__intel_pmu_lbr_disable();
 }
 
@@ -669,7 +675,8 @@  void intel_pmu_lbr_read(void)
 	 * This could be smarter and actually check the event,
 	 * but this simple approach seems to work for now.
 	 */
-	if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+	if (!cpuc->lbr_users || cpuc->vcpu_lbr ||
+	    cpuc->lbr_users == cpuc->lbr_pebs_users)
 		return;
 
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 186c1c7..86605d1 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -238,6 +238,7 @@  struct cpu_hw_events {
 	/*
 	 * Intel LBR bits
 	 */
+	u8				vcpu_lbr;
 	int				lbr_users;
 	int				lbr_pebs_users;
 	struct perf_branch_stack	lbr_stack;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d80925..79e9c92 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -474,6 +474,7 @@  struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+	struct perf_event *vcpu_lbr_event;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7926b65..384a0b7 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -123,6 +123,9 @@  void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
+extern int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu);
+extern void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu);
+
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 09ae6ff..24a544e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -507,6 +507,67 @@  static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 		pmu->global_ovf_ctrl = 0;
 }
 
+int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event;
+
+	/*
+	 * The main purpose of this perf event is to have the host perf core
+	 * help save/restore the guest lbr stack on vcpu switching. There is
+	 * no perf counters allocated for the event.
+	 *
+	 * About the attr:
+	 * exclude_guest: set to true to indicate that the event runs on the
+	 *                host only.
+	 * pinned:        set to false, so that the FLEXIBLE events will not
+	 *                be rescheduled for this event which actually doesn't
+	 *                need a perf counter.
+	 * config:        Actually this field won't be used by the perf core
+	 *                as this event doesn't have a perf counter.
+	 * sample_period: Same as above.
+	 * sample_type:   tells the perf core that it is an lbr event.
+	 * branch_sample_type: tells the perf core that the lbr event works in
+	 *                the user callstack mode so that the lbr stack will be
+	 *                saved/restored on vCPU switching.
+	 */
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_RAW,
+		.size = sizeof(attr),
+		.exclude_guest = true,
+		.pinned = false,
+		.config = 0,
+		.sample_period = 0,
+		.sample_type = PERF_SAMPLE_BRANCH_STACK,
+		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+				      PERF_SAMPLE_BRANCH_USER,
+	};
+
+	if (pmu->vcpu_lbr_event)
+		return 0;
+
+	event = perf_event_create(&attr, -1, current, NULL, NULL, false);
+	if (IS_ERR(event)) {
+		pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
+		return -ENOENT;
+	}
+	pmu->vcpu_lbr_event = event;
+
+	return 0;
+}
+
+void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event = pmu->vcpu_lbr_event;
+
+	if (!event)
+		return;
+
+	perf_event_release_kernel(event);
+	pmu->vcpu_lbr_event = NULL;
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
 	.find_arch_event = intel_find_arch_event,
 	.find_fixed_event = intel_find_fixed_event,