
[v2,07/54] perf: Add generic exclude_guest support

Message ID 20240506053020.3911940-8-mizhang@google.com (mailing list archive)
State New
Series Mediated Passthrough vPMU 2.0 for x86

Commit Message

Mingwei Zhang May 6, 2024, 5:29 a.m. UTC
From: Kan Liang <kan.liang@linux.intel.com>

Currently, perf doesn't explicitly schedule out all exclude_guest events
while a guest is running. That is not a problem for the existing emulated
vPMU, because perf owns all the PMU counters: it can mask the counter
assigned to an exclude_guest event while a guest is running (the Intel
way), or set the corresponding HOSTONLY bit in the EVENTSEL (the AMD
way). Either way, the counter doesn't count while a guest is running.

However, neither approach works with the passthrough vPMU being
introduced. A guest owns all the PMU counters while it is running, so
the host must not mask any counter: it may be in use by the guest, and
its EVENTSEL may be overwritten.

Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume counting when
exiting the guest.

Expose two interfaces to KVM. KVM should call them to notify perf when
entering and exiting a guest (see the usage sketch below).

It's also possible that an exclude_guest event is created while a guest
is running. Such a new event should not be scheduled in either.
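
For illustration only (not part of this patch), the expected call sites
in a hypervisor's run loop would look roughly like the sketch below.
Only perf_guest_enter()/perf_guest_exit() come from this patch; the
surrounding function and enter_guest() are placeholders.

/*
 * Illustrative sketch, not real KVM code: only perf_guest_enter() and
 * perf_guest_exit() are introduced by this patch.
 */
static void vcpu_run_once(struct kvm_vcpu *vcpu)
{
	local_irq_disable();

	/* Schedule out exclude_guest events, freeing the PMU for the guest. */
	perf_guest_enter();

	enter_guest(vcpu);	/* placeholder for the actual VM entry */

	/* Reclaim the PMU and resume host exclude_guest counting. */
	perf_guest_exit();

	local_irq_enable();
}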

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h |   4 ++
 kernel/events/core.c       | 104 +++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

Comments

Peter Zijlstra May 7, 2024, 8:58 a.m. UTC | #1
On Mon, May 06, 2024 at 05:29:32AM +0000, Mingwei Zhang wrote:

> @@ -5791,6 +5801,100 @@ void perf_put_mediated_pmu(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>  
> +static void perf_sched_out_exclude_guest(struct perf_event_context *ctx)
> +{
> +	struct perf_event_pmu_context *pmu_ctx;
> +
> +	update_context_time(ctx);
> +	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> +		struct perf_event *event, *tmp;
> +		struct pmu *pmu = pmu_ctx->pmu;
> +
> +		if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
> +			continue;
> +
> +		perf_pmu_disable(pmu);
> +
> +		/*
> +		 * All active events must be exclude_guest events.
> +		 * See perf_get_mediated_pmu().
> +		 * Unconditionally remove all active events.
> +		 */
> +		list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
> +			group_sched_out(event, pmu_ctx->ctx);
> +
> +		list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
> +			group_sched_out(event, pmu_ctx->ctx);
> +
> +		pmu_ctx->rotate_necessary = 0;
> +
> +		perf_pmu_enable(pmu);
> +	}
> +}
> +
> +/* When entering a guest, schedule out all exclude_guest events. */
> +void perf_guest_enter(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
> +		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +		return;
> +	}
> +
> +	perf_sched_out_exclude_guest(&cpuctx->ctx);
> +	if (cpuctx->task_ctx)
> +		perf_sched_out_exclude_guest(cpuctx->task_ctx);
> +
> +	__this_cpu_write(perf_in_guest, true);
> +
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +
> +static void perf_sched_in_exclude_guest(struct perf_event_context *ctx)
> +{
> +	struct perf_event_pmu_context *pmu_ctx;
> +
> +	update_context_time(ctx);
> +	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> +		struct pmu *pmu = pmu_ctx->pmu;
> +
> +		if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
> +			continue;
> +
> +		perf_pmu_disable(pmu);
> +		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
> +		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
> +		perf_pmu_enable(pmu);
> +	}
> +}
> +
> +void perf_guest_exit(void)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
> +		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +		return;
> +	}
> +
> +	__this_cpu_write(perf_in_guest, false);
> +
> +	perf_sched_in_exclude_guest(&cpuctx->ctx);
> +	if (cpuctx->task_ctx)
> +		perf_sched_in_exclude_guest(cpuctx->task_ctx);
> +
> +	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}

Bah, this is a ton of copy-paste from the normal scheduling code with
random changes. Why ?

Why can't this use ctx_sched_{in,out}() ? Surely the whole
CAP_PASSTHROUGH thing is but a flag away.
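
To make that suggestion concrete, a rough and entirely hypothetical
sketch of the flag-based direction could look like the following, where
EVENT_GUEST is an assumed new event_type flag and the ctx_sched_out()
signature is simplified; perf_guest_exit() would mirror it with
ctx_sched_in():

/*
 * Hypothetical sketch: assumes ctx_sched_out() is taught an EVENT_GUEST
 * flag that selects only events of PMUs with
 * PERF_PMU_CAP_PASSTHROUGH_VPMU.
 */
void perf_guest_enter(void)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);

	lockdep_assert_irqs_disabled();

	perf_ctx_lock(cpuctx, cpuctx->task_ctx);

	/* Reuse the common scheduling path instead of open-coding it. */
	ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
	if (cpuctx->task_ctx)
		ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);

	__this_cpu_write(perf_in_guest, true);

	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}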

Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dd4920bf3d1b..acf16676401a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1734,6 +1734,8 @@  extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 extern int perf_get_mediated_pmu(void);
 extern void perf_put_mediated_pmu(void);
+void perf_guest_enter(void);
+void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1826,6 +1828,8 @@  static inline int perf_get_mediated_pmu(void)
 }
 
 static inline void perf_put_mediated_pmu(void)			{ }
+static inline void perf_guest_enter(void)			{ }
+static inline void perf_guest_exit(void)			{ }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 701b622c670e..4c6daf5cc923 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -406,6 +406,7 @@  static atomic_t nr_include_guest_events __read_mostly;
 
 static refcount_t nr_mediated_pmu_vms = REFCOUNT_INIT(0);
 static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+static DEFINE_PER_CPU(bool, perf_in_guest);
 
 /* !exclude_guest system wide event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
 static inline bool is_include_guest_event(struct perf_event *event)
@@ -3854,6 +3855,15 @@  static int merge_sched_in(struct perf_event *event, void *data)
 	if (!event_filter_match(event))
 		return 0;
 
+	/*
+	 * Don't schedule in any exclude_guest events of PMU with
+	 * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
+	 */
+	if (__this_cpu_read(perf_in_guest) &&
+	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
+	    event->attr.exclude_guest)
+		return 0;
+
 	if (group_can_go_on(event, *can_add_hw)) {
 		if (!group_sched_in(event, ctx))
 			list_add_tail(&event->active_list, get_event_list(event));
@@ -5791,6 +5801,100 @@  void perf_put_mediated_pmu(void)
 }
 EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
 
+static void perf_sched_out_exclude_guest(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	update_context_time(ctx);
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		struct perf_event *event, *tmp;
+		struct pmu *pmu = pmu_ctx->pmu;
+
+		if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+			continue;
+
+		perf_pmu_disable(pmu);
+
+		/*
+		 * All active events must be exclude_guest events.
+		 * See perf_get_mediated_pmu().
+		 * Unconditionally remove all active events.
+		 */
+		list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
+			group_sched_out(event, pmu_ctx->ctx);
+
+		list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
+			group_sched_out(event, pmu_ctx->ctx);
+
+		pmu_ctx->rotate_necessary = 0;
+
+		perf_pmu_enable(pmu);
+	}
+}
+
+/* When entering a guest, schedule out all exclude_guest events. */
+void perf_guest_enter(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		return;
+	}
+
+	perf_sched_out_exclude_guest(&cpuctx->ctx);
+	if (cpuctx->task_ctx)
+		perf_sched_out_exclude_guest(cpuctx->task_ctx);
+
+	__this_cpu_write(perf_in_guest, true);
+
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
+static void perf_sched_in_exclude_guest(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	update_context_time(ctx);
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		struct pmu *pmu = pmu_ctx->pmu;
+
+		if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+			continue;
+
+		perf_pmu_disable(pmu);
+		pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
+		pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
+		perf_pmu_enable(pmu);
+	}
+}
+
+void perf_guest_exit(void)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+	lockdep_assert_irqs_disabled();
+
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		return;
+	}
+
+	__this_cpu_write(perf_in_guest, false);
+
+	perf_sched_in_exclude_guest(&cpuctx->ctx);
+	if (cpuctx->task_ctx)
+		perf_sched_in_exclude_guest(cpuctx->task_ctx);
+
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block