
[RFC,v3,14/58] perf: Add switch_interrupt() interface

Message ID 20240801045907.4010984-15-mizhang@google.com (mailing list archive)
State New, archived
Series Mediated Passthrough vPMU 3.0 for x86

Commit Message

Mingwei Zhang Aug. 1, 2024, 4:58 a.m. UTC
From: Kan Liang <kan.liang@linux.intel.com>

There will be a dedicated interrupt vector for guests on some platforms,
e.g., Intel. Add an interface to switch the interrupt vector while
entering/exiting a guest.

When the PMI is switched to a new guest vector, the guest_lvtpc value needs
to be reflected onto the HW; e.g., when the guest clears the PMI mask bit,
the HW PMI mask bit should be cleared as well, so that PMIs can keep being
generated for the guest. Therefore, a guest_lvtpc parameter is added to
perf_guest_enter() and switch_interrupt().

In switch_interrupt(), the target pmu with the PASSTHROUGH cap has to be
found. Since only one passthrough pmu is supported, keep the implementation
simple by tracking that pmu in a global variable.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>

[Simplify the commit with removal of srcu lock/unlock since only one pmu is
supported.]

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 include/linux/perf_event.h |  9 +++++++--
 kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
 2 files changed, 41 insertions(+), 4 deletions(-)
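
A minimal sketch of how a hypervisor-side caller would be expected to use the
new interface (the wrapper function below is illustrative only and not part of
this series):

/*
 * Illustrative only: a mediated-vPMU path in the hypervisor could
 * bracket the guest run loop with the new calls. The function name is
 * made up for this sketch.
 */
static void mediated_pmu_guest_run(u32 guest_lvtpc)
{
	/* Schedule out exclude_guest events and switch the PMI vector. */
	perf_guest_enter(guest_lvtpc);

	/* ... VM-enter, run the guest, VM-exit ... */

	/* Switch the PMI vector back and schedule host events back in. */
	perf_guest_exit();
}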

Comments

Manali Shukla Sept. 19, 2024, 6:02 a.m. UTC | #1
On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> There will be a dedicated interrupt vector for guests on some platforms,
> e.g., Intel. Add an interface to switch the interrupt vector while
> entering/exiting a guest.
> 
> When PMI switch into a new guest vector, guest_lvtpc value need to be
> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
> bit should be cleared also, then PMI can be generated continuously
> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
> and switch_interrupt().
> 
> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
> be found. Since only one passthrough pmu is supported, we keep the
> implementation simply by tracking the pmu as a global variable.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> 
> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> supported.]
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |  9 +++++++--
>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 75773f9890cc..aeb08f78f539 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -541,6 +541,11 @@ struct pmu {
>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>  	 */
>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
> +
> +	/*
> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
> +	 */
> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>  };
>  
>  enum perf_addr_filter_action_t {
> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>  int perf_get_mediated_pmu(void);
>  void perf_put_mediated_pmu(void);
> -void perf_guest_enter(void);
> +void perf_guest_enter(u32 guest_lvtpc);
>  void perf_guest_exit(void);
>  #else /* !CONFIG_PERF_EVENTS: */
>  static inline void *
> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>  }
>  
>  static inline void perf_put_mediated_pmu(void)			{ }
> -static inline void perf_guest_enter(void)			{ }
> +static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
>  static inline void perf_guest_exit(void)			{ }
>  #endif
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 57ff737b922b..047ca5748ee2 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>  
>  static LIST_HEAD(pmus);
>  static DEFINE_MUTEX(pmus_lock);
> +static struct pmu *passthru_pmu;
>  static struct srcu_struct pmus_srcu;
>  static cpumask_var_t perf_online_mask;
>  static struct kmem_cache *perf_event_cache;
> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>  
> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> +{
> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> +	if (!passthru_pmu)
> +		return;
> +
> +	if (passthru_pmu->switch_interrupt &&
> +	    try_module_get(passthru_pmu->module)) {
> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> +		module_put(passthru_pmu->module);
> +	}
> +}
> +
>  /* When entering a guest, schedule out all exclude_guest events. */
> -void perf_guest_enter(void)
> +void perf_guest_enter(u32 guest_lvtpc)
>  {
>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>  
> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>  	}
>  
> +	perf_switch_interrupt(true, guest_lvtpc);
> +
>  	__this_cpu_write(perf_in_guest, true);
>  
>  unlock:
> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>  		goto unlock;
>  
> +	perf_switch_interrupt(false, 0);
> +
>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  	if (!pmu->event_idx)
>  		pmu->event_idx = perf_event_idx_default;
>  
> -	list_add_rcu(&pmu->entry, &pmus);
> +	/*
> +	 * Initialize passthru_pmu with the core pmu that has
> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> +	 */
> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> +		if (!passthru_pmu)
> +			passthru_pmu = pmu;
> +
> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> +			ret = -EINVAL;
> +			goto free_dev;
> +		}
> +	}


Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However, 
if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.

> +
> +	list_add_tail_rcu(&pmu->entry, &pmus);
>  	atomic_set(&pmu->exclusive_cnt, 0);
>  	ret = 0;
>  unlock:
Liang, Kan Sept. 19, 2024, 1 p.m. UTC | #2
On 2024-09-19 2:02 a.m., Manali Shukla wrote:
> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> There will be a dedicated interrupt vector for guests on some platforms,
>> e.g., Intel. Add an interface to switch the interrupt vector while
>> entering/exiting a guest.
>>
>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>> bit should be cleared also, then PMI can be generated continuously
>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>> and switch_interrupt().
>>
>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>> be found. Since only one passthrough pmu is supported, we keep the
>> implementation simply by tracking the pmu as a global variable.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>
>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>> supported.]
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |  9 +++++++--
>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 75773f9890cc..aeb08f78f539 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -541,6 +541,11 @@ struct pmu {
>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>  	 */
>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>> +
>> +	/*
>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>> +	 */
>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>  };
>>  
>>  enum perf_addr_filter_action_t {
>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>>  int perf_get_mediated_pmu(void);
>>  void perf_put_mediated_pmu(void);
>> -void perf_guest_enter(void);
>> +void perf_guest_enter(u32 guest_lvtpc);
>>  void perf_guest_exit(void);
>>  #else /* !CONFIG_PERF_EVENTS: */
>>  static inline void *
>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>>  }
>>  
>>  static inline void perf_put_mediated_pmu(void)			{ }
>> -static inline void perf_guest_enter(void)			{ }
>> +static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
>>  static inline void perf_guest_exit(void)			{ }
>>  #endif
>>  
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 57ff737b922b..047ca5748ee2 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>>  
>>  static LIST_HEAD(pmus);
>>  static DEFINE_MUTEX(pmus_lock);
>> +static struct pmu *passthru_pmu;
>>  static struct srcu_struct pmus_srcu;
>>  static cpumask_var_t perf_online_mask;
>>  static struct kmem_cache *perf_event_cache;
>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>  
>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>> +{
>> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>> +	if (!passthru_pmu)
>> +		return;
>> +
>> +	if (passthru_pmu->switch_interrupt &&
>> +	    try_module_get(passthru_pmu->module)) {
>> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>> +		module_put(passthru_pmu->module);
>> +	}
>> +}
>> +
>>  /* When entering a guest, schedule out all exclude_guest events. */
>> -void perf_guest_enter(void)
>> +void perf_guest_enter(u32 guest_lvtpc)
>>  {
>>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>  
>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
>>  
>> +	perf_switch_interrupt(true, guest_lvtpc);
>> +
>>  	__this_cpu_write(perf_in_guest, true);
>>  
>>  unlock:
>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>  		goto unlock;
>>  
>> +	perf_switch_interrupt(false, 0);
>> +
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>  	if (!pmu->event_idx)
>>  		pmu->event_idx = perf_event_idx_default;
>>  
>> -	list_add_rcu(&pmu->entry, &pmus);
>> +	/*
>> +	 * Initialize passthru_pmu with the core pmu that has
>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>> +	 */
>> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>> +		if (!passthru_pmu)
>> +			passthru_pmu = pmu;
>> +
>> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>> +			ret = -EINVAL;
>> +			goto free_dev;
>> +		}
>> +	}
> 
> 
> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However, 
> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>

The original implementation doesn't limit the number of PMUs with
PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
case of more than one PMU with the flag. After several debates, the
patch was simplified only to support one PMU with the flag.
It should not be hard to change it back.

Thanks,
Kan

>> +
>> +	list_add_tail_rcu(&pmu->entry, &pmus);
>>  	atomic_set(&pmu->exclusive_cnt, 0);
>>  	ret = 0;
>>  unlock:
> 
>
Manali Shukla Sept. 20, 2024, 5:09 a.m. UTC | #3
On 9/19/2024 6:30 PM, Liang, Kan wrote:
> 
> 
> On 2024-09-19 2:02 a.m., Manali Shukla wrote:
>> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> There will be a dedicated interrupt vector for guests on some platforms,
>>> e.g., Intel. Add an interface to switch the interrupt vector while
>>> entering/exiting a guest.
>>>
>>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>>> bit should be cleared also, then PMI can be generated continuously
>>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>>> and switch_interrupt().
>>>
>>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>>> be found. Since only one passthrough pmu is supported, we keep the
>>> implementation simply by tracking the pmu as a global variable.
>>>
>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>>> supported.]
>>>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> ---
>>>  include/linux/perf_event.h |  9 +++++++--
>>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index 75773f9890cc..aeb08f78f539 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -541,6 +541,11 @@ struct pmu {
>>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>>  	 */
>>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>>> +
>>> +	/*
>>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>>> +	 */
>>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>>  };
>>>  
>>>  enum perf_addr_filter_action_t {
>>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>>>  int perf_get_mediated_pmu(void);
>>>  void perf_put_mediated_pmu(void);
>>> -void perf_guest_enter(void);
>>> +void perf_guest_enter(u32 guest_lvtpc);
>>>  void perf_guest_exit(void);
>>>  #else /* !CONFIG_PERF_EVENTS: */
>>>  static inline void *
>>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>>>  }
>>>  
>>>  static inline void perf_put_mediated_pmu(void)			{ }
>>> -static inline void perf_guest_enter(void)			{ }
>>> +static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
>>>  static inline void perf_guest_exit(void)			{ }
>>>  #endif
>>>  
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 57ff737b922b..047ca5748ee2 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>>>  
>>>  static LIST_HEAD(pmus);
>>>  static DEFINE_MUTEX(pmus_lock);
>>> +static struct pmu *passthru_pmu;
>>>  static struct srcu_struct pmus_srcu;
>>>  static cpumask_var_t perf_online_mask;
>>>  static struct kmem_cache *perf_event_cache;
>>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>>  }
>>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>>  
>>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>>> +{
>>> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>>> +	if (!passthru_pmu)
>>> +		return;
>>> +
>>> +	if (passthru_pmu->switch_interrupt &&
>>> +	    try_module_get(passthru_pmu->module)) {
>>> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>>> +		module_put(passthru_pmu->module);
>>> +	}
>>> +}
>>> +
>>>  /* When entering a guest, schedule out all exclude_guest events. */
>>> -void perf_guest_enter(void)
>>> +void perf_guest_enter(u32 guest_lvtpc)
>>>  {
>>>  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>  
>>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>>  	}
>>>  
>>> +	perf_switch_interrupt(true, guest_lvtpc);
>>> +
>>>  	__this_cpu_write(perf_in_guest, true);
>>>  
>>>  unlock:
>>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>>  		goto unlock;
>>>  
>>> +	perf_switch_interrupt(false, 0);
>>> +
>>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>>  	if (!pmu->event_idx)
>>>  		pmu->event_idx = perf_event_idx_default;
>>>  
>>> -	list_add_rcu(&pmu->entry, &pmus);
>>> +	/*
>>> +	 * Initialize passthru_pmu with the core pmu that has
>>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>>> +	 */
>>> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>>> +		if (!passthru_pmu)
>>> +			passthru_pmu = pmu;
>>> +
>>> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>>> +			ret = -EINVAL;
>>> +			goto free_dev;
>>> +		}
>>> +	}
>>
>>
>> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However, 
>> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
>> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>>
> 
> The original implementation doesn't limit the number of PMUs with
> PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
> case of more than one PMU with the flag. After several debates, the
> patch was simplified only to support one PMU with the flag.
> It should not be hard to change it back.
> 
> Thanks,
> Kan
> 

Yes, we have a use case to use mediated passthrough vPMU framework for IBS virtualization.
So, we will need it.

- Manali

>>> +
>>> +	list_add_tail_rcu(&pmu->entry, &pmus);
>>>  	atomic_set(&pmu->exclusive_cnt, 0);
>>>  	ret = 0;
>>>  unlock:
>>
>>
Mingwei Zhang Sept. 23, 2024, 6:49 p.m. UTC | #4
On Fri, Sep 20, 2024 at 7:09 AM Manali Shukla <manali.shukla@amd.com> wrote:
>
> On 9/19/2024 6:30 PM, Liang, Kan wrote:
> >
> >
> > On 2024-09-19 2:02 a.m., Manali Shukla wrote:
> >> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
> >>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>
> >>> There will be a dedicated interrupt vector for guests on some platforms,
> >>> e.g., Intel. Add an interface to switch the interrupt vector while
> >>> entering/exiting a guest.
> >>>
> >>> When PMI switch into a new guest vector, guest_lvtpc value need to be
> >>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
> >>> bit should be cleared also, then PMI can be generated continuously
> >>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
> >>> and switch_interrupt().
> >>>
> >>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
> >>> be found. Since only one passthrough pmu is supported, we keep the
> >>> implementation simply by tracking the pmu as a global variable.
> >>>
> >>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>
> >>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> >>> supported.]
> >>>
> >>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >>> ---
> >>>  include/linux/perf_event.h |  9 +++++++--
> >>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
> >>>  2 files changed, 41 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >>> index 75773f9890cc..aeb08f78f539 100644
> >>> --- a/include/linux/perf_event.h
> >>> +++ b/include/linux/perf_event.h
> >>> @@ -541,6 +541,11 @@ struct pmu {
> >>>      * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
> >>>      */
> >>>     int (*check_period)             (struct perf_event *event, u64 value); /* optional */
> >>> +
> >>> +   /*
> >>> +    * Switch the interrupt vectors, e.g., guest enter/exit.
> >>> +    */
> >>> +   void (*switch_interrupt)        (bool enter, u32 guest_lvtpc); /* optional */
> >>>  };
> >>>
> >>>  enum perf_addr_filter_action_t {
> >>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
> >>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
> >>>  int perf_get_mediated_pmu(void);
> >>>  void perf_put_mediated_pmu(void);
> >>> -void perf_guest_enter(void);
> >>> +void perf_guest_enter(u32 guest_lvtpc);
> >>>  void perf_guest_exit(void);
> >>>  #else /* !CONFIG_PERF_EVENTS: */
> >>>  static inline void *
> >>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
> >>>  }
> >>>
> >>>  static inline void perf_put_mediated_pmu(void)                     { }
> >>> -static inline void perf_guest_enter(void)                  { }
> >>> +static inline void perf_guest_enter(u32 guest_lvtpc)               { }
> >>>  static inline void perf_guest_exit(void)                   { }
> >>>  #endif
> >>>
> >>> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >>> index 57ff737b922b..047ca5748ee2 100644
> >>> --- a/kernel/events/core.c
> >>> +++ b/kernel/events/core.c
> >>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
> >>>
> >>>  static LIST_HEAD(pmus);
> >>>  static DEFINE_MUTEX(pmus_lock);
> >>> +static struct pmu *passthru_pmu;
> >>>  static struct srcu_struct pmus_srcu;
> >>>  static cpumask_var_t perf_online_mask;
> >>>  static struct kmem_cache *perf_event_cache;
> >>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> >>>
> >>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> >>> +{
> >>> +   /* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> >>> +   if (!passthru_pmu)
> >>> +           return;
> >>> +
> >>> +   if (passthru_pmu->switch_interrupt &&
> >>> +       try_module_get(passthru_pmu->module)) {
> >>> +           passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> >>> +           module_put(passthru_pmu->module);
> >>> +   }
> >>> +}
> >>> +
> >>>  /* When entering a guest, schedule out all exclude_guest events. */
> >>> -void perf_guest_enter(void)
> >>> +void perf_guest_enter(u32 guest_lvtpc)
> >>>  {
> >>>     struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> >>>
> >>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
> >>>             perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> >>>     }
> >>>
> >>> +   perf_switch_interrupt(true, guest_lvtpc);
> >>> +
> >>>     __this_cpu_write(perf_in_guest, true);
> >>>
> >>>  unlock:
> >>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
> >>>     if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
> >>>             goto unlock;
> >>>
> >>> +   perf_switch_interrupt(false, 0);
> >>> +
> >>>     perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> >>>     ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
> >>>     perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> >>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> >>>     if (!pmu->event_idx)
> >>>             pmu->event_idx = perf_event_idx_default;
> >>>
> >>> -   list_add_rcu(&pmu->entry, &pmus);
> >>> +   /*
> >>> +    * Initialize passthru_pmu with the core pmu that has
> >>> +    * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> >>> +    */
> >>> +   if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> >>> +           if (!passthru_pmu)
> >>> +                   passthru_pmu = pmu;
> >>> +
> >>> +           if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> >>> +                   ret = -EINVAL;
> >>> +                   goto free_dev;
> >>> +           }
> >>> +   }
> >>
> >>
> >> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However,
> >> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
> >> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> >>
> >
> > The original implementation doesn't limit the number of PMUs with
> > PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
> > case of more than one PMU with the flag. After several debates, the
> > patch was simplified only to support one PMU with the flag.
> > It should not be hard to change it back.

The original implementation has, by design, a terrible performance
overhead, i.e., every PMU context switch at runtime requires an SRCU
lock pair and a pmu list traversal. To reduce the overhead, we put the
"passthrough" pmus at the front of the list and exit the pmu traversal
early once we have passed the last "passthrough" pmu.

I honestly do not like the idea because it is simply a hacky solution
with high maintenance cost, but I am not the right person to make the
call. So, if perf (kernel) subsystem maintainers are happy, I can
withdraw from this one.

My second point is: if you look at the details, the only reason why we
traverse the pmu list is to check if a pmu has implemented the
"switch_interrupt()" API. If so, invoke it, which will change the PMI
vector from NMI to a maskable interrupt for KVM. In the AMD case, I bet
even if there are multiple passthrough pmus, only one requires
switching the interrupt vector. Let me know if this is wrong.
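
For illustration, a driver-side implementation of this callback on x86 could
be as simple as reprogramming the local APIC LVTPC entry (a sketch with a
made-up function name, not code from this series):

static void x86_pmu_switch_interrupt(bool enter, u32 guest_lvtpc)
{
	if (enter)
		/* Mirror the guest's LVTPC (vector and mask bit) onto HW. */
		apic_write(APIC_LVTPC, guest_lvtpc);
	else
		/* Fall back to the host's NMI-delivered PMI. */
		apic_write(APIC_LVTPC, APIC_DM_NMI);
}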

Thanks.
-Mingwei

> >
> > Thanks,
> > Kan
> >
>
> Yes, we have a use case to use mediated passthrough vPMU framework for IBS virtualization.
> So, we will need it.
>
> - Manali
>
> >>> +
> >>> +   list_add_tail_rcu(&pmu->entry, &pmus);
> >>>     atomic_set(&pmu->exclusive_cnt, 0);
> >>>     ret = 0;
> >>>  unlock:
> >>
> >>
>
Manali Shukla Sept. 24, 2024, 4:55 p.m. UTC | #5
On 9/24/2024 12:19 AM, Mingwei Zhang wrote:
> On Fri, Sep 20, 2024 at 7:09 AM Manali Shukla <manali.shukla@amd.com> wrote:
>>
>> On 9/19/2024 6:30 PM, Liang, Kan wrote:
>>>
>>>
>>> On 2024-09-19 2:02 a.m., Manali Shukla wrote:
>>>> On 8/1/2024 10:28 AM, Mingwei Zhang wrote:
>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> There will be a dedicated interrupt vector for guests on some platforms,
>>>>> e.g., Intel. Add an interface to switch the interrupt vector while
>>>>> entering/exiting a guest.
>>>>>
>>>>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>>>>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>>>>> bit should be cleared also, then PMI can be generated continuously
>>>>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>>>>> and switch_interrupt().
>>>>>
>>>>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>>>>> be found. Since only one passthrough pmu is supported, we keep the
>>>>> implementation simply by tracking the pmu as a global variable.
>>>>>
>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>>>>> supported.]
>>>>>
>>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>>> ---
>>>>>  include/linux/perf_event.h |  9 +++++++--
>>>>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>>>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>>> index 75773f9890cc..aeb08f78f539 100644
>>>>> --- a/include/linux/perf_event.h
>>>>> +++ b/include/linux/perf_event.h
>>>>> @@ -541,6 +541,11 @@ struct pmu {
>>>>>      * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>>>>      */
>>>>>     int (*check_period)             (struct perf_event *event, u64 value); /* optional */
>>>>> +
>>>>> +   /*
>>>>> +    * Switch the interrupt vectors, e.g., guest enter/exit.
>>>>> +    */
>>>>> +   void (*switch_interrupt)        (bool enter, u32 guest_lvtpc); /* optional */
>>>>>  };
>>>>>
>>>>>  enum perf_addr_filter_action_t {
>>>>> @@ -1738,7 +1743,7 @@ extern int perf_event_period(struct perf_event *event, u64 value);
>>>>>  extern u64 perf_event_pause(struct perf_event *event, bool reset);
>>>>>  int perf_get_mediated_pmu(void);
>>>>>  void perf_put_mediated_pmu(void);
>>>>> -void perf_guest_enter(void);
>>>>> +void perf_guest_enter(u32 guest_lvtpc);
>>>>>  void perf_guest_exit(void);
>>>>>  #else /* !CONFIG_PERF_EVENTS: */
>>>>>  static inline void *
>>>>> @@ -1833,7 +1838,7 @@ static inline int perf_get_mediated_pmu(void)
>>>>>  }
>>>>>
>>>>>  static inline void perf_put_mediated_pmu(void)                     { }
>>>>> -static inline void perf_guest_enter(void)                  { }
>>>>> +static inline void perf_guest_enter(u32 guest_lvtpc)               { }
>>>>>  static inline void perf_guest_exit(void)                   { }
>>>>>  #endif
>>>>>
>>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>>> index 57ff737b922b..047ca5748ee2 100644
>>>>> --- a/kernel/events/core.c
>>>>> +++ b/kernel/events/core.c
>>>>> @@ -422,6 +422,7 @@ static inline bool is_include_guest_event(struct perf_event *event)
>>>>>
>>>>>  static LIST_HEAD(pmus);
>>>>>  static DEFINE_MUTEX(pmus_lock);
>>>>> +static struct pmu *passthru_pmu;
>>>>>  static struct srcu_struct pmus_srcu;
>>>>>  static cpumask_var_t perf_online_mask;
>>>>>  static struct kmem_cache *perf_event_cache;
>>>>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>>>>
>>>>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>>>>> +{
>>>>> +   /* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>>>>> +   if (!passthru_pmu)
>>>>> +           return;
>>>>> +
>>>>> +   if (passthru_pmu->switch_interrupt &&
>>>>> +       try_module_get(passthru_pmu->module)) {
>>>>> +           passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>>>>> +           module_put(passthru_pmu->module);
>>>>> +   }
>>>>> +}
>>>>> +
>>>>>  /* When entering a guest, schedule out all exclude_guest events. */
>>>>> -void perf_guest_enter(void)
>>>>> +void perf_guest_enter(u32 guest_lvtpc)
>>>>>  {
>>>>>     struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>>>>>
>>>>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>>>>             perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>>>>     }
>>>>>
>>>>> +   perf_switch_interrupt(true, guest_lvtpc);
>>>>> +
>>>>>     __this_cpu_write(perf_in_guest, true);
>>>>>
>>>>>  unlock:
>>>>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>>>>     if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>>>>             goto unlock;
>>>>>
>>>>> +   perf_switch_interrupt(false, 0);
>>>>> +
>>>>>     perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>>>>     ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>>>>     perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>>>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>>>>     if (!pmu->event_idx)
>>>>>             pmu->event_idx = perf_event_idx_default;
>>>>>
>>>>> -   list_add_rcu(&pmu->entry, &pmus);
>>>>> +   /*
>>>>> +    * Initialize passthru_pmu with the core pmu that has
>>>>> +    * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>>>>> +    */
>>>>> +   if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>>>>> +           if (!passthru_pmu)
>>>>> +                   passthru_pmu = pmu;
>>>>> +
>>>>> +           if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>>>>> +                   ret = -EINVAL;
>>>>> +                   goto free_dev;
>>>>> +           }
>>>>> +   }
>>>>
>>>>
>>>> Our intention is to virtualize IBS PMUs (Op and Fetch) using the same framework. However,
>>>> if IBS PMUs are also using the PERF_PMU_CAP_PASSTHROUGH_VPMU capability, IBS PMU registration
>>>> fails at this point because the Core PMU is already registered with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>>>>
>>>
>>> The original implementation doesn't limit the number of PMUs with
>>> PERF_PMU_CAP_PASSTHROUGH_VPMU. But at that time, we could not find a
>>> case of more than one PMU with the flag. After several debates, the
>>> patch was simplified only to support one PMU with the flag.
>>> It should not be hard to change it back.
> 
> The original implementation is by design having a terrible performance
> overhead, ie., every PMU context switch at runtime requires a SRCU
> lock pair and pmu list traversal. To reduce the overhead, we put
> "passthrough" pmus in the front of the list and quickly exit the pmu
> traversal when we just pass the last "passthrough" pmu.
> 
> I honestly do not like the idea because it is simply a hacky solution
> with high maintenance cost, but I am not the right person to make the
> call. So, if perf (kernel) subsystem maintainers are happy, I can
> withdraw from this one.
> 
> My second point is: if you look at the details, the only reason why we
> traverse the pmu list is to check if a pmu has implemented the
> "switch_interrupt()" API. If so, invoke it, which will change the PMI
> vector from NMI to a maskable interrupt for KVM. In AMD case, I bet
> even if there are multiple passthrough pmus, only one requires
> switching the interrupt vector. Let me know if this is wrong.
> 
> Thanks.
> -Mingwei
> 

Yeah, that is correct. Currently, switching the interrupt vector is
needed by only one passthrough PMU for AMD. It is not required for IBS
virtualization or PMC virtualization because a VNMI is used to deliver
the interrupt in both cases.

With the mediated passthrough VPMU enabled, all exclude_guest events
for PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU capability are scheduled out 
when entering a guest and scheduled back in upon exit. We need this
functionality for IBS virtualization and PMC virtualization.

However, the current design allows only one passthrough PMU, which
prevents the IBS driver from loading with the
PERF_PMU_CAP_PASSTHROUGH_VPMU capability.

As a result, we are unable to use the scheduling out and scheduling in
of exclude_guest events within the mediated passthrough VPMU framework
for IBS virtualization.
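
For reference, opting the IBS PMUs into the framework would only mean
advertising the capability at init time, roughly like the sketch below
(assuming the perf_ibs_op/perf_ibs_fetch instances in
arch/x86/events/amd/ibs.c; the helper name is made up):

static void perf_ibs_enable_mediated_vpmu(void)
{
	perf_ibs_op.pmu.capabilities    |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
	perf_ibs_fetch.pmu.capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
}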

- Manali


>>>
>>> Thanks,
>>> Kan
>>>
>>
>> Yes, we have a use case to use mediated passthrough vPMU framework for IBS virtualization.
>> So, we will need it.
>>
>> - Manali
>>
>>>>> +
>>>>> +   list_add_tail_rcu(&pmu->entry, &pmus);
>>>>>     atomic_set(&pmu->exclusive_cnt, 0);
>>>>>     ret = 0;
>>>>>  unlock:
>>>>
>>>>
>>
Peter Zijlstra Oct. 14, 2024, 11:56 a.m. UTC | #6
On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:

> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>  }
>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>  
> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> +{
> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> +	if (!passthru_pmu)
> +		return;
> +
> +	if (passthru_pmu->switch_interrupt &&
> +	    try_module_get(passthru_pmu->module)) {
> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> +		module_put(passthru_pmu->module);
> +	}
> +}

Should we move the whole module reference to perf_pmu_{,un}register()?

> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  	if (!pmu->event_idx)
>  		pmu->event_idx = perf_event_idx_default;
>  
> -	list_add_rcu(&pmu->entry, &pmus);
> +	/*
> +	 * Initialize passthru_pmu with the core pmu that has
> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> +	 */
> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> +		if (!passthru_pmu)
> +			passthru_pmu = pmu;
> +
> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> +			ret = -EINVAL;
> +			goto free_dev;

Why impose this limit? Changelog also fails to explain this.

> +		}
> +	}
> +
> +	list_add_tail_rcu(&pmu->entry, &pmus);
>  	atomic_set(&pmu->exclusive_cnt, 0);
>  	ret = 0;
>  unlock:
> -- 
> 2.46.0.rc1.232.g9752f9e123-goog
>
Peter Zijlstra Oct. 14, 2024, 11:59 a.m. UTC | #7
On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:

> The original implementation is by design having a terrible performance
> overhead, ie., every PMU context switch at runtime requires a SRCU
> lock pair and pmu list traversal. To reduce the overhead, we put
> "passthrough" pmus in the front of the list and quickly exit the pmu
> traversal when we just pass the last "passthrough" pmu.

What was the expensive bit? The SRCU memory barrier or the list
iteration? How long is that list really?
Peter Zijlstra Oct. 14, 2024, 12:03 p.m. UTC | #8
On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> There will be a dedicated interrupt vector for guests on some platforms,
> e.g., Intel. Add an interface to switch the interrupt vector while
> entering/exiting a guest.
> 
> When PMI switch into a new guest vector, guest_lvtpc value need to be
> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
> bit should be cleared also, then PMI can be generated continuously
> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
> and switch_interrupt().
> 
> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
> be found. Since only one passthrough pmu is supported, we keep the
> implementation simply by tracking the pmu as a global variable.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> 
> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> supported.]
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  include/linux/perf_event.h |  9 +++++++--
>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 75773f9890cc..aeb08f78f539 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -541,6 +541,11 @@ struct pmu {
>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>  	 */
>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
> +
> +	/*
> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
> +	 */
> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>  };

I'm thinking the guest_lvtpc argument shouldn't be part of the
interface. That should be PMU implementation data and accessed by the
method implementation.
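
Something like the sketch below, perhaps, where KVM publishes the value into
per-CPU state owned by the x86 PMU code before entering the guest, and the
callback takes no argument at all (every name here is made up for
illustration):

static DEFINE_PER_CPU(u32, perf_guest_lvtpc);

/* Hypothetical setter that KVM would call before perf_guest_enter(). */
void x86_perf_set_guest_lvtpc(u32 lvtpc)
{
	this_cpu_write(perf_guest_lvtpc, lvtpc);
}

static void x86_pmu_switch_interrupt(bool enter)
{
	if (enter)
		apic_write(APIC_LVTPC, this_cpu_read(perf_guest_lvtpc));
	else
		apic_write(APIC_LVTPC, APIC_DM_NMI);
}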
Peter Zijlstra Oct. 14, 2024, 1:52 p.m. UTC | #9
On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>  	}
>  
> +	perf_switch_interrupt(true, guest_lvtpc);
> +
>  	__this_cpu_write(perf_in_guest, true);
>  
>  unlock:
> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>  		goto unlock;
>  
> +	perf_switch_interrupt(false, 0);
> +
>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);

This seems to suggest the method is named wrong, it should probably be
guest_enter() or somesuch.
Liang, Kan Oct. 14, 2024, 3:40 p.m. UTC | #10
On 2024-10-14 7:56 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> 
>> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
>>  }
>>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>  
>> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
>> +{
>> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
>> +	if (!passthru_pmu)
>> +		return;
>> +
>> +	if (passthru_pmu->switch_interrupt &&
>> +	    try_module_get(passthru_pmu->module)) {
>> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
>> +		module_put(passthru_pmu->module);
>> +	}
>> +}
> 
> Should we move the whole module reference to perf_pmu_(,un}register() ?

A PMU module can be loaded/unloaded at any time. How would we know whether
the PMU module is still available if the reference check were moved to
perf_pmu_{,un}register()?

> 
>> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>>  	if (!pmu->event_idx)
>>  		pmu->event_idx = perf_event_idx_default;
>>  
>> -	list_add_rcu(&pmu->entry, &pmus);
>> +	/*
>> +	 * Initialize passthru_pmu with the core pmu that has
>> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
>> +	 */
>> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>> +		if (!passthru_pmu)
>> +			passthru_pmu = pmu;
>> +
>> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
>> +			ret = -EINVAL;
>> +			goto free_dev;
> 
> Why impose this limit? Changelog also fails to explain this.

Because passthru_pmu is a global variable. If there are two or more
PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU, the earlier one would be
implicitly overwritten without the check.

Thanks
Kan
> 
>> +		}
>> +	}
>> +
>> +	list_add_tail_rcu(&pmu->entry, &pmus);
>>  	atomic_set(&pmu->exclusive_cnt, 0);
>>  	ret = 0;
>>  unlock:
>> -- 
>> 2.46.0.rc1.232.g9752f9e123-goog
>>
>
Liang, Kan Oct. 14, 2024, 3:51 p.m. UTC | #11
On 2024-10-14 8:03 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> There will be a dedicated interrupt vector for guests on some platforms,
>> e.g., Intel. Add an interface to switch the interrupt vector while
>> entering/exiting a guest.
>>
>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>> bit should be cleared also, then PMI can be generated continuously
>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>> and switch_interrupt().
>>
>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>> be found. Since only one passthrough pmu is supported, we keep the
>> implementation simply by tracking the pmu as a global variable.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>
>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>> supported.]
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>  include/linux/perf_event.h |  9 +++++++--
>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 75773f9890cc..aeb08f78f539 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -541,6 +541,11 @@ struct pmu {
>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>  	 */
>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>> +
>> +	/*
>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>> +	 */
>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>  };
> 
> I'm thinking the guets_lvtpc argument shouldn't be part of the
> interface. That should be PMU implementation data and accessed by the
> method implementation.

I think the name perf_switch_interrupt() is too specific.
What happens here is switching the guest context; the interrupt is just
part of that context. Maybe an interface as below:

void (*switch_guest_ctx)	(bool enter, void *data); /* optional */

Thanks,
Kan
Liang, Kan Oct. 14, 2024, 3:57 p.m. UTC | #12
On 2024-10-14 9:52 a.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
>> @@ -5962,6 +5976,8 @@ void perf_guest_enter(void)
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
>>  
>> +	perf_switch_interrupt(true, guest_lvtpc);
>> +
>>  	__this_cpu_write(perf_in_guest, true);
>>  
>>  unlock:
>> @@ -5980,6 +5996,8 @@ void perf_guest_exit(void)
>>  	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>>  		goto unlock;
>>  
>> +	perf_switch_interrupt(false, 0);
>> +
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>>  	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> 
> This seems to suggest the method is named wrong, it should probably be
> guest_enter() or somsuch.
>

The ctx_sched_in() is there to schedule in the host context after the
guest_exit(). The EVENT_GUEST flag indicates the guest ctx switch.

The name may bring some confusion. Maybe I can add a wrapper function
perf_host_enter() to include the above code.

Thanks,
Kan
Liang, Kan Oct. 14, 2024, 4:15 p.m. UTC | #13
On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
> On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
> 
>> The original implementation is by design having a terrible performance
>> overhead, ie., every PMU context switch at runtime requires a SRCU
>> lock pair and pmu list traversal. To reduce the overhead, we put
>> "passthrough" pmus in the front of the list and quickly exit the pmu
>> traversal when we just pass the last "passthrough" pmu.
> 
> What was the expensive bit? The SRCU memory barrier or the list
> iteration? How long is that list really?

Both. But I don't think there is any performance data.

The length of the list could vary on different platforms. For a modern
server, there could be hundreds of PMUs from uncore PMUs, CXL PMUs,
IOMMU PMUs, PMUs of accelerator devices and PMUs from all kinds of
devices. The number could keep increasing with more and more devices
supporting the PMU capability.

Two methods were considered.
- One is to add a global variable to track the "passthrough" pmu. The
idea assumes that there is only one "passthrough" pmu that requires the
switch, and that the situation will not change in the near future.
So the SRCU memory barrier and the list iteration can be avoided.
This is what the patch implements.

- The other is to always put the "passthrough" pmus at the front of the
list, so the unnecessary list iteration can be avoided. It does nothing
for the SRCU lock pair.

Thanks,
Kan
Peter Zijlstra Oct. 14, 2024, 5:45 p.m. UTC | #14
On Mon, Oct 14, 2024 at 12:15:11PM -0400, Liang, Kan wrote:
> 
> 
> On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
> > On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
> > 
> >> The original implementation is by design having a terrible performance
> >> overhead, ie., every PMU context switch at runtime requires a SRCU
> >> lock pair and pmu list traversal. To reduce the overhead, we put
> >> "passthrough" pmus in the front of the list and quickly exit the pmu
> >> traversal when we just pass the last "passthrough" pmu.
> > 
> > What was the expensive bit? The SRCU memory barrier or the list
> > iteration? How long is that list really?
> 
> Both. But I don't think there is any performance data.
> 
> The length of the list could vary on different platforms. For a modern
> server, there could be hundreds of PMUs from uncore PMUs, CXL PMUs,
> IOMMU PMUs, PMUs of accelerator devices and PMUs from all kinds of
> devices. The number could keep increasing with more and more devices
> supporting the PMU capability.
> 
> Two methods were considered.
> - One is to add a global variable to track the "passthrough" pmu. The
> idea assumes that there is only one "passthrough" pmu that requires the
> switch, and the situation will not be changed in the near feature.
> So the SRCU memory barrier and the list iteration can be avoided.
> It's implemented in the patch
> 
> - The other one is always put the "passthrough" pmus in the front of the
> list. So the unnecessary list iteration can be avoided. It does nothing
> for the SRCU lock pair.

PaulMck has patches that introduce srcu_read_lock_lite(), which would
avoid the smp_mb() in most cases.

  https://lkml.kernel.org/r/20241011173931.2050422-6-paulmck@kernel.org

We can also keep a second list, just for the passthrough pmus. A bit
like sched_cb_list.
Peter Zijlstra Oct. 14, 2024, 5:47 p.m. UTC | #15
On Mon, Oct 14, 2024 at 11:40:21AM -0400, Liang, Kan wrote:

> >> @@ -11842,7 +11860,21 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> >>  	if (!pmu->event_idx)
> >>  		pmu->event_idx = perf_event_idx_default;
> >>  
> >> -	list_add_rcu(&pmu->entry, &pmus);
> >> +	/*
> >> +	 * Initialize passthru_pmu with the core pmu that has
> >> +	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
> >> +	 */
> >> +	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> >> +		if (!passthru_pmu)
> >> +			passthru_pmu = pmu;
> >> +
> >> +		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
> >> +			ret = -EINVAL;
> >> +			goto free_dev;
> > 
> > Why impose this limit? Changelog also fails to explain this.
> 
> Because the passthru_pmu is global variable. If there are two or more
> PMUs with the PERF_PMU_CAP_PASSTHROUGH_VPMU, the former one will be
> implicitly overwritten if without the check.

That is not the question; but it has been answered elsewhere. For some
reason you thought the srcu+list thing was expensive.
Peter Zijlstra Oct. 14, 2024, 5:49 p.m. UTC | #16
On Mon, Oct 14, 2024 at 11:51:06AM -0400, Liang, Kan wrote:
> On 2024-10-14 8:03 a.m., Peter Zijlstra wrote:
> > On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> There will be a dedicated interrupt vector for guests on some platforms,
> >> e.g., Intel. Add an interface to switch the interrupt vector while
> >> entering/exiting a guest.
> >>
> >> When PMI switch into a new guest vector, guest_lvtpc value need to be
> >> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
> >> bit should be cleared also, then PMI can be generated continuously
> >> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
> >> and switch_interrupt().
> >>
> >> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
> >> be found. Since only one passthrough pmu is supported, we keep the
> >> implementation simply by tracking the pmu as a global variable.
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
> >> supported.]
> >>
> >> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >> ---
> >>  include/linux/perf_event.h |  9 +++++++--
> >>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
> >>  2 files changed, 41 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index 75773f9890cc..aeb08f78f539 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -541,6 +541,11 @@ struct pmu {
> >>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
> >>  	 */
> >>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
> >> +
> >> +	/*
> >> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
> >> +	 */
> >> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
> >>  };
> > 
> > I'm thinking the guets_lvtpc argument shouldn't be part of the
> > interface. That should be PMU implementation data and accessed by the
> > method implementation.
> 
> I think the name of the perf_switch_interrupt() is too specific.
> Here should be to switch the guest context. The interrupt should be just
> part of the context. Maybe a interface as below
> 
> void (*switch_guest_ctx)	(bool enter, void *data); /* optional */

I don't think you even need the data thing. For example, the x86/intel
implementation can just look at a x86_pmu data field to find the magic
value.
Peter Zijlstra Oct. 14, 2024, 5:51 p.m. UTC | #17
On Mon, Oct 14, 2024 at 11:40:21AM -0400, Liang, Kan wrote:
> 
> 
> On 2024-10-14 7:56 a.m., Peter Zijlstra wrote:
> > On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
> > 
> >> @@ -5941,8 +5942,21 @@ void perf_put_mediated_pmu(void)
> >>  }
> >>  EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> >>  
> >> +static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
> >> +{
> >> +	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
> >> +	if (!passthru_pmu)
> >> +		return;
> >> +
> >> +	if (passthru_pmu->switch_interrupt &&
> >> +	    try_module_get(passthru_pmu->module)) {
> >> +		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
> >> +		module_put(passthru_pmu->module);
> >> +	}
> >> +}
> > 
> > Should we move the whole module reference to perf_pmu_(,un}register() ?
> 
> A PMU module can be load/unload anytime. How should we know if the PMU
> module is available when the reference check is moved to
> perf_pmu_(,un}register()?

Feh, dunno. I never really use modules. I just think the above is naf --
doubly so because you're saying the SRCU smp_mb() are expensive; but
this module reference crap is more expensive than that.

(IOW, the SRCU smp_mb() cannot have been the problem)
Liang, Kan Oct. 15, 2024, 1:23 p.m. UTC | #18
On 2024-10-14 1:49 p.m., Peter Zijlstra wrote:
> On Mon, Oct 14, 2024 at 11:51:06AM -0400, Liang, Kan wrote:
>> On 2024-10-14 8:03 a.m., Peter Zijlstra wrote:
>>> On Thu, Aug 01, 2024 at 04:58:23AM +0000, Mingwei Zhang wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> There will be a dedicated interrupt vector for guests on some platforms,
>>>> e.g., Intel. Add an interface to switch the interrupt vector while
>>>> entering/exiting a guest.
>>>>
>>>> When PMI switch into a new guest vector, guest_lvtpc value need to be
>>>> reflected onto HW, e,g., guest clear PMI mask bit, the HW PMI mask
>>>> bit should be cleared also, then PMI can be generated continuously
>>>> for guest. So guest_lvtpc parameter is added into perf_guest_enter()
>>>> and switch_interrupt().
>>>>
>>>> At switch_interrupt(), the target pmu with PASSTHROUGH cap should
>>>> be found. Since only one passthrough pmu is supported, we keep the
>>>> implementation simply by tracking the pmu as a global variable.
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> [Simplify the commit with removal of srcu lock/unlock since only one pmu is
>>>> supported.]
>>>>
>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>> ---
>>>>  include/linux/perf_event.h |  9 +++++++--
>>>>  kernel/events/core.c       | 36 ++++++++++++++++++++++++++++++++++--
>>>>  2 files changed, 41 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>> index 75773f9890cc..aeb08f78f539 100644
>>>> --- a/include/linux/perf_event.h
>>>> +++ b/include/linux/perf_event.h
>>>> @@ -541,6 +541,11 @@ struct pmu {
>>>>  	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>>>>  	 */
>>>>  	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
>>>> +
>>>> +	/*
>>>> +	 * Switch the interrupt vectors, e.g., guest enter/exit.
>>>> +	 */
>>>> +	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
>>>>  };
>>>
>>> I'm thinking the guets_lvtpc argument shouldn't be part of the
>>> interface. That should be PMU implementation data and accessed by the
>>> method implementation.
>>
>> I think the name of the perf_switch_interrupt() is too specific.
>> Here should be to switch the guest context. The interrupt should be just
>> part of the context. Maybe a interface as below
>>
>> void (*switch_guest_ctx)	(bool enter, void *data); /* optional */
> 
> I don't think you even need the data thing. For example, the x86/intel
> implementation can just look at a x86_pmu data field to find the magic
> value.

The new vector is created by KVM, not perf. So it cannot be found in the
x86_pmu data field. Perf needs it to update the interrupt vector so the
guest PMI can be handled by KVM directly.

Thanks,
Kan
Liang, Kan Oct. 15, 2024, 3:59 p.m. UTC | #19
On 2024-10-14 1:45 p.m., Peter Zijlstra wrote:
> On Mon, Oct 14, 2024 at 12:15:11PM -0400, Liang, Kan wrote:
>>
>>
>> On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
>>> On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
>>>
>>>> The original implementation is by design having a terrible performance
>>>> overhead, ie., every PMU context switch at runtime requires a SRCU
>>>> lock pair and pmu list traversal. To reduce the overhead, we put
>>>> "passthrough" pmus in the front of the list and quickly exit the pmu
>>>> traversal when we just pass the last "passthrough" pmu.
>>>
>>> What was the expensive bit? The SRCU memory barrier or the list
>>> iteration? How long is that list really?
>>
>> Both. But I don't think there is any performance data.
>>
>> The length of the list could vary on different platforms. For a modern
>> server, there could be hundreds of PMUs from uncore PMUs, CXL PMUs,
>> IOMMU PMUs, PMUs of accelerator devices and PMUs from all kinds of
>> devices. The number could keep increasing with more and more devices
>> supporting the PMU capability.
>>
>> Two methods were considered.
>> - One is to add a global variable to track the "passthrough" pmu. The
>> idea assumes that there is only one "passthrough" pmu that requires the
>> switch, and the situation will not be changed in the near feature.
>> So the SRCU memory barrier and the list iteration can be avoided.
>> It's implemented in the patch
>>
>> - The other one is always put the "passthrough" pmus in the front of the
>> list. So the unnecessary list iteration can be avoided. It does nothing
>> for the SRCU lock pair.
> 
> PaulMck has patches that introduce srcu_read_lock_lite(), which would
> avoid the smp_mb() in most cases.
> 
>   https://lkml.kernel.org/r/20241011173931.2050422-6-paulmck@kernel.org
> 
> We can also keep a second list, just for the passthrough pmus. A bit
> like sched_cb_list.

Maybe we can do something like pmu_event_list, which has its own lock.
So we should not need the srcu and the module reference.
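
Roughly along these lines (a sketch only; the list, the lock, and the
pmu->passthrough_entry node are all made-up names here):

static LIST_HEAD(passthrough_pmus);
static DEFINE_RAW_SPINLOCK(passthrough_pmus_lock);

/*
 * perf_pmu_register()/perf_pmu_unregister() would add/remove entries
 * under the lock, so the guest enter/exit path needs neither SRCU nor
 * a module reference: an unregistering PMU is off the list before its
 * module can go away.
 */
static void perf_switch_guest_ctx(bool enter, u32 guest_lvtpc)
{
	struct pmu *pmu;
	unsigned long flags;

	raw_spin_lock_irqsave(&passthrough_pmus_lock, flags);
	list_for_each_entry(pmu, &passthrough_pmus, passthrough_entry) {
		if (pmu->switch_interrupt)
			pmu->switch_interrupt(enter, guest_lvtpc);
	}
	raw_spin_unlock_irqrestore(&passthrough_pmus_lock, flags);
}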

Thanks,
Kan

Patch

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 75773f9890cc..aeb08f78f539 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -541,6 +541,11 @@  struct pmu {
 	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
 	 */
 	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
+
+	/*
+	 * Switch the interrupt vectors, e.g., guest enter/exit.
+	 */
+	void (*switch_interrupt)	(bool enter, u32 guest_lvtpc); /* optional */
 };
 
 enum perf_addr_filter_action_t {
@@ -1738,7 +1743,7 @@  extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 int perf_get_mediated_pmu(void);
 void perf_put_mediated_pmu(void);
-void perf_guest_enter(void);
+void perf_guest_enter(u32 guest_lvtpc);
 void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
@@ -1833,7 +1838,7 @@  static inline int perf_get_mediated_pmu(void)
 }
 
 static inline void perf_put_mediated_pmu(void)			{ }
-static inline void perf_guest_enter(void)			{ }
+static inline void perf_guest_enter(u32 guest_lvtpc)		{ }
 static inline void perf_guest_exit(void)			{ }
 #endif
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 57ff737b922b..047ca5748ee2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -422,6 +422,7 @@  static inline bool is_include_guest_event(struct perf_event *event)
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
+static struct pmu *passthru_pmu;
 static struct srcu_struct pmus_srcu;
 static cpumask_var_t perf_online_mask;
 static struct kmem_cache *perf_event_cache;
@@ -5941,8 +5942,21 @@  void perf_put_mediated_pmu(void)
 }
 EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
 
+static void perf_switch_interrupt(bool enter, u32 guest_lvtpc)
+{
+	/* Mediated passthrough PMU should have PASSTHROUGH_VPMU cap. */
+	if (!passthru_pmu)
+		return;
+
+	if (passthru_pmu->switch_interrupt &&
+	    try_module_get(passthru_pmu->module)) {
+		passthru_pmu->switch_interrupt(enter, guest_lvtpc);
+		module_put(passthru_pmu->module);
+	}
+}
+
 /* When entering a guest, schedule out all exclude_guest events. */
-void perf_guest_enter(void)
+void perf_guest_enter(u32 guest_lvtpc)
 {
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 
@@ -5962,6 +5976,8 @@  void perf_guest_enter(void)
 		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
 	}
 
+	perf_switch_interrupt(true, guest_lvtpc);
+
 	__this_cpu_write(perf_in_guest, true);
 
 unlock:
@@ -5980,6 +5996,8 @@  void perf_guest_exit(void)
 	if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
 		goto unlock;
 
+	perf_switch_interrupt(false, 0);
+
 	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
 	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
 	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
@@ -11842,7 +11860,21 @@  int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 	if (!pmu->event_idx)
 		pmu->event_idx = perf_event_idx_default;
 
-	list_add_rcu(&pmu->entry, &pmus);
+	/*
+	 * Initialize passthru_pmu with the core pmu that has
+	 * PERF_PMU_CAP_PASSTHROUGH_VPMU capability.
+	 */
+	if (pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
+		if (!passthru_pmu)
+			passthru_pmu = pmu;
+
+		if (WARN_ONCE(passthru_pmu != pmu, "Only one passthrough PMU is supported\n")) {
+			ret = -EINVAL;
+			goto free_dev;
+		}
+	}
+
+	list_add_tail_rcu(&pmu->entry, &pmus);
 	atomic_set(&pmu->exclusive_cnt, 0);
 	ret = 0;
 unlock: