perf/x86/rapl: Fix PP1 event for Intel Meteor/Lunar Lake

Message ID	20250220153857.2593704-6-lucas.demarchi@intel.com (mailing list archive)
State	New
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Lucas De Marchi <lucas.demarchi@intel.com> To: linux-perf-users@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org Cc: dave.hansen@linux.intel.com, Zhang Rui <rui.zhang@intel.com>, Kan Liang <kan.liang@linux.intel.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Lucas De Marchi <lucas.demarchi@intel.com>, Ulisses Furquim <ulisses.furquim@intel.com>, intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org Subject: [PATCH] perf/x86/rapl: Fix PP1 event for Intel Meteor/Lunar Lake Date: Thu, 20 Feb 2025 07:36:30 -0800 Message-ID: <20250220153857.2593704-6-lucas.demarchi@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	perf/x86/rapl: Fix PP1 event for Intel Meteor/Lunar Lake \| expand perf/x86/rapl: Fix PP1 event for Intel Meteor/Lunar Lake

Lucas De Marchi Feb. 20, 2025, 3:36 p.m. UTC

On some boots the read of MSR_PP1_ENERGY_STATUS msr returns 0, causing
perf_msr_probe() to make the power/events/energy-gpu event non-visible.
When that happens, the msr always read 0 until the graphics module (i915
for Meteor Lake, xe for Lunar Lake) is loaded. Then it starts returning
something different and re-loading the rapl module "fixes" it.

This is tested on the following platforms with the fail rates before
this patch:

	Alder Lake S	0/20
	Arrow Lake H	0/20
	Lunar Lake M	8/20
	Meteor Lake U	6/20
	Raptor Lake P	4/20
	Raptor Lake S	0/20

For those platforms failing, use a separate msr list with .no_check
set so it doesn't check the runtime value to create the event - it will
just return 0 until the i915/xe module initializes the GPU.

The issue https://github.com/ulissesf/qmassa/issues/4 is workarounded by
reading the MSR directly since it works after xe is loaded, but the
issue with not having the perf event is still there.

Closes: https://github.com/ulissesf/qmassa/issues/4
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4241
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com
---

Maybe a clearer alternative is to just move all the platforms after
RAPTORLAKE with a gpu to use the new msr list.

 arch/x86/events/rapl.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

Dave Hansen Feb. 20, 2025, 4:28 p.m. UTC | #1

On 2/20/25 07:36, Lucas De Marchi wrote:
> On some boots the read of MSR_PP1_ENERGY_STATUS msr returns 0, causing
> perf_msr_probe() to make the power/events/energy-gpu event non-visible.
> When that happens, the msr always read 0 until the graphics module (i915
> for Meteor Lake, xe for Lunar Lake) is loaded. Then it starts returning
> something different and re-loading the rapl module "fixes" it.

What's the root cause here? Did the kernel do something funky? Or is
this a hardware bug?

Liang, Kan Feb. 20, 2025, 4:47 p.m. UTC | #2

On 2025-02-20 10:36 a.m., Lucas De Marchi wrote:
> On some boots the read of MSR_PP1_ENERGY_STATUS msr returns 0, causing
> perf_msr_probe() to make the power/events/energy-gpu event non-visible.
> When that happens, the msr always read 0 until the graphics module (i915
> for Meteor Lake, xe for Lunar Lake) is loaded. Then it starts returning
> something different and re-loading the rapl module "fixes" it.
> 
> This is tested on the following platforms with the fail rates before
> this patch:
> 
> 	Alder Lake S	0/20
> 	Arrow Lake H	0/20
> 	Lunar Lake M	8/20
> 	Meteor Lake U	6/20
> 	Raptor Lake P	4/20
> 	Raptor Lake S	0/20
> 
> For those platforms failing, use a separate msr list with .no_check
> set so it doesn't check the runtime value to create the event - it will
> just return 0 until the i915/xe module initializes the GPU.
> 
> The issue https://github.com/ulissesf/qmassa/issues/4 is workarounded by
> reading the MSR directly since it works after xe is loaded, but the
> issue with not having the perf event is still there.
> 
> Closes: https://github.com/ulissesf/qmassa/issues/4
> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4241
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com
> ---
> 
> Maybe a clearer alternative is to just move all the platforms after
> RAPTORLAKE with a gpu to use the new msr list.
> 
>  arch/x86/events/rapl.c | 26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
> index 4952faf03e82d..18e324b8fa82c 100644
> --- a/arch/x86/events/rapl.c
> +++ b/arch/x86/events/rapl.c
> @@ -588,6 +588,14 @@ static struct perf_msr intel_rapl_spr_msrs[] = {
>  	[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group,  test_msr, true, RAPL_MSR_MASK },
>  };
>  
> +static struct perf_msr intel_rapl_mtl_msrs[] = {
> +	[PERF_RAPL_PP0]  = { MSR_PP0_ENERGY_STATUS,      &rapl_events_cores_group, test_msr, false, RAPL_MSR_MASK },
> +	[PERF_RAPL_PKG]  = { MSR_PKG_ENERGY_STATUS,      &rapl_events_pkg_group,   test_msr, false, RAPL_MSR_MASK },
> +	[PERF_RAPL_RAM]  = { MSR_DRAM_ENERGY_STATUS,     &rapl_events_ram_group,   test_msr, false, RAPL_MSR_MASK },
> +	[PERF_RAPL_PP1]  = { MSR_PP1_ENERGY_STATUS,      &rapl_events_gpu_group,   test_msr, true,  RAPL_MSR_MASK },
> +	[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group,  test_msr, false, RAPL_MSR_MASK },
> +};
> +
>  /*
>   * Force to PERF_RAPL_PKG_EVENTS_MAX size due to:
>   * - perf_msr_probe(PERF_RAPL_PKG_EVENTS_MAX)
> @@ -826,6 +834,16 @@ static struct rapl_model model_spr = {
>  	.rapl_pkg_msrs	= intel_rapl_spr_msrs,
>  };
>  
> +static struct rapl_model model_rpl = {
> +	.pkg_events	= BIT(PERF_RAPL_PP0) |
> +			  BIT(PERF_RAPL_PKG) |
> +			  BIT(PERF_RAPL_RAM) |
> +			  BIT(PERF_RAPL_PP1) |
> +			  BIT(PERF_RAPL_PSYS),
> +	.msr_power_unit = MSR_RAPL_POWER_UNIT,
> +	.rapl_pkg_msrs  = intel_rapl_mtl_msrs,

It's better to make the name consistent, e.g., intel_rapl_rpl_msrs.

Thanks,
Kan
> +};
> +
>  static struct rapl_model model_amd_hygon = {
>  	.pkg_events	= BIT(PERF_RAPL_PKG),
>  	.core_events	= BIT(PERF_RAPL_CORE),
> @@ -873,13 +891,13 @@ static const struct x86_cpu_id rapl_model_match[] __initconst = {
>  	X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X,	&model_spr),
>  	X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X,	&model_spr),
>  	X86_MATCH_VFM(INTEL_RAPTORLAKE,		&model_skl),
> -	X86_MATCH_VFM(INTEL_RAPTORLAKE_P,	&model_skl),
> +	X86_MATCH_VFM(INTEL_RAPTORLAKE_P,	&model_rpl),
>  	X86_MATCH_VFM(INTEL_RAPTORLAKE_S,	&model_skl),
> -	X86_MATCH_VFM(INTEL_METEORLAKE,		&model_skl),
> -	X86_MATCH_VFM(INTEL_METEORLAKE_L,	&model_skl),
> +	X86_MATCH_VFM(INTEL_METEORLAKE,		&model_rpl),
> +	X86_MATCH_VFM(INTEL_METEORLAKE_L,	&model_rpl),
>  	X86_MATCH_VFM(INTEL_ARROWLAKE_H,	&model_skl),
>  	X86_MATCH_VFM(INTEL_ARROWLAKE,		&model_skl),
> -	X86_MATCH_VFM(INTEL_LUNARLAKE_M,	&model_skl),
> +	X86_MATCH_VFM(INTEL_LUNARLAKE_M,	&model_rpl),
>  	{},
>  };
>  MODULE_DEVICE_TABLE(x86cpu, rapl_model_match);

Lucas De Marchi Feb. 20, 2025, 6:27 p.m. UTC | #3

On Thu, Feb 20, 2025 at 08:28:01AM -0800, Dave Hansen wrote:
>On 2/20/25 07:36, Lucas De Marchi wrote:
>> On some boots the read of MSR_PP1_ENERGY_STATUS msr returns 0, causing
>> perf_msr_probe() to make the power/events/energy-gpu event non-visible.
>> When that happens, the msr always read 0 until the graphics module (i915
>> for Meteor Lake, xe for Lunar Lake) is loaded. Then it starts returning
>> something different and re-loading the rapl module "fixes" it.
>
>What's the root cause here? Did the kernel do something funky? Or is
>this a hardware bug?

 From what I can see, the kernel is reading the value and deciding that "if
it's 0, it doesn't really have that", which is not really true. For
these platforms sometimes it keeps returning 0 until the gpu is
later powered on, which only happens when xe / i915 probes.

But what I don't really understand is why the behavior changes from one
boot to another. I'm assuming it depends on some funky firmware
behavior.

Lucas De Marchi

Lucas De Marchi Feb. 20, 2025, 6:30 p.m. UTC | #4

On Thu, Feb 20, 2025 at 11:47:21AM -0500, Liang, Kan wrote:
>
>
>On 2025-02-20 10:36 a.m., Lucas De Marchi wrote:
>> On some boots the read of MSR_PP1_ENERGY_STATUS msr returns 0, causing
>> perf_msr_probe() to make the power/events/energy-gpu event non-visible.
>> When that happens, the msr always read 0 until the graphics module (i915
>> for Meteor Lake, xe for Lunar Lake) is loaded. Then it starts returning
>> something different and re-loading the rapl module "fixes" it.
>>
>> This is tested on the following platforms with the fail rates before
>> this patch:
>>
>> 	Alder Lake S	0/20
>> 	Arrow Lake H	0/20
>> 	Lunar Lake M	8/20
>> 	Meteor Lake U	6/20
>> 	Raptor Lake P	4/20
>> 	Raptor Lake S	0/20
>>
>> For those platforms failing, use a separate msr list with .no_check
>> set so it doesn't check the runtime value to create the event - it will
>> just return 0 until the i915/xe module initializes the GPU.
>>
>> The issue https://github.com/ulissesf/qmassa/issues/4 is workarounded by
>> reading the MSR directly since it works after xe is loaded, but the
>> issue with not having the perf event is still there.
>>
>> Closes: https://github.com/ulissesf/qmassa/issues/4
>> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4241
>> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com
>> ---
>>
>> Maybe a clearer alternative is to just move all the platforms after
>> RAPTORLAKE with a gpu to use the new msr list.
>>
>>  arch/x86/events/rapl.c | 26 ++++++++++++++++++++++----
>>  1 file changed, 22 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
>> index 4952faf03e82d..18e324b8fa82c 100644
>> --- a/arch/x86/events/rapl.c
>> +++ b/arch/x86/events/rapl.c
>> @@ -588,6 +588,14 @@ static struct perf_msr intel_rapl_spr_msrs[] = {
>>  	[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group,  test_msr, true, RAPL_MSR_MASK },
>>  };
>>
>> +static struct perf_msr intel_rapl_mtl_msrs[] = {
>> +	[PERF_RAPL_PP0]  = { MSR_PP0_ENERGY_STATUS,      &rapl_events_cores_group, test_msr, false, RAPL_MSR_MASK },
>> +	[PERF_RAPL_PKG]  = { MSR_PKG_ENERGY_STATUS,      &rapl_events_pkg_group,   test_msr, false, RAPL_MSR_MASK },
>> +	[PERF_RAPL_RAM]  = { MSR_DRAM_ENERGY_STATUS,     &rapl_events_ram_group,   test_msr, false, RAPL_MSR_MASK },
>> +	[PERF_RAPL_PP1]  = { MSR_PP1_ENERGY_STATUS,      &rapl_events_gpu_group,   test_msr, true,  RAPL_MSR_MASK },
>> +	[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group,  test_msr, false, RAPL_MSR_MASK },
>> +};
>> +
>>  /*
>>   * Force to PERF_RAPL_PKG_EVENTS_MAX size due to:
>>   * - perf_msr_probe(PERF_RAPL_PKG_EVENTS_MAX)
>> @@ -826,6 +834,16 @@ static struct rapl_model model_spr = {
>>  	.rapl_pkg_msrs	= intel_rapl_spr_msrs,
>>  };
>>
>> +static struct rapl_model model_rpl = {
>> +	.pkg_events	= BIT(PERF_RAPL_PP0) |
>> +			  BIT(PERF_RAPL_PKG) |
>> +			  BIT(PERF_RAPL_RAM) |
>> +			  BIT(PERF_RAPL_PP1) |
>> +			  BIT(PERF_RAPL_PSYS),
>> +	.msr_power_unit = MSR_RAPL_POWER_UNIT,
>> +	.rapl_pkg_msrs  = intel_rapl_mtl_msrs,
>
>It's better to make the name consistent, e.g., intel_rapl_rpl_msrs.

that's what happens when you decide to test on RPL just before sending
and forget to rename all the variables. Thanks for noticing.

I will rename it on next version if we decide to keep this approach.
Also please let me know what you think about moving arl and other rpl to
model_rpl, too.

thanks
Lucas De Marchi

>
>Thanks,
>Kan
>> +};
>> +
>>  static struct rapl_model model_amd_hygon = {
>>  	.pkg_events	= BIT(PERF_RAPL_PKG),
>>  	.core_events	= BIT(PERF_RAPL_CORE),
>> @@ -873,13 +891,13 @@ static const struct x86_cpu_id rapl_model_match[] __initconst = {
>>  	X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X,	&model_spr),
>>  	X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X,	&model_spr),
>>  	X86_MATCH_VFM(INTEL_RAPTORLAKE,		&model_skl),
>> -	X86_MATCH_VFM(INTEL_RAPTORLAKE_P,	&model_skl),
>> +	X86_MATCH_VFM(INTEL_RAPTORLAKE_P,	&model_rpl),
>>  	X86_MATCH_VFM(INTEL_RAPTORLAKE_S,	&model_skl),
>> -	X86_MATCH_VFM(INTEL_METEORLAKE,		&model_skl),
>> -	X86_MATCH_VFM(INTEL_METEORLAKE_L,	&model_skl),
>> +	X86_MATCH_VFM(INTEL_METEORLAKE,		&model_rpl),
>> +	X86_MATCH_VFM(INTEL_METEORLAKE_L,	&model_rpl),
>>  	X86_MATCH_VFM(INTEL_ARROWLAKE_H,	&model_skl),
>>  	X86_MATCH_VFM(INTEL_ARROWLAKE,		&model_skl),
>> -	X86_MATCH_VFM(INTEL_LUNARLAKE_M,	&model_skl),
>> +	X86_MATCH_VFM(INTEL_LUNARLAKE_M,	&model_rpl),
>>  	{},
>>  };
>>  MODULE_DEVICE_TABLE(x86cpu, rapl_model_match);
>

Dave Hansen Feb. 20, 2025, 6:53 p.m. UTC | #5

On 2/20/25 10:27, Lucas De Marchi wrote:
> On Thu, Feb 20, 2025 at 08:28:01AM -0800, Dave Hansen wrote:
>> On 2/20/25 07:36, Lucas De Marchi wrote:
>>> On some boots the read of MSR_PP1_ENERGY_STATUS msr returns 0, causing
>>> perf_msr_probe() to make the power/events/energy-gpu event non-visible.
>>> When that happens, the msr always read 0 until the graphics module (i915
>>> for Meteor Lake, xe for Lunar Lake) is loaded. Then it starts returning
>>> something different and re-loading the rapl module "fixes" it.
>>
>> What's the root cause here? Did the kernel do something funky? Or is
>> this a hardware bug?
> 
> From what I can see, the kernel is reading the value and deciding that "if
> it's 0, it doesn't really have that", which is not really true. For
> these platforms sometimes it keeps returning 0 until the gpu is
> later powered on, which only happens when xe / i915 probes.
> 
> But what I don't really understand is why the behavior changes from one
> boot to another. I'm assuming it depends on some funky firmware
> behavior.

Could we root cause this a _bit_ better, please?

Right now, it seems like you noted some weird behavior on one out of the
22 "model_skl" CPUs. You then tested on at least 4 of those CPUs and
found similar behavior. So, you copied, verbatim, the
intel_rapl_skl_msrs and model_skl structures. Then, flipped the
perf_msr->no_check bit for one of the 5 MSRs. There's no note on why the
one bit got flipped or that it's a presumed CPU issue.

To continue the trajectory that this patch sets us on, each CPU model
that comes out needs to be tested. When a new CPU shows up, which one is
it? "model_skl" with the (presumed) CPU bug fixed or "model_rpl"
without? How would someone even know how to test it? It's certainly not
documented in the code.

I don't think that's a sustainable trajectory.

We need to figure out whether the kernel is buggy or the hardware is buggy.

If the hardware is buggy, we need to go ask the hardware guys to publish
an erratum about the bug so there are *bounds* on where the issue shows
up. Basically make the hardware guys document the nasty behavior instead
of having us test every CPU.

Or, if we simply can't trust MSR_PP1_ENERGY_STATUS, let's just do the
attached patch. What's the downside on a non-buggy CPU of doing this?

perf/x86/rapl: Fix PP1 event for Intel Meteor/Lunar Lake

Commit Message

Comments

Patch