diff mbox series

[v6] drm/i915/selftests: Implement frequency logging for energy reading validation

Message ID 20241113095004.2441052-1-sk.anirban@intel.com (mailing list archive)
State New
Headers show
Series [v6] drm/i915/selftests: Implement frequency logging for energy reading validation | expand

Commit Message

Anirban, Sk Nov. 13, 2024, 9:50 a.m. UTC
Introduce RC6 & RC0 frequency logging mechanism to ensure accurate
energy readings aimed at addressing GPU energy leaks and power
measurement failures.
This enhancement will help ensure the accuracy of energy readings.

v2:
  - Improved commit message.
v3:
  - Used pr_err log to display frequency (Anshuman)
  - Sorted headers alphabetically (Sai Teja)
v4:
  - Improved commit message.
  - Fix pr_err log (Sai Teja)
v5:
  - Add error & debug logging for RC0 power and frequency checks (Anshuman)
v6:
  - Modify debug logging for RC0 power and frequency checks (Sai Teja)

Signed-off-by: Sk Anirban <sk.anirban@intel.com>
Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
---
 drivers/gpu/drm/i915/gt/selftest_rc6.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Comments

Nilawar, Badal Nov. 20, 2024, 8:13 a.m. UTC | #1
On 13-11-2024 15:20, Sk Anirban wrote:
> Introduce RC6 & RC0 frequency logging mechanism to ensure accurate
> energy readings aimed at addressing GPU energy leaks and power
> measurement failures.
> This enhancement will help ensure the accuracy of energy readings.
> 
> v2:
>    - Improved commit message.
> v3:
>    - Used pr_err log to display frequency (Anshuman)
>    - Sorted headers alphabetically (Sai Teja)
> v4:
>    - Improved commit message.
>    - Fix pr_err log (Sai Teja)
> v5:
>    - Add error & debug logging for RC0 power and frequency checks (Anshuman)
> v6:
>    - Modify debug logging for RC0 power and frequency checks (Sai Teja)
> 
> Signed-off-by: Sk Anirban <sk.anirban@intel.com>
> Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/selftest_rc6.c | 15 +++++++++++++--
>   1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/selftest_rc6.c b/drivers/gpu/drm/i915/gt/selftest_rc6.c
> index 1aa1446c8fb0..a8776f88d6a1 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_rc6.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_rc6.c
> @@ -8,6 +8,7 @@
>   #include "intel_gpu_commands.h"
>   #include "intel_gt_requests.h"
>   #include "intel_ring.h"
> +#include "intel_rps.h"
>   #include "selftest_rc6.h"
>   
>   #include "selftests/i915_random.h"
> @@ -38,6 +39,9 @@ int live_rc6_manual(void *arg)
>   	ktime_t dt;
>   	u64 res[2];
>   	int err = 0;
> +	u32 rc0_freq = 0;
> +	u32 rc6_freq = 0;
> +	struct intel_rps *rps = &gt->rps;
>   
>   	/*
>   	 * Our claim is that we can "encourage" the GPU to enter rc6 at will.
> @@ -66,6 +70,7 @@ int live_rc6_manual(void *arg)
>   	rc0_power = librapl_energy_uJ() - rc0_power;
>   	dt = ktime_sub(ktime_get(), dt);
>   	res[1] = rc6_residency(rc6);
> +	rc0_freq = intel_rps_read_actual_frequency(rps);
>   	if ((res[1] - res[0]) >> 10) {
>   		pr_err("RC6 residency increased by %lldus while disabled for 1000ms!\n",
>   		       (res[1] - res[0]) >> 10);
> @@ -77,7 +82,11 @@ int live_rc6_manual(void *arg)
>   		rc0_power = div64_u64(NSEC_PER_SEC * rc0_power,
>   				      ktime_to_ns(dt));
>   		if (!rc0_power) {
> -			pr_err("No power measured while in RC0\n");
> +			if (rc0_freq)
> +				pr_err("No power measured while in RC0! GPU Freq: %u in RC0\n",
> +				       rc0_freq);
> +			else
> +				pr_err("No power and freq measured while in RC0\n");
>   			err = -EINVAL;
>   			goto out_unlock;
>   		}
> @@ -91,6 +100,7 @@ int live_rc6_manual(void *arg)
>   	dt = ktime_get();
>   	rc6_power = librapl_energy_uJ();
>   	msleep(100);
> +	rc6_freq = intel_rps_read_actual_frequency(rps);

I think intention of reading frequency here is to know if device was not 
in RC6 when there is failure. But for the platforms below gen12 reading 
act frequency will cause gt wake as GEN6_RPSTAT reg requires forcewake. 
To avoid wake when device is in RC6 read actual frequency without 
applying forcewake.

Additionally add delay, may be delay of 1 seconds after re-enabling RC6 
manually and forcewake flush.

Regards,
Badal

>   	rc6_power = librapl_energy_uJ() - rc6_power;
>   	dt = ktime_sub(ktime_get(), dt);
>   	res[1] = rc6_residency(rc6);
> @@ -108,7 +118,8 @@ int live_rc6_manual(void *arg)
>   		pr_info("GPU consumed %llduW in RC0 and %llduW in RC6\n",
>   			rc0_power, rc6_power);
>   		if (2 * rc6_power > rc0_power) {
> -			pr_err("GPU leaked energy while in RC6!\n");
> +			pr_err("GPU leaked energy while in RC6! GPU Freq: %u in RC6 and %u in RC0\n",
> +			       rc6_freq, rc0_freq);
>   			err = -EINVAL;
>   			goto out_unlock;
>   		}
Gupta, Anshuman Nov. 20, 2024, 10:30 a.m. UTC | #2
> -----Original Message-----
> From: Nilawar, Badal <badal.nilawar@intel.com>
> Sent: Wednesday, November 20, 2024 1:44 PM
> To: Anirban, Sk <sk.anirban@intel.com>; intel-gfx@lists.freedesktop.org
> Cc: Gupta, Anshuman <anshuman.gupta@intel.com>; Poosa, Karthik
> <karthik.poosa@intel.com>; Pottumuttu, Sai Teja
> <sai.teja.pottumuttu@intel.com>
> Subject: Re: [PATCH v6] drm/i915/selftests: Implement frequency logging for
> energy reading validation
> 
> 
> 
> On 13-11-2024 15:20, Sk Anirban wrote:
> > Introduce RC6 & RC0 frequency logging mechanism to ensure accurate
> > energy readings aimed at addressing GPU energy leaks and power
> > measurement failures.
> > This enhancement will help ensure the accuracy of energy readings.
> >
> > v2:
> >    - Improved commit message.
> > v3:
> >    - Used pr_err log to display frequency (Anshuman)
> >    - Sorted headers alphabetically (Sai Teja)
> > v4:
> >    - Improved commit message.
> >    - Fix pr_err log (Sai Teja)
> > v5:
> >    - Add error & debug logging for RC0 power and frequency checks
> > (Anshuman)
> > v6:
> >    - Modify debug logging for RC0 power and frequency checks (Sai
> > Teja)
> >
> > Signed-off-by: Sk Anirban <sk.anirban@intel.com>
> > Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/selftest_rc6.c | 15 +++++++++++++--
> >   1 file changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > b/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > index 1aa1446c8fb0..a8776f88d6a1 100644
> > --- a/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > +++ b/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > @@ -8,6 +8,7 @@
> >   #include "intel_gpu_commands.h"
> >   #include "intel_gt_requests.h"
> >   #include "intel_ring.h"
> > +#include "intel_rps.h"
> >   #include "selftest_rc6.h"
> >
> >   #include "selftests/i915_random.h"
> > @@ -38,6 +39,9 @@ int live_rc6_manual(void *arg)
> >   	ktime_t dt;
> >   	u64 res[2];
> >   	int err = 0;
> > +	u32 rc0_freq = 0;
> > +	u32 rc6_freq = 0;
> > +	struct intel_rps *rps = &gt->rps;
> >
> >   	/*
> >   	 * Our claim is that we can "encourage" the GPU to enter rc6 at will.
> > @@ -66,6 +70,7 @@ int live_rc6_manual(void *arg)
> >   	rc0_power = librapl_energy_uJ() - rc0_power;
> >   	dt = ktime_sub(ktime_get(), dt);
> >   	res[1] = rc6_residency(rc6);
> > +	rc0_freq = intel_rps_read_actual_frequency(rps);
> >   	if ((res[1] - res[0]) >> 10) {
> >   		pr_err("RC6 residency increased by %lldus while disabled for
> 1000ms!\n",
> >   		       (res[1] - res[0]) >> 10);
> > @@ -77,7 +82,11 @@ int live_rc6_manual(void *arg)
> >   		rc0_power = div64_u64(NSEC_PER_SEC * rc0_power,
> >   				      ktime_to_ns(dt));
> >   		if (!rc0_power) {
> > -			pr_err("No power measured while in RC0\n");
> > +			if (rc0_freq)
> > +				pr_err("No power measured while in RC0!
> GPU Freq: %u in RC0\n",
> > +				       rc0_freq);
If rc0 frequency is there then, this has to be pr_dbg, otherwise what is the purpose of this patch.
> > +			else
> > +				pr_err("No power and freq measured while in
> RC0\n");
> >   			err = -EINVAL;
> >   			goto out_unlock;
> >   		}
> > @@ -91,6 +100,7 @@ int live_rc6_manual(void *arg)
> >   	dt = ktime_get();
> >   	rc6_power = librapl_energy_uJ();
> >   	msleep(100);
> > +	rc6_freq = intel_rps_read_actual_frequency(rps);
> 
> I think intention of reading frequency here is to know if device was not in RC6
> when there is failure. But for the platforms below gen12 reading act frequency
> will cause gt wake as GEN6_RPSTAT reg requires forcewake.
> To avoid wake when device is in RC6 read actual frequency without applying
> forcewake.
If reading act_freq will wake the device, How to read frequency without forcewake then ?

Thanks,
Anshuaman
> 
> Additionally add delay, may be delay of 1 seconds after re-enabling RC6
> manually and forcewake flush.
> 
> Regards,
> Badal
> 
> >   	rc6_power = librapl_energy_uJ() - rc6_power;
> >   	dt = ktime_sub(ktime_get(), dt);
> >   	res[1] = rc6_residency(rc6);
> > @@ -108,7 +118,8 @@ int live_rc6_manual(void *arg)
> >   		pr_info("GPU consumed %llduW in RC0 and %llduW in
> RC6\n",
> >   			rc0_power, rc6_power);
> >   		if (2 * rc6_power > rc0_power) {
> > -			pr_err("GPU leaked energy while in RC6!\n");
> > +			pr_err("GPU leaked energy while in RC6! GPU Freq:
> %u in RC6 and %u in RC0\n",
> > +			       rc6_freq, rc0_freq);
> >   			err = -EINVAL;
> >   			goto out_unlock;
> >   		}
Gupta, Anshuman Nov. 20, 2024, 10:43 a.m. UTC | #3
> -----Original Message-----
> From: Gupta, Anshuman
> Sent: Wednesday, November 20, 2024 4:01 PM
> To: Nilawar, Badal <badal.nilawar@intel.com>; Anirban, Sk
> <sk.anirban@intel.com>; intel-gfx@lists.freedesktop.org
> Cc: Poosa, Karthik <karthik.poosa@intel.com>; Pottumuttu, Sai Teja
> <sai.teja.pottumuttu@intel.com>
> Subject: RE: [PATCH v6] drm/i915/selftests: Implement frequency logging for
> energy reading validation
> 
> 
> 
> > -----Original Message-----
> > From: Nilawar, Badal <badal.nilawar@intel.com>
> > Sent: Wednesday, November 20, 2024 1:44 PM
> > To: Anirban, Sk <sk.anirban@intel.com>;
> > intel-gfx@lists.freedesktop.org
> > Cc: Gupta, Anshuman <anshuman.gupta@intel.com>; Poosa, Karthik
> > <karthik.poosa@intel.com>; Pottumuttu, Sai Teja
> > <sai.teja.pottumuttu@intel.com>
> > Subject: Re: [PATCH v6] drm/i915/selftests: Implement frequency
> > logging for energy reading validation
> >
> >
> >
> > On 13-11-2024 15:20, Sk Anirban wrote:
> > > Introduce RC6 & RC0 frequency logging mechanism to ensure accurate
> > > energy readings aimed at addressing GPU energy leaks and power
> > > measurement failures.
> > > This enhancement will help ensure the accuracy of energy readings.
> > >
> > > v2:
> > >    - Improved commit message.
> > > v3:
> > >    - Used pr_err log to display frequency (Anshuman)
> > >    - Sorted headers alphabetically (Sai Teja)
> > > v4:
> > >    - Improved commit message.
> > >    - Fix pr_err log (Sai Teja)
> > > v5:
> > >    - Add error & debug logging for RC0 power and frequency checks
> > > (Anshuman)
> > > v6:
> > >    - Modify debug logging for RC0 power and frequency checks (Sai
> > > Teja)
> > >
> > > Signed-off-by: Sk Anirban <sk.anirban@intel.com>
> > > Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/selftest_rc6.c | 15 +++++++++++++--
> > >   1 file changed, 13 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > > b/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > > index 1aa1446c8fb0..a8776f88d6a1 100644
> > > --- a/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > > +++ b/drivers/gpu/drm/i915/gt/selftest_rc6.c
> > > @@ -8,6 +8,7 @@
> > >   #include "intel_gpu_commands.h"
> > >   #include "intel_gt_requests.h"
> > >   #include "intel_ring.h"
> > > +#include "intel_rps.h"
> > >   #include "selftest_rc6.h"
> > >
> > >   #include "selftests/i915_random.h"
> > > @@ -38,6 +39,9 @@ int live_rc6_manual(void *arg)
> > >   	ktime_t dt;
> > >   	u64 res[2];
> > >   	int err = 0;
> > > +	u32 rc0_freq = 0;
> > > +	u32 rc6_freq = 0;
> > > +	struct intel_rps *rps = &gt->rps;
> > >
> > >   	/*
> > >   	 * Our claim is that we can "encourage" the GPU to enter rc6 at will.
> > > @@ -66,6 +70,7 @@ int live_rc6_manual(void *arg)
> > >   	rc0_power = librapl_energy_uJ() - rc0_power;
> > >   	dt = ktime_sub(ktime_get(), dt);
> > >   	res[1] = rc6_residency(rc6);
> > > +	rc0_freq = intel_rps_read_actual_frequency(rps);
> > >   	if ((res[1] - res[0]) >> 10) {
> > >   		pr_err("RC6 residency increased by %lldus while disabled for
> > 1000ms!\n",
> > >   		       (res[1] - res[0]) >> 10); @@ -77,7 +82,11 @@ int
> > > live_rc6_manual(void *arg)
> > >   		rc0_power = div64_u64(NSEC_PER_SEC * rc0_power,
> > >   				      ktime_to_ns(dt));
> > >   		if (!rc0_power) {
> > > -			pr_err("No power measured while in RC0\n");
> > > +			if (rc0_freq)
> > > +				pr_err("No power measured while in RC0!
> > GPU Freq: %u in RC0\n",
> > > +				       rc0_freq);
> If rc0 frequency is there then, this has to be pr_dbg, otherwise what is the
> purpose of this patch.
It should return from here without any error in case there is no rc0 power but we do have rc0 frequency.
Thanks,
Anshuman.
> > > +			else
> > > +				pr_err("No power and freq measured while in
> > RC0\n");
> > >   			err = -EINVAL;
> > >   			goto out_unlock;
> > >   		}
> > > @@ -91,6 +100,7 @@ int live_rc6_manual(void *arg)
> > >   	dt = ktime_get();
> > >   	rc6_power = librapl_energy_uJ();
> > >   	msleep(100);
> > > +	rc6_freq = intel_rps_read_actual_frequency(rps);
> >
> > I think intention of reading frequency here is to know if device was
> > not in RC6 when there is failure. But for the platforms below gen12
> > reading act frequency will cause gt wake as GEN6_RPSTAT reg requires
> forcewake.
> > To avoid wake when device is in RC6 read actual frequency without
> > applying forcewake.
> If reading act_freq will wake the device, How to read frequency without
> forcewake then ?
> 
> Thanks,
> Anshuaman
> >
> > Additionally add delay, may be delay of 1 seconds after re-enabling
> > RC6 manually and forcewake flush.
> >
> > Regards,
> > Badal
> >
> > >   	rc6_power = librapl_energy_uJ() - rc6_power;
> > >   	dt = ktime_sub(ktime_get(), dt);
> > >   	res[1] = rc6_residency(rc6);
> > > @@ -108,7 +118,8 @@ int live_rc6_manual(void *arg)
> > >   		pr_info("GPU consumed %llduW in RC0 and %llduW in
> > RC6\n",
> > >   			rc0_power, rc6_power);
> > >   		if (2 * rc6_power > rc0_power) {
> > > -			pr_err("GPU leaked energy while in RC6!\n");
> > > +			pr_err("GPU leaked energy while in RC6! GPU Freq:
> > %u in RC6 and %u in RC0\n",
> > > +			       rc6_freq, rc0_freq);
> > >   			err = -EINVAL;
> > >   			goto out_unlock;
> > >   		}
Nilawar, Badal Nov. 20, 2024, 2:50 p.m. UTC | #4
On 20-11-2024 16:00, Gupta, Anshuman wrote:
> 
> 
>> -----Original Message-----
>> From: Nilawar, Badal <badal.nilawar@intel.com>
>> Sent: Wednesday, November 20, 2024 1:44 PM
>> To: Anirban, Sk <sk.anirban@intel.com>; intel-gfx@lists.freedesktop.org
>> Cc: Gupta, Anshuman <anshuman.gupta@intel.com>; Poosa, Karthik
>> <karthik.poosa@intel.com>; Pottumuttu, Sai Teja
>> <sai.teja.pottumuttu@intel.com>
>> Subject: Re: [PATCH v6] drm/i915/selftests: Implement frequency logging for
>> energy reading validation
>>
>>
>>
>> On 13-11-2024 15:20, Sk Anirban wrote:
>>> Introduce RC6 & RC0 frequency logging mechanism to ensure accurate
>>> energy readings aimed at addressing GPU energy leaks and power
>>> measurement failures.
>>> This enhancement will help ensure the accuracy of energy readings.
>>>
>>> v2:
>>>     - Improved commit message.
>>> v3:
>>>     - Used pr_err log to display frequency (Anshuman)
>>>     - Sorted headers alphabetically (Sai Teja)
>>> v4:
>>>     - Improved commit message.
>>>     - Fix pr_err log (Sai Teja)
>>> v5:
>>>     - Add error & debug logging for RC0 power and frequency checks
>>> (Anshuman)
>>> v6:
>>>     - Modify debug logging for RC0 power and frequency checks (Sai
>>> Teja)
>>>
>>> Signed-off-by: Sk Anirban <sk.anirban@intel.com>
>>> Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/selftest_rc6.c | 15 +++++++++++++--
>>>    1 file changed, 13 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>> b/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>> index 1aa1446c8fb0..a8776f88d6a1 100644
>>> --- a/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>> +++ b/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>> @@ -8,6 +8,7 @@
>>>    #include "intel_gpu_commands.h"
>>>    #include "intel_gt_requests.h"
>>>    #include "intel_ring.h"
>>> +#include "intel_rps.h"
>>>    #include "selftest_rc6.h"
>>>
>>>    #include "selftests/i915_random.h"
>>> @@ -38,6 +39,9 @@ int live_rc6_manual(void *arg)
>>>    	ktime_t dt;
>>>    	u64 res[2];
>>>    	int err = 0;
>>> +	u32 rc0_freq = 0;
>>> +	u32 rc6_freq = 0;
>>> +	struct intel_rps *rps = &gt->rps;
>>>
>>>    	/*
>>>    	 * Our claim is that we can "encourage" the GPU to enter rc6 at will.
>>> @@ -66,6 +70,7 @@ int live_rc6_manual(void *arg)
>>>    	rc0_power = librapl_energy_uJ() - rc0_power;
>>>    	dt = ktime_sub(ktime_get(), dt);
>>>    	res[1] = rc6_residency(rc6);
>>> +	rc0_freq = intel_rps_read_actual_frequency(rps);
>>>    	if ((res[1] - res[0]) >> 10) {
>>>    		pr_err("RC6 residency increased by %lldus while disabled for
>> 1000ms!\n",
>>>    		       (res[1] - res[0]) >> 10);
>>> @@ -77,7 +82,11 @@ int live_rc6_manual(void *arg)
>>>    		rc0_power = div64_u64(NSEC_PER_SEC * rc0_power,
>>>    				      ktime_to_ns(dt));
>>>    		if (!rc0_power) {
>>> -			pr_err("No power measured while in RC0\n");
>>> +			if (rc0_freq)
>>> +				pr_err("No power measured while in RC0!
>> GPU Freq: %u in RC0\n",
>>> +				       rc0_freq);
> If rc0 frequency is there then, this has to be pr_dbg, otherwise what is the purpose of this patch.

I too didn't understand purpose of this. How this going to help for 
accurate energy readings.

>>> +			else
>>> +				pr_err("No power and freq measured while in
>> RC0\n");
>>>    			err = -EINVAL;
>>>    			goto out_unlock;
>>>    		}
>>> @@ -91,6 +100,7 @@ int live_rc6_manual(void *arg)
>>>    	dt = ktime_get();
>>>    	rc6_power = librapl_energy_uJ();
>>>    	msleep(100);
>>> +	rc6_freq = intel_rps_read_actual_frequency(rps);
>>
>> I think intention of reading frequency here is to know if device was not in RC6
>> when there is failure. But for the platforms below gen12 reading act frequency
>> will cause gt wake as GEN6_RPSTAT reg requires forcewake.
>> To avoid wake when device is in RC6 read actual frequency without applying
>> forcewake.
> If reading act_freq will wake the device, How to read frequency without forcewake then ?

Use this api intel_rps_read_actual_frequency_fw to read freq without 
forcewake.

Regards,
Badal

> 
> Thanks,
> Anshuaman
>>
>> Additionally add delay, may be delay of 1 seconds after re-enabling RC6
>> manually and forcewake flush.
>>
>> Regards,
>> Badal
>>
>>>    	rc6_power = librapl_energy_uJ() - rc6_power;
>>>    	dt = ktime_sub(ktime_get(), dt);
>>>    	res[1] = rc6_residency(rc6);
>>> @@ -108,7 +118,8 @@ int live_rc6_manual(void *arg)
>>>    		pr_info("GPU consumed %llduW in RC0 and %llduW in
>> RC6\n",
>>>    			rc0_power, rc6_power);
>>>    		if (2 * rc6_power > rc0_power) {
>>> -			pr_err("GPU leaked energy while in RC6!\n");
>>> +			pr_err("GPU leaked energy while in RC6! GPU Freq:
>> %u in RC6 and %u in RC0\n",
>>> +			       rc6_freq, rc0_freq);
>>>    			err = -EINVAL;
>>>    			goto out_unlock;
>>>    		}
>
Anirban, Sk Nov. 20, 2024, 6:08 p.m. UTC | #5
On 20-11-2024 20:20, Nilawar, Badal wrote:
>
>
> On 20-11-2024 16:00, Gupta, Anshuman wrote:
>>
>>
>>> -----Original Message-----
>>> From: Nilawar, Badal <badal.nilawar@intel.com>
>>> Sent: Wednesday, November 20, 2024 1:44 PM
>>> To: Anirban, Sk <sk.anirban@intel.com>; intel-gfx@lists.freedesktop.org
>>> Cc: Gupta, Anshuman <anshuman.gupta@intel.com>; Poosa, Karthik
>>> <karthik.poosa@intel.com>; Pottumuttu, Sai Teja
>>> <sai.teja.pottumuttu@intel.com>
>>> Subject: Re: [PATCH v6] drm/i915/selftests: Implement frequency 
>>> logging for
>>> energy reading validation
>>>
>>>
>>>
>>> On 13-11-2024 15:20, Sk Anirban wrote:
>>>> Introduce RC6 & RC0 frequency logging mechanism to ensure accurate
>>>> energy readings aimed at addressing GPU energy leaks and power
>>>> measurement failures.
>>>> This enhancement will help ensure the accuracy of energy readings.
>>>>
>>>> v2:
>>>>     - Improved commit message.
>>>> v3:
>>>>     - Used pr_err log to display frequency (Anshuman)
>>>>     - Sorted headers alphabetically (Sai Teja)
>>>> v4:
>>>>     - Improved commit message.
>>>>     - Fix pr_err log (Sai Teja)
>>>> v5:
>>>>     - Add error & debug logging for RC0 power and frequency checks
>>>> (Anshuman)
>>>> v6:
>>>>     - Modify debug logging for RC0 power and frequency checks (Sai
>>>> Teja)
>>>>
>>>> Signed-off-by: Sk Anirban <sk.anirban@intel.com>
>>>> Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/gt/selftest_rc6.c | 15 +++++++++++++--
>>>>    1 file changed, 13 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>>> b/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>>> index 1aa1446c8fb0..a8776f88d6a1 100644
>>>> --- a/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>>> +++ b/drivers/gpu/drm/i915/gt/selftest_rc6.c
>>>> @@ -8,6 +8,7 @@
>>>>    #include "intel_gpu_commands.h"
>>>>    #include "intel_gt_requests.h"
>>>>    #include "intel_ring.h"
>>>> +#include "intel_rps.h"
>>>>    #include "selftest_rc6.h"
>>>>
>>>>    #include "selftests/i915_random.h"
>>>> @@ -38,6 +39,9 @@ int live_rc6_manual(void *arg)
>>>>        ktime_t dt;
>>>>        u64 res[2];
>>>>        int err = 0;
>>>> +    u32 rc0_freq = 0;
>>>> +    u32 rc6_freq = 0;
>>>> +    struct intel_rps *rps = &gt->rps;
>>>>
>>>>        /*
>>>>         * Our claim is that we can "encourage" the GPU to enter rc6 
>>>> at will.
>>>> @@ -66,6 +70,7 @@ int live_rc6_manual(void *arg)
>>>>        rc0_power = librapl_energy_uJ() - rc0_power;
>>>>        dt = ktime_sub(ktime_get(), dt);
>>>>        res[1] = rc6_residency(rc6);
>>>> +    rc0_freq = intel_rps_read_actual_frequency(rps);
>>>>        if ((res[1] - res[0]) >> 10) {
>>>>            pr_err("RC6 residency increased by %lldus while disabled 
>>>> for
>>> 1000ms!\n",
>>>>                   (res[1] - res[0]) >> 10);
>>>> @@ -77,7 +82,11 @@ int live_rc6_manual(void *arg)
>>>>            rc0_power = div64_u64(NSEC_PER_SEC * rc0_power,
>>>>                          ktime_to_ns(dt));
>>>>            if (!rc0_power) {
>>>> -            pr_err("No power measured while in RC0\n");
>>>> +            if (rc0_freq)
>>>> +                pr_err("No power measured while in RC0!
>>> GPU Freq: %u in RC0\n",
>>>> +                       rc0_freq);
>> If rc0 frequency is there then, this has to be pr_dbg, otherwise what 
>> is the purpose of this patch.
>
> I too didn't understand purpose of this. How this going to help for 
> accurate energy readings.
in case of rc0 power is 0 , I just want to confirm whether the freq is 
available there or not ? Also it is defined as pr_err because there is 
no point of validate rc0 power below if no power is being measured.
>
>>>> +            else
>>>> +                pr_err("No power and freq measured while in
>>> RC0\n");
>>>>                err = -EINVAL;
>>>>                goto out_unlock;
>>>>            }
>>>> @@ -91,6 +100,7 @@ int live_rc6_manual(void *arg)
>>>>        dt = ktime_get();
>>>>        rc6_power = librapl_energy_uJ();
>>>>        msleep(100);
>>>> +    rc6_freq = intel_rps_read_actual_frequency(rps);
>>>
>>> I think intention of reading frequency here is to know if device was 
>>> not in RC6
>>> when there is failure. But for the platforms below gen12 reading act 
>>> frequency
>>> will cause gt wake as GEN6_RPSTAT reg requires forcewake.
>>> To avoid wake when device is in RC6 read actual frequency without 
>>> applying
>>> forcewake.
>> If reading act_freq will wake the device, How to read frequency 
>> without forcewake then ?
>
> Use this api intel_rps_read_actual_frequency_fw to read freq without 
> forcewake.
>
> Regards,
> Badal
>
>>
>> Thanks,
>> Anshuaman
>>>
>>> Additionally add delay, may be delay of 1 seconds after re-enabling RC6
>>> manually and forcewake flush.
yeah, I can use that to read the actual freq but there is already a 
check involved just to cross verify whether the system has entered RC6 
or not.
>>>
>>> Regards,
>>> Badal
Thanks,
Anirban
>>>
>>>>        rc6_power = librapl_energy_uJ() - rc6_power;
>>>>        dt = ktime_sub(ktime_get(), dt);
>>>>        res[1] = rc6_residency(rc6);
>>>> @@ -108,7 +118,8 @@ int live_rc6_manual(void *arg)
>>>>            pr_info("GPU consumed %llduW in RC0 and %llduW in
>>> RC6\n",
>>>>                rc0_power, rc6_power);
>>>>            if (2 * rc6_power > rc0_power) {
>>>> -            pr_err("GPU leaked energy while in RC6!\n");
>>>> +            pr_err("GPU leaked energy while in RC6! GPU Freq:
>>> %u in RC6 and %u in RC0\n",
>>>> +                   rc6_freq, rc0_freq);
>>>>                err = -EINVAL;
>>>>                goto out_unlock;
>>>>            }
>>
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/selftest_rc6.c b/drivers/gpu/drm/i915/gt/selftest_rc6.c
index 1aa1446c8fb0..a8776f88d6a1 100644
--- a/drivers/gpu/drm/i915/gt/selftest_rc6.c
+++ b/drivers/gpu/drm/i915/gt/selftest_rc6.c
@@ -8,6 +8,7 @@ 
 #include "intel_gpu_commands.h"
 #include "intel_gt_requests.h"
 #include "intel_ring.h"
+#include "intel_rps.h"
 #include "selftest_rc6.h"
 
 #include "selftests/i915_random.h"
@@ -38,6 +39,9 @@  int live_rc6_manual(void *arg)
 	ktime_t dt;
 	u64 res[2];
 	int err = 0;
+	u32 rc0_freq = 0;
+	u32 rc6_freq = 0;
+	struct intel_rps *rps = &gt->rps;
 
 	/*
 	 * Our claim is that we can "encourage" the GPU to enter rc6 at will.
@@ -66,6 +70,7 @@  int live_rc6_manual(void *arg)
 	rc0_power = librapl_energy_uJ() - rc0_power;
 	dt = ktime_sub(ktime_get(), dt);
 	res[1] = rc6_residency(rc6);
+	rc0_freq = intel_rps_read_actual_frequency(rps);
 	if ((res[1] - res[0]) >> 10) {
 		pr_err("RC6 residency increased by %lldus while disabled for 1000ms!\n",
 		       (res[1] - res[0]) >> 10);
@@ -77,7 +82,11 @@  int live_rc6_manual(void *arg)
 		rc0_power = div64_u64(NSEC_PER_SEC * rc0_power,
 				      ktime_to_ns(dt));
 		if (!rc0_power) {
-			pr_err("No power measured while in RC0\n");
+			if (rc0_freq)
+				pr_err("No power measured while in RC0! GPU Freq: %u in RC0\n",
+				       rc0_freq);
+			else
+				pr_err("No power and freq measured while in RC0\n");
 			err = -EINVAL;
 			goto out_unlock;
 		}
@@ -91,6 +100,7 @@  int live_rc6_manual(void *arg)
 	dt = ktime_get();
 	rc6_power = librapl_energy_uJ();
 	msleep(100);
+	rc6_freq = intel_rps_read_actual_frequency(rps);
 	rc6_power = librapl_energy_uJ() - rc6_power;
 	dt = ktime_sub(ktime_get(), dt);
 	res[1] = rc6_residency(rc6);
@@ -108,7 +118,8 @@  int live_rc6_manual(void *arg)
 		pr_info("GPU consumed %llduW in RC0 and %llduW in RC6\n",
 			rc0_power, rc6_power);
 		if (2 * rc6_power > rc0_power) {
-			pr_err("GPU leaked energy while in RC6!\n");
+			pr_err("GPU leaked energy while in RC6! GPU Freq: %u in RC6 and %u in RC0\n",
+			       rc6_freq, rc0_freq);
 			err = -EINVAL;
 			goto out_unlock;
 		}