[v4,1/3] drm/i915/gt: Disable HW load balancing for CCS

Message ID	20240306012247.246003-2-andi.shyti@linux.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Andi Shyti <andi.shyti@linux.intel.com> To: intel-gfx <intel-gfx@lists.freedesktop.org>, dri-devel <dri-devel@lists.freedesktop.org> Cc: Chris Wilson <chris.p.wilson@linux.intel.com>, Joonas Lahtinen <joonas.lahtinen@linux.intel.com>, Matt Roper <matthew.d.roper@intel.com>, John Harrison <John.C.Harrison@Intel.com>, stable@vger.kernel.org, Andi Shyti <andi.shyti@linux.intel.com>, Andi Shyti <andi.shyti@kernel.org>, Tvrtko Ursulin <tursulin@ursulin.net> Subject: [PATCH v4 1/3] drm/i915/gt: Disable HW load balancing for CCS Date: Wed, 6 Mar 2024 02:22:45 +0100 Message-ID: <20240306012247.246003-2-andi.shyti@linux.intel.com> In-Reply-To: <20240306012247.246003-1-andi.shyti@linux.intel.com> References: <20240306012247.246003-1-andi.shyti@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	Disable automatic load CCS load balancing \| expand [v4,0/3] Disable automatic load CCS load balancing [v4,1/3] drm/i915/gt: Disable HW load balancing for CCS [v4,2/3] drm/i915/gt: Refactor uabi engine class/instance list creation [v4,3/3] drm/i915/gt: Enable only one CCS for compute workload

Message ID

20240306012247.246003-2-andi.shyti@linux.intel.com (mailing list archive)

State

New, archived

Headers

From: Andi Shyti <andi.shyti@linux.intel.com>
To: intel-gfx <intel-gfx@lists.freedesktop.org>,
 dri-devel <dri-devel@lists.freedesktop.org>
Cc: Chris Wilson <chris.p.wilson@linux.intel.com>,
 Joonas Lahtinen <joonas.lahtinen@linux.intel.com>,
 Matt Roper <matthew.d.roper@intel.com>,
 John Harrison <John.C.Harrison@Intel.com>, stable@vger.kernel.org,
 Andi Shyti <andi.shyti@linux.intel.com>,
 Andi Shyti <andi.shyti@kernel.org>, Tvrtko Ursulin <tursulin@ursulin.net>
Subject: [PATCH v4 1/3] drm/i915/gt: Disable HW load balancing for CCS
Date: Wed,  6 Mar 2024 02:22:45 +0100
Message-ID: <20240306012247.246003-2-andi.shyti@linux.intel.com>
In-Reply-To: <20240306012247.246003-1-andi.shyti@linux.intel.com>
References: <20240306012247.246003-1-andi.shyti@linux.intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

Disable automatic load CCS load balancing | expand

Commit Message

Andi Shyti March 6, 2024, 1:22 a.m. UTC

The hardware should not dynamically balance the load between CCS
engines. Wa_14019159160 recommends disabling it across all
platforms.

Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: <stable@vger.kernel.org> # v6.2+
---
 drivers/gpu/drm/i915/gt/intel_gt_regs.h     | 1 +
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 5 +++++
 2 files changed, 6 insertions(+)

Comments

Matt Roper March 6, 2024, 11:46 p.m. UTC | #1

On Wed, Mar 06, 2024 at 02:22:45AM +0100, Andi Shyti wrote:
> The hardware should not dynamically balance the load between CCS
> engines. Wa_14019159160 recommends disabling it across all
> platforms.
> 
> Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: <stable@vger.kernel.org> # v6.2+
> ---
>  drivers/gpu/drm/i915/gt/intel_gt_regs.h     | 1 +
>  drivers/gpu/drm/i915/gt/intel_workarounds.c | 5 +++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> index 50962cfd1353..cf709f6c05ae 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> @@ -1478,6 +1478,7 @@
>  
>  #define GEN12_RCU_MODE				_MMIO(0x14800)
>  #define   GEN12_RCU_MODE_CCS_ENABLE		REG_BIT(0)
> +#define   XEHP_RCU_MODE_FIXED_SLICE_CCS_MODE	REG_BIT(1)
>  
>  #define CHV_FUSE_GT				_MMIO(VLV_GUNIT_BASE + 0x2168)
>  #define   CHV_FGT_DISABLE_SS0			(1 << 10)
> diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> index d67d44611c28..a2e78cf0b5f5 100644
> --- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
> +++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> @@ -2945,6 +2945,11 @@ general_render_compute_wa_init(struct intel_engine_cs *engine, struct i915_wa_li
>  
>  		/* Wa_18028616096 */
>  		wa_mcr_write_or(wal, LSC_CHICKEN_BIT_0_UDW, UGM_FRAGMENT_THRESHOLD_TO_3);
> +
> +		/*
> +		 * Wa_14019159160: disable the automatic CCS load balancing

I'm still a bit concerned that this doesn't really match what this
specific workaround is asking us to do.  There seems to be an agreement
on various internal email threads that we need to disable load
balancing, but there's no single specific workaround that officially
documents that decision.

This specific workaround asks us to do a bunch of different things, and
the third item it asks for is to disable load balancing in very specific
cases (i.e., while the RCS is active at the same time as one or more CCS
engines).  Taking this workaround in isolation, it would be valid to
keep load balancing active if you were just using the CCS engines and
leaving the RCS idle, or if balancing was turned on/off by the GuC
scheduler according to engine use at the moment, as the documented
workaround seems to assume will be the case.

So in general I think we do need to disable load balancing based on
other offline discussion, but blaming that entire change on
Wa_14019159160 seems a bit questionable since it's not really what this
specific workaround is asking us to do and someone may come back and try
to "correct" the implementation of this workaround in the future without
realizing there are other factors too.  It would be great if we could
get hardware teams to properly document this expectation somewhere
(either in a separate dedicated workaround, or in the MMIO tuning guide)
so that we'll have a more direct and authoritative source for such a
large behavioral change.

Matt

> +		 */
> +		wa_masked_en(wal, GEN12_RCU_MODE, XEHP_RCU_MODE_FIXED_SLICE_CCS_MODE);
>  	}
>  
>  	if (IS_DG2_G11(i915)) {
> -- 
> 2.43.0
>

Andi Shyti March 7, 2024, 8:02 p.m. UTC | #2

Hi Matt,

On Wed, Mar 06, 2024 at 03:46:09PM -0800, Matt Roper wrote:
> On Wed, Mar 06, 2024 at 02:22:45AM +0100, Andi Shyti wrote:
> > The hardware should not dynamically balance the load between CCS
> > engines. Wa_14019159160 recommends disabling it across all
> > platforms.
> > 
> > Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
> > Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> > Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Matt Roper <matthew.d.roper@intel.com>
> > Cc: <stable@vger.kernel.org> # v6.2+
> > ---
> >  drivers/gpu/drm/i915/gt/intel_gt_regs.h     | 1 +
> >  drivers/gpu/drm/i915/gt/intel_workarounds.c | 5 +++++
> >  2 files changed, 6 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> > index 50962cfd1353..cf709f6c05ae 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
> > @@ -1478,6 +1478,7 @@
> >  
> >  #define GEN12_RCU_MODE				_MMIO(0x14800)
> >  #define   GEN12_RCU_MODE_CCS_ENABLE		REG_BIT(0)
> > +#define   XEHP_RCU_MODE_FIXED_SLICE_CCS_MODE	REG_BIT(1)
> >  
> >  #define CHV_FUSE_GT				_MMIO(VLV_GUNIT_BASE + 0x2168)
> >  #define   CHV_FGT_DISABLE_SS0			(1 << 10)
> > diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > index d67d44611c28..a2e78cf0b5f5 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
> > @@ -2945,6 +2945,11 @@ general_render_compute_wa_init(struct intel_engine_cs *engine, struct i915_wa_li
> >  
> >  		/* Wa_18028616096 */
> >  		wa_mcr_write_or(wal, LSC_CHICKEN_BIT_0_UDW, UGM_FRAGMENT_THRESHOLD_TO_3);
> > +
> > +		/*
> > +		 * Wa_14019159160: disable the automatic CCS load balancing
> 
> I'm still a bit concerned that this doesn't really match what this
> specific workaround is asking us to do.  There seems to be an agreement
> on various internal email threads that we need to disable load
> balancing, but there's no single specific workaround that officially
> documents that decision.
> 
> This specific workaround asks us to do a bunch of different things, and
> the third item it asks for is to disable load balancing in very specific
> cases (i.e., while the RCS is active at the same time as one or more CCS
> engines).  Taking this workaround in isolation, it would be valid to
> keep load balancing active if you were just using the CCS engines and
> leaving the RCS idle, or if balancing was turned on/off by the GuC
> scheduler according to engine use at the moment, as the documented
> workaround seems to assume will be the case.
> 
> So in general I think we do need to disable load balancing based on
> other offline discussion, but blaming that entire change on
> Wa_14019159160 seems a bit questionable since it's not really what this
> specific workaround is asking us to do and someone may come back and try
> to "correct" the implementation of this workaround in the future without
> realizing there are other factors too.  It would be great if we could
> get hardware teams to properly document this expectation somewhere
> (either in a separate dedicated workaround, or in the MMIO tuning guide)
> so that we'll have a more direct and authoritative source for such a
> large behavioral change.

On one had I think you are right, on the other hand I think this
workaround has not properly developed in what we have been
describing later.

Perhaps, one solution would be to create a new generic workaround
for all platforms with more than one CCS and put everyone at
peace. But I don't know the process.

Are you able to help here? Or Joonas?

Thanks, Matt!
Andi

John Harrison March 7, 2024, 8:18 p.m. UTC | #3

On 3/7/2024 12:02, Andi Shyti wrote:
> Hi Matt,
>
> On Wed, Mar 06, 2024 at 03:46:09PM -0800, Matt Roper wrote:
>> On Wed, Mar 06, 2024 at 02:22:45AM +0100, Andi Shyti wrote:
>>> The hardware should not dynamically balance the load between CCS
>>> engines. Wa_14019159160 recommends disabling it across all
>>> platforms.
>>>
>>> Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement")
>>> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
>>> Cc: Chris Wilson <chris.p.wilson@linux.intel.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>> Cc: <stable@vger.kernel.org> # v6.2+
>>> ---
>>>   drivers/gpu/drm/i915/gt/intel_gt_regs.h     | 1 +
>>>   drivers/gpu/drm/i915/gt/intel_workarounds.c | 5 +++++
>>>   2 files changed, 6 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
>>> index 50962cfd1353..cf709f6c05ae 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
>>> @@ -1478,6 +1478,7 @@
>>>   
>>>   #define GEN12_RCU_MODE				_MMIO(0x14800)
>>>   #define   GEN12_RCU_MODE_CCS_ENABLE		REG_BIT(0)
>>> +#define   XEHP_RCU_MODE_FIXED_SLICE_CCS_MODE	REG_BIT(1)
>>>   
>>>   #define CHV_FUSE_GT				_MMIO(VLV_GUNIT_BASE + 0x2168)
>>>   #define   CHV_FGT_DISABLE_SS0			(1 << 10)
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
>>> index d67d44611c28..a2e78cf0b5f5 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
>>> @@ -2945,6 +2945,11 @@ general_render_compute_wa_init(struct intel_engine_cs *engine, struct i915_wa_li
>>>   
>>>   		/* Wa_18028616096 */
>>>   		wa_mcr_write_or(wal, LSC_CHICKEN_BIT_0_UDW, UGM_FRAGMENT_THRESHOLD_TO_3);
>>> +
>>> +		/*
>>> +		 * Wa_14019159160: disable the automatic CCS load balancing
>> I'm still a bit concerned that this doesn't really match what this
>> specific workaround is asking us to do.  There seems to be an agreement
>> on various internal email threads that we need to disable load
>> balancing, but there's no single specific workaround that officially
>> documents that decision.
>>
>> This specific workaround asks us to do a bunch of different things, and
>> the third item it asks for is to disable load balancing in very specific
>> cases (i.e., while the RCS is active at the same time as one or more CCS
>> engines).  Taking this workaround in isolation, it would be valid to
>> keep load balancing active if you were just using the CCS engines and
>> leaving the RCS idle, or if balancing was turned on/off by the GuC
>> scheduler according to engine use at the moment, as the documented
>> workaround seems to assume will be the case.
>>
>> So in general I think we do need to disable load balancing based on
>> other offline discussion, but blaming that entire change on
>> Wa_14019159160 seems a bit questionable since it's not really what this
>> specific workaround is asking us to do and someone may come back and try
>> to "correct" the implementation of this workaround in the future without
>> realizing there are other factors too.  It would be great if we could
>> get hardware teams to properly document this expectation somewhere
>> (either in a separate dedicated workaround, or in the MMIO tuning guide)
>> so that we'll have a more direct and authoritative source for such a
>> large behavioral change.
> On one had I think you are right, on the other hand I think this
> workaround has not properly developed in what we have been
> describing later.
I think it is not so much that the w/a is 'not properly developed'. It's 
more that this w/a plus others when taken in combination plus knowledge 
of future directions has led to an architectural decision that is beyond 
the scope of the w/a.

As such, I think Matt is definitely correct. Tagging a code change with 
a w/a number when that change does something very different to what is 
described in the w/a is wrong and a maintenance issue waiting to happen.

At the very least, you should just put in a comment explaining the 
situation. E.g.:

  /*
  * Wa_14019159160: This w/a plus others cause significant issues with the use of
  * load balancing. Hence an architectural level decision was taking to simply
  * disable automatic CCS load balancing completely.
  */

Ideally yes, we would get an officially trackable software only 
workaround number or something created and just use that. But in the 
meantime, just clearly explaining the situation seems reasonable to me.

John.


>
> Perhaps, one solution would be to create a new generic workaround
> for all platforms with more than one CCS and put everyone at
> peace. But I don't know the process.
>
> Are you able to help here? Or Joonas?
>
> Thanks, Matt!
> Andi

Andi Shyti March 7, 2024, 11:53 p.m. UTC | #4

Hi John,

...

> > > > +
> > > > +		/*
> > > > +		 * Wa_14019159160: disable the automatic CCS load balancing
> > > I'm still a bit concerned that this doesn't really match what this
> > > specific workaround is asking us to do.  There seems to be an agreement
> > > on various internal email threads that we need to disable load
> > > balancing, but there's no single specific workaround that officially
> > > documents that decision.
> > > 
> > > This specific workaround asks us to do a bunch of different things, and
> > > the third item it asks for is to disable load balancing in very specific
> > > cases (i.e., while the RCS is active at the same time as one or more CCS
> > > engines).  Taking this workaround in isolation, it would be valid to
> > > keep load balancing active if you were just using the CCS engines and
> > > leaving the RCS idle, or if balancing was turned on/off by the GuC
> > > scheduler according to engine use at the moment, as the documented
> > > workaround seems to assume will be the case.
> > > 
> > > So in general I think we do need to disable load balancing based on
> > > other offline discussion, but blaming that entire change on
> > > Wa_14019159160 seems a bit questionable since it's not really what this
> > > specific workaround is asking us to do and someone may come back and try
> > > to "correct" the implementation of this workaround in the future without
> > > realizing there are other factors too.  It would be great if we could
> > > get hardware teams to properly document this expectation somewhere
> > > (either in a separate dedicated workaround, or in the MMIO tuning guide)
> > > so that we'll have a more direct and authoritative source for such a
> > > large behavioral change.
> > On one had I think you are right, on the other hand I think this
> > workaround has not properly developed in what we have been
> > describing later.
> I think it is not so much that the w/a is 'not properly developed'. It's
> more that this w/a plus others when taken in combination plus knowledge of
> future directions has led to an architectural decision that is beyond the
> scope of the w/a.
> 
> As such, I think Matt is definitely correct. Tagging a code change with a
> w/a number when that change does something very different to what is
> described in the w/a is wrong and a maintenance issue waiting to happen.
> 
> At the very least, you should just put in a comment explaining the
> situation. E.g.:
> 
>  /*
>  * Wa_14019159160: This w/a plus others cause significant issues with the use of
>  * load balancing. Hence an architectural level decision was taking to simply
>  * disable automatic CCS load balancing completely.
>  */
Good suggestion! I will anyway check tomorrow with Joonas if it's
worth the effort to set up a new "software" workaround.

Thanks,
Andi

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
index 50962cfd1353..cf709f6c05ae 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h
@@ -1478,6 +1478,7 @@ 
 
 #define GEN12_RCU_MODE				_MMIO(0x14800)
 #define   GEN12_RCU_MODE_CCS_ENABLE		REG_BIT(0)
+#define   XEHP_RCU_MODE_FIXED_SLICE_CCS_MODE	REG_BIT(1)
 
 #define CHV_FUSE_GT				_MMIO(VLV_GUNIT_BASE + 0x2168)
 #define   CHV_FGT_DISABLE_SS0			(1 << 10)
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index d67d44611c28..a2e78cf0b5f5 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -2945,6 +2945,11 @@  general_render_compute_wa_init(struct intel_engine_cs *engine, struct i915_wa_li
 
 		/* Wa_18028616096 */
 		wa_mcr_write_or(wal, LSC_CHICKEN_BIT_0_UDW, UGM_FRAGMENT_THRESHOLD_TO_3);
+
+		/*
+		 * Wa_14019159160: disable the automatic CCS load balancing
+		 */
+		wa_masked_en(wal, GEN12_RCU_MODE, XEHP_RCU_MODE_FIXED_SLICE_CCS_MODE);
 	}
 
 	if (IS_DG2_G11(i915)) {

[v4,1/3] drm/i915/gt: Disable HW load balancing for CCS

Commit Message

Comments

Patch