diff mbox

drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

Message ID 1420944289-832-1-git-send-email-kenneth@whitecape.org (mailing list archive)
State New, archived
Headers show

Commit Message

Kenneth Graunke Jan. 11, 2015, 2:44 a.m. UTC
This is an important optimization for avoiding read-after-write (RAW)
stalls in the HiZ buffer.  Certain workloads would run very slowly with
HiZ enabled, but run much faster with the "hiz=false" driconf option.
With this patch, they run at full speed even with HiZ.

Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
(Iris Pro 6200).

Thanks to Jesse Barnes for finding this missing bit!
Thanks to Chris Wilson for helping me find where to set it.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Here's an alternate patch which implements the workaround in the kernel
instead of Mesa.  It's probably better to do it there, since the kernel
does it on Haswell already.

Comments

Ben Widawsky Jan. 11, 2015, 9:49 p.m. UTC | #1
On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> This is an important optimization for avoiding read-after-write (RAW)
> stalls in the HiZ buffer.  Certain workloads would run very slowly with
> HiZ enabled, but run much faster with the "hiz=false" driconf option.
> With this patch, they run at full speed even with HiZ.
> 
> Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> (Iris Pro 6200).
> 
> Thanks to Jesse Barnes for finding this missing bit!
> Thanks to Chris Wilson for helping me find where to set it.
> 
> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> ---
>  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> Here's an alternate patch which implements the workaround in the kernel
> instead of Mesa.  It's probably better to do it there, since the kernel
> does it on Haswell already.
> 
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index dabc1d8..23020d6 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
>  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
>  
> +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> +	 *  buffer."
> +	 *
> +	 * This optimization is off by default for Broadwell; turn it on.
> +	 */
> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> +
>  	/* Wa4x4STCOptimizationDisable:bdw */
>  	WA_SET_BIT_MASKED(CACHE_MODE_1,
>  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
>  			  HDC_FORCE_NON_COHERENT |
>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
>  
> +	/* According to the CACHE_MODE_0 default value documentation, some
> +	 * CHV platforms disable this optimization by default.  Turn it on.
> +	 */
> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> +
>  	/* Improve HiZ throughput on CHV. */
>  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
>  

I think you should do this as two separate patches, 1 per platform. For the BSW
patch (given that I had the same functionality in the kernel patch I asked you
to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
which we can use for the commit):
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

I haven't looked at Broadwell docs, so I'll let someone else take care of that.

I don't know if I agree with Chris that we should call these in the workaround
section, but whatever. init_clock_gating is equally sucky.
Kenneth Graunke Jan. 12, 2015, 12:05 a.m. UTC | #2
On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > This is an important optimization for avoiding read-after-write (RAW)
> > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > With this patch, they run at full speed even with HiZ.
> > 
> > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > (Iris Pro 6200).
> > 
> > Thanks to Jesse Barnes for finding this missing bit!
> > Thanks to Chris Wilson for helping me find where to set it.
> > 
> > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > ---
> >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > Here's an alternate patch which implements the workaround in the kernel
> > instead of Mesa.  It's probably better to do it there, since the kernel
> > does it on Haswell already.
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index dabc1d8..23020d6 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> >  
> > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > +	 *  buffer."
> > +	 *
> > +	 * This optimization is off by default for Broadwell; turn it on.
> > +	 */
> > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > +
> >  	/* Wa4x4STCOptimizationDisable:bdw */
> >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> >  			  HDC_FORCE_NON_COHERENT |
> >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> >  
> > +	/* According to the CACHE_MODE_0 default value documentation, some
> > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > +	 */
> > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > +
> >  	/* Improve HiZ throughput on CHV. */
> >  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> >  
> 
> I think you should do this as two separate patches, 1 per platform. For the BSW
> patch (given that I had the same functionality in the kernel patch I asked you
> to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> which we can use for the commit):
> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>

Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
and resubmit...

> I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> 
> I don't know if I agree with Chris that we should call these in the workaround
> section, but whatever. init_clock_gating is equally sucky.

init_clock_gating doesn't work.  The register writes don't stick and they have
no effect at all.  Setting them here makes them actually take effect in the
context.

--Ken
Ben Widawsky Jan. 12, 2015, 1:46 a.m. UTC | #3
On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> > On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > > This is an important optimization for avoiding read-after-write (RAW)
> > > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > > With this patch, they run at full speed even with HiZ.
> > > 
> > > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > > (Iris Pro 6200).
> > > 
> > > Thanks to Jesse Barnes for finding this missing bit!
> > > Thanks to Chris Wilson for helping me find where to set it.
> > > 
> > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > > ---
> > >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> > >  1 file changed, 15 insertions(+)
> > > 
> > > Here's an alternate patch which implements the workaround in the kernel
> > > instead of Mesa.  It's probably better to do it there, since the kernel
> > > does it on Haswell already.
> > > 
> > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > index dabc1d8..23020d6 100644
> > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> > >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> > >  
> > > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > > +	 *  buffer."
> > > +	 *
> > > +	 * This optimization is off by default for Broadwell; turn it on.
> > > +	 */
> > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > +
> > >  	/* Wa4x4STCOptimizationDisable:bdw */
> > >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> > >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> > >  			  HDC_FORCE_NON_COHERENT |
> > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> > >  
> > > +	/* According to the CACHE_MODE_0 default value documentation, some
> > > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > > +	 */
> > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > +
> > >  	/* Improve HiZ throughput on CHV. */
> > >  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> > >  
> > 
> > I think you should do this as two separate patches, 1 per platform. For the BSW
> > patch (given that I had the same functionality in the kernel patch I asked you
> > to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> > which we can use for the commit):
> > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> 
> Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> and resubmit...
> 

It's not my call, it's just nice to have platform specific bisection. And the
patch wasn't on the list, it was the one I kept asking you to look at in my
branch :-)

> > I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> > 
> > I don't know if I agree with Chris that we should call these in the workaround
> > section, but whatever. init_clock_gating is equally sucky.
> 
> init_clock_gating doesn't work.  The register writes don't stick and they have
> no effect at all.  Setting them here makes them actually take effect in the
> context.
> 
> --Ken

Separate thread now, but are you sure? We're setting at least two context
specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
important to performance).

AFAIK it should stick, and if it doesn't it's not expected behavior. Unless you
know something I do not?
Kenneth Graunke Jan. 12, 2015, 2:53 a.m. UTC | #4
On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
> On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> > On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> > > On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > > > This is an important optimization for avoiding read-after-write (RAW)
> > > > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > > > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > > > With this patch, they run at full speed even with HiZ.
> > > > 
> > > > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > > > (Iris Pro 6200).
> > > > 
> > > > Thanks to Jesse Barnes for finding this missing bit!
> > > > Thanks to Chris Wilson for helping me find where to set it.
> > > > 
> > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > > > ---
> > > >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> > > >  1 file changed, 15 insertions(+)
> > > > 
> > > > Here's an alternate patch which implements the workaround in the kernel
> > > > instead of Mesa.  It's probably better to do it there, since the kernel
> > > > does it on Haswell already.
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > index dabc1d8..23020d6 100644
> > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> > > >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> > > >  
> > > > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > > > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > > > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > > > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > > > +	 *  buffer."
> > > > +	 *
> > > > +	 * This optimization is off by default for Broadwell; turn it on.
> > > > +	 */
> > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > +
> > > >  	/* Wa4x4STCOptimizationDisable:bdw */
> > > >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> > > >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > > > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> > > >  			  HDC_FORCE_NON_COHERENT |
> > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> > > >  
> > > > +	/* According to the CACHE_MODE_0 default value documentation, some
> > > > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > > > +	 */
> > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > +
> > > >  	/* Improve HiZ throughput on CHV. */
> > > >  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> > > >  
> > > 
> > > I think you should do this as two separate patches, 1 per platform. For the BSW
> > > patch (given that I had the same functionality in the kernel patch I asked you
> > > to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> > > which we can use for the commit):
> > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > 
> > Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> > and resubmit...
> > 
> 
> It's not my call, it's just nice to have platform specific bisection. And the
> patch wasn't on the list, it was the one I kept asking you to look at in my
> branch :-)
>
> > > I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> > > 
> > > I don't know if I agree with Chris that we should call these in the workaround
> > > section, but whatever. init_clock_gating is equally sucky.
> > 
> > init_clock_gating doesn't work.  The register writes don't stick and they have
> > no effect at all.  Setting them here makes them actually take effect in the
> > context.
> > 
> > --Ken
> 
> Separate thread now, but are you sure? We're setting at least two context
> specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
> important to performance).
> 
> AFAIK it should stick, and if it doesn't it's not expected behavior. Unless you
> know something I do not?

Jesse had suggested setting it in broadwell_init_clock_gating on January 5th,
and Valtteri tried it on January 7th.  He found "no noticeable difference".
I tried it again, and confirmed his result: there was zero performance impact.

Setting it via an LRI in Mesa did have a performance impact.  I reverted my
Mesa patch, and tried setting it here, and it had the same performance impact.
I rebooted between kernels several times to confirm.  It works here, but it
doesn't there.

I'm pretty sure I confirmed the same result with this bit.  Feel free to try.

Perhaps we should move the rest of the per-context bits here instead of
*_init_clock_gating.  We should also confirm that the other bits are actually
having an effect.

I don't know why it works on Haswell, but it does there - the HiZ RAW stall
bit is set via haswell_init_clock_gating, and it's clearly having an impact.
Maybe it has something to do with the golden context, which is new on BDW.
But I'm probably wrong about that.  Setting it when a context is active does
seem more reliable...

--Ken
Kenneth Graunke Jan. 12, 2015, 3:05 a.m. UTC | #5
On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
> On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> > On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> > > On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > > > This is an important optimization for avoiding read-after-write (RAW)
> > > > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > > > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > > > With this patch, they run at full speed even with HiZ.
> > > > 
> > > > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > > > (Iris Pro 6200).
> > > > 
> > > > Thanks to Jesse Barnes for finding this missing bit!
> > > > Thanks to Chris Wilson for helping me find where to set it.
> > > > 
> > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > > > ---
> > > >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> > > >  1 file changed, 15 insertions(+)
> > > > 
> > > > Here's an alternate patch which implements the workaround in the kernel
> > > > instead of Mesa.  It's probably better to do it there, since the kernel
> > > > does it on Haswell already.
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > index dabc1d8..23020d6 100644
> > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> > > >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> > > >  
> > > > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > > > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > > > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > > > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > > > +	 *  buffer."
> > > > +	 *
> > > > +	 * This optimization is off by default for Broadwell; turn it on.
> > > > +	 */
> > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > +
> > > >  	/* Wa4x4STCOptimizationDisable:bdw */
> > > >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> > > >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > > > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> > > >  			  HDC_FORCE_NON_COHERENT |
> > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> > > >  
> > > > +	/* According to the CACHE_MODE_0 default value documentation, some
> > > > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > > > +	 */
> > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > +
> > > >  	/* Improve HiZ throughput on CHV. */
> > > >  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> > > >  
> > > 
> > > I think you should do this as two separate patches, 1 per platform. For the BSW
> > > patch (given that I had the same functionality in the kernel patch I asked you
> > > to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> > > which we can use for the commit):
> > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > 
> > Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> > and resubmit...
> > 
> 
> It's not my call, it's just nice to have platform specific bisection. And the
> patch wasn't on the list, it was the one I kept asking you to look at in my
> branch :-)
> 
> > > I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> > > 
> > > I don't know if I agree with Chris that we should call these in the workaround
> > > section, but whatever. init_clock_gating is equally sucky.
> > 
> > init_clock_gating doesn't work.  The register writes don't stick and they have
> > no effect at all.  Setting them here makes them actually take effect in the
> > context.
> > 
> > --Ken
> 
> Separate thread now, but are you sure? We're setting at least two context
> specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
> important to performance).

It looks like we're setting:

- [BDW] GAM_ECOCHK - 0x4090
- [BDW] CHICKEN_PAR1_1 - 0x42080
- [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
- [BDW, CHV] UCGCTL6 - 0x9430
- [BDW, CHV] FF_THREAD_MODE - 0x20a0
- [CHV] DSPCLK_GATE_D - display
- [CHV] MI_ARB_VLV - display
- [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
- [CHV] UCGCTL1 - 0x9400

I searched for all of these in the "Register State Context" tables for BDW
and CHV, and I didn't see any of them listed (including FF_THREAD_MODE or
0x20a0).  So I'm pretty sure these are not part of the context, and so they
should work.

--Ken
Ben Widawsky Jan. 12, 2015, 3:09 a.m. UTC | #6
On Sun, Jan 11, 2015 at 06:53:32PM -0800, Kenneth Graunke wrote:

[snip]

> 
> Jesse had suggested setting it in broadwell_init_clock_gating on January 5th,
> and Valtteri tried it on January 7th.  He found "no noticeable difference".
> I tried it again, and confirmed his result: there was zero performance impact.
> 
> Setting it via an LRI in Mesa did have a performance impact.  I reverted my
> Mesa patch, and tried setting it here, and it had the same performance impact.
> I rebooted between kernels several times to confirm.  It works here, but it
> doesn't there.
> 
> I'm pretty sure I confirmed the same result with this bit.  Feel free to try.
> 

That's okay. I believe you, I just thought you may have known something I didn't.

> Perhaps we should move the rest of the per-context bits here instead of
> *_init_clock_gating.  We should also confirm that the other bits are actually
> having an effect.

If this is the behavior we're getting, we should absolutely do this.

> 
> I don't know why it works on Haswell, but it does there - the HiZ RAW stall
> bit is set via haswell_init_clock_gating, and it's clearly having an impact.
> Maybe it has something to do with the golden context, which is new on BDW.
> But I'm probably wrong about that.  Setting it when a context is active does
> seem more reliable...

The golden context stuff has been backported to all gens supported hardware
contexts and/or execlists, so I don't think it's that.

The big difference between the workarounds and init clock gating is the former
is done via LRI, and the latter MMIO. (also, the latter is run again on resume,
and I don't think the former is).

Thinking outloud - what's the default setting for execlists on BDW now?  For
execlists my plan (when it was my plan to have) had always been to manually set
the register in the context image before loading it. We don't do that with the
existing code, we use the old ringbuffer style of, hope it preserves the
contents. I wonder if that's the distinction between HSW.

> 
> --Ken
Ben Widawsky Jan. 12, 2015, 3:14 a.m. UTC | #7
On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
> On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
> > On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> > > On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> > > > On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > > > > This is an important optimization for avoiding read-after-write (RAW)
> > > > > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > > > > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > > > > With this patch, they run at full speed even with HiZ.
> > > > > 
> > > > > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > > > > (Iris Pro 6200).
> > > > > 
> > > > > Thanks to Jesse Barnes for finding this missing bit!
> > > > > Thanks to Chris Wilson for helping me find where to set it.
> > > > > 
> > > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> > > > >  1 file changed, 15 insertions(+)
> > > > > 
> > > > > Here's an alternate patch which implements the workaround in the kernel
> > > > > instead of Mesa.  It's probably better to do it there, since the kernel
> > > > > does it on Haswell already.
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > index dabc1d8..23020d6 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> > > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> > > > >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> > > > >  
> > > > > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > > > > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > > > > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > > > > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > > > > +	 *  buffer."
> > > > > +	 *
> > > > > +	 * This optimization is off by default for Broadwell; turn it on.
> > > > > +	 */
> > > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > > +
> > > > >  	/* Wa4x4STCOptimizationDisable:bdw */
> > > > >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> > > > >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > > > > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> > > > >  			  HDC_FORCE_NON_COHERENT |
> > > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> > > > >  
> > > > > +	/* According to the CACHE_MODE_0 default value documentation, some
> > > > > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > > > > +	 */
> > > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > > +
> > > > >  	/* Improve HiZ throughput on CHV. */
> > > > >  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> > > > >  
> > > > 
> > > > I think you should do this as two separate patches, 1 per platform. For the BSW
> > > > patch (given that I had the same functionality in the kernel patch I asked you
> > > > to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> > > > which we can use for the commit):
> > > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > 
> > > Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> > > and resubmit...
> > > 
> > 
> > It's not my call, it's just nice to have platform specific bisection. And the
> > patch wasn't on the list, it was the one I kept asking you to look at in my
> > branch :-)
> > 
> > > > I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> > > > 
> > > > I don't know if I agree with Chris that we should call these in the workaround
> > > > section, but whatever. init_clock_gating is equally sucky.
> > > 
> > > init_clock_gating doesn't work.  The register writes don't stick and they have
> > > no effect at all.  Setting them here makes them actually take effect in the
> > > context.
> > > 
> > > --Ken
> > 
> > Separate thread now, but are you sure? We're setting at least two context
> > specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
> > important to performance).
> 
> It looks like we're setting:
> 
> - [BDW] GAM_ECOCHK - 0x4090

ECO registers are never ctx, I think

> - [BDW] CHICKEN_PAR1_1 - 0x42080

Diplay registers are never

> - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050

dword offset 0x1c in the context image

> - [BDW, CHV] UCGCTL6 - 0x9430
> - [BDW, CHV] FF_THREAD_MODE - 0x20a0

dword offset 0x2a in the context image

> - [CHV] DSPCLK_GATE_D - display
> - [CHV] MI_ARB_VLV - display

More display...

> - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050

Kinda surprised this one isn't there. I'm not sure how it can work correctly.

> - [CHV] UCGCTL1 - 0x9400
> 
> I searched for all of these in the "Register State Context" tables for BDW
> and CHV, and I didn't see any of them listed (including FF_THREAD_MODE or
> 0x20a0).  So I'm pretty sure these are not part of the context, and so they
> should work.

We must be looking at different docs.

> 
> --Ken
Ville Syrjälä Jan. 12, 2015, 12:02 p.m. UTC | #8
On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
> On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
> > On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
> > > On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> > > > On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> > > > > On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > > > > > This is an important optimization for avoiding read-after-write (RAW)
> > > > > > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > > > > > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > > > > > With this patch, they run at full speed even with HiZ.
> > > > > > 
> > > > > > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > > > > > (Iris Pro 6200).
> > > > > > 
> > > > > > Thanks to Jesse Barnes for finding this missing bit!
> > > > > > Thanks to Chris Wilson for helping me find where to set it.
> > > > > > 
> > > > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > > > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> > > > > >  1 file changed, 15 insertions(+)
> > > > > > 
> > > > > > Here's an alternate patch which implements the workaround in the kernel
> > > > > > instead of Mesa.  It's probably better to do it there, since the kernel
> > > > > > does it on Haswell already.
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > index dabc1d8..23020d6 100644
> > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> > > > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> > > > > >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> > > > > >  
> > > > > > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > > > > > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > > > > > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > > > > > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > > > > > +	 *  buffer."
> > > > > > +	 *
> > > > > > +	 * This optimization is off by default for Broadwell; turn it on.
> > > > > > +	 */
> > > > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > > > +
> > > > > >  	/* Wa4x4STCOptimizationDisable:bdw */
> > > > > >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> > > > > >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > > > > > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> > > > > >  			  HDC_FORCE_NON_COHERENT |
> > > > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> > > > > >  
> > > > > > +	/* According to the CACHE_MODE_0 default value documentation, some
> > > > > > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > > > > > +	 */
> > > > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > > > +
> > > > > >  	/* Improve HiZ throughput on CHV. */
> > > > > >  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> > > > > >  
> > > > > 
> > > > > I think you should do this as two separate patches, 1 per platform. For the BSW
> > > > > patch (given that I had the same functionality in the kernel patch I asked you
> > > > > to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> > > > > which we can use for the commit):
> > > > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > > 
> > > > Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> > > > and resubmit...
> > > > 
> > > 
> > > It's not my call, it's just nice to have platform specific bisection. And the
> > > patch wasn't on the list, it was the one I kept asking you to look at in my
> > > branch :-)
> > > 
> > > > > I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> > > > > 
> > > > > I don't know if I agree with Chris that we should call these in the workaround
> > > > > section, but whatever. init_clock_gating is equally sucky.
> > > > 
> > > > init_clock_gating doesn't work.  The register writes don't stick and they have
> > > > no effect at all.  Setting them here makes them actually take effect in the
> > > > context.
> > > > 
> > > > --Ken
> > > 
> > > Separate thread now, but are you sure? We're setting at least two context
> > > specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
> > > important to performance).
> > 
> > It looks like we're setting:
> > - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
> 
> dword offset 0x1c in the context image

power context, not logical context

> > - [BDW, CHV] FF_THREAD_MODE - 0x20a0
> 
> dword offset 0x2a in the context image

Also power context

> > - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
> 
> Kinda surprised this one isn't there. I'm not sure how it can work correctly.

We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
case it's a VCS register. Sadly I've not found any documentation for !RCS
power context, but I'm assuming every engine has a power context of some sort.
Ville Syrjälä Jan. 12, 2015, 12:32 p.m. UTC | #9
On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> This is an important optimization for avoiding read-after-write (RAW)
> stalls in the HiZ buffer.  Certain workloads would run very slowly with
> HiZ enabled, but run much faster with the "hiz=false" driconf option.
> With this patch, they run at full speed even with HiZ.
> 
> Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> (Iris Pro 6200).
> 
> Thanks to Jesse Barnes for finding this missing bit!
> Thanks to Chris Wilson for helping me find where to set it.
> 
> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> ---
>  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> Here's an alternate patch which implements the workaround in the kernel
> instead of Mesa.  It's probably better to do it there, since the kernel
> does it on Haswell already.
> 
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index dabc1d8..23020d6 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
>  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
>  
> +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> +	 *  buffer."
> +	 *
> +	 * This optimization is off by default for Broadwell; turn it on.
> +	 */
> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> +
>  	/* Wa4x4STCOptimizationDisable:bdw */
>  	WA_SET_BIT_MASKED(CACHE_MODE_1,
>  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
>  			  HDC_FORCE_NON_COHERENT |
>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
>  
> +	/* According to the CACHE_MODE_0 default value documentation, some
> +	 * CHV platforms disable this optimization by default.  Turn it on.
> +	 */
> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> +

I remember looking at this when the HSW version was done, and at the
time the docs claimed that the default value on gen8 was already good.
Looks like the docs have been updated since then. Also my BSW agrees
that the disable bit is now set by default.

I suppose we could assume that since it works on HSW, it'll work on
gen8. However, I find it a bit suspicious that the later steppings seem
to have changed the default to disable the optimization. IVB suffered
from real problems with the optimization enabled and hence the IVB patch
was reverted. IIRC Chia-I wrote some kind of test to demonstrate the
problem on IVB, so maybe you want to run it and make sure it still
works correctly on gen8. Here's the test:
http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/35399

>  	/* Improve HiZ throughput on CHV. */
>  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
>  
> -- 
> 2.2.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Ben Widawsky Jan. 12, 2015, 6:02 p.m. UTC | #10
On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
> On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
> > On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
> > > On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
> > > > On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> > > > > On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> > > > > > On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > > > > > > This is an important optimization for avoiding read-after-write (RAW)
> > > > > > > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > > > > > > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > > > > > > With this patch, they run at full speed even with HiZ.
> > > > > > > 
> > > > > > > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > > > > > > (Iris Pro 6200).
> > > > > > > 
> > > > > > > Thanks to Jesse Barnes for finding this missing bit!
> > > > > > > Thanks to Chris Wilson for helping me find where to set it.
> > > > > > > 
> > > > > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > > > > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> > > > > > >  1 file changed, 15 insertions(+)
> > > > > > > 
> > > > > > > Here's an alternate patch which implements the workaround in the kernel
> > > > > > > instead of Mesa.  It's probably better to do it there, since the kernel
> > > > > > > does it on Haswell already.
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > index dabc1d8..23020d6 100644
> > > > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > > > > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> > > > > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> > > > > > >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> > > > > > >  
> > > > > > > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > > > > > > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > > > > > > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > > > > > > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > > > > > > +	 *  buffer."
> > > > > > > +	 *
> > > > > > > +	 * This optimization is off by default for Broadwell; turn it on.
> > > > > > > +	 */
> > > > > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > > > > +
> > > > > > >  	/* Wa4x4STCOptimizationDisable:bdw */
> > > > > > >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> > > > > > >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > > > > > > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> > > > > > >  			  HDC_FORCE_NON_COHERENT |
> > > > > > >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> > > > > > >  
> > > > > > > +	/* According to the CACHE_MODE_0 default value documentation, some
> > > > > > > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > > > > > > +	 */
> > > > > > > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > > > > > > +
> > > > > > >  	/* Improve HiZ throughput on CHV. */
> > > > > > >  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> > > > > > >  
> > > > > > 
> > > > > > I think you should do this as two separate patches, 1 per platform. For the BSW
> > > > > > patch (given that I had the same functionality in the kernel patch I asked you
> > > > > > to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> > > > > > which we can use for the commit):
> > > > > > Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > > > > 
> > > > > Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> > > > > and resubmit...
> > > > > 
> > > > 
> > > > It's not my call, it's just nice to have platform specific bisection. And the
> > > > patch wasn't on the list, it was the one I kept asking you to look at in my
> > > > branch :-)
> > > > 
> > > > > > I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> > > > > > 
> > > > > > I don't know if I agree with Chris that we should call these in the workaround
> > > > > > section, but whatever. init_clock_gating is equally sucky.
> > > > > 
> > > > > init_clock_gating doesn't work.  The register writes don't stick and they have
> > > > > no effect at all.  Setting them here makes them actually take effect in the
> > > > > context.
> > > > > 
> > > > > --Ken
> > > > 
> > > > Separate thread now, but are you sure? We're setting at least two context
> > > > specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
> > > > important to performance).
> > > 
> > > It looks like we're setting:
> > > - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
> > 
> > dword offset 0x1c in the context image
> 
> power context, not logical context
> 
> > > - [BDW, CHV] FF_THREAD_MODE - 0x20a0
> > 
> > dword offset 0x2a in the context image
> 
> Also power context
> 
> > > - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
> > 
> > Kinda surprised this one isn't there. I'm not sure how it can work correctly.
> 
> We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
> case it's a VCS register. Sadly I've not found any documentation for !RCS
> power context, but I'm assuming every engine has a power context of some sort.
> 

Yeah, Ken and I resolved this offline. Any idea why the bits don't stick when
written via MMIO?

> -- 
> Ville Syrjälä
> Intel OTC
Dave Gordon Jan. 12, 2015, 6:09 p.m. UTC | #11
On 12/01/15 18:02, Ben Widawsky wrote:
> On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
>> On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
>>> On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
>>>> On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
>>>>> On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
>>>>>> On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
>>>>>>> On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
>>>>>>>> This is an important optimization for avoiding read-after-write (RAW)
>>>>>>>> stalls in the HiZ buffer.  Certain workloads would run very slowly with
>>>>>>>> HiZ enabled, but run much faster with the "hiz=false" driconf option.
>>>>>>>> With this patch, they run at full speed even with HiZ.
>>>>>>>>
>>>>>>>> Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
>>>>>>>> (Iris Pro 6200).
>>>>>>>>
>>>>>>>> Thanks to Jesse Barnes for finding this missing bit!
>>>>>>>> Thanks to Chris Wilson for helping me find where to set it.
>>>>>>>>
>>>>>>>> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
>>>>>>>> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
>>>>>>>>  1 file changed, 15 insertions(+)
>>>>>>>>
>>>>>>>> Here's an alternate patch which implements the workaround in the kernel
>>>>>>>> instead of Mesa.  It's probably better to do it there, since the kernel
>>>>>>>> does it on Haswell already.
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
>>>>>>>> index dabc1d8..23020d6 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
>>>>>>>> @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
>>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
>>>>>>>>  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
>>>>>>>>  
>>>>>>>> +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
>>>>>>>> +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
>>>>>>>> +	 *  polygons in the same 8x4 pixel/sample area to be processed without
>>>>>>>> +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
>>>>>>>> +	 *  buffer."
>>>>>>>> +	 *
>>>>>>>> +	 * This optimization is off by default for Broadwell; turn it on.
>>>>>>>> +	 */
>>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
>>>>>>>> +
>>>>>>>>  	/* Wa4x4STCOptimizationDisable:bdw */
>>>>>>>>  	WA_SET_BIT_MASKED(CACHE_MODE_1,
>>>>>>>>  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
>>>>>>>> @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
>>>>>>>>  			  HDC_FORCE_NON_COHERENT |
>>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
>>>>>>>>  
>>>>>>>> +	/* According to the CACHE_MODE_0 default value documentation, some
>>>>>>>> +	 * CHV platforms disable this optimization by default.  Turn it on.
>>>>>>>> +	 */
>>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
>>>>>>>> +
>>>>>>>>  	/* Improve HiZ throughput on CHV. */
>>>>>>>>  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
>>>>>>>>  
>>>>>>>
>>>>>>> I think you should do this as two separate patches, 1 per platform. For the BSW
>>>>>>> patch (given that I had the same functionality in the kernel patch I asked you
>>>>>>> to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
>>>>>>> which we can use for the commit):
>>>>>>> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
>>>>>>
>>>>>> Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
>>>>>> and resubmit...
>>>>>>
>>>>>
>>>>> It's not my call, it's just nice to have platform specific bisection. And the
>>>>> patch wasn't on the list, it was the one I kept asking you to look at in my
>>>>> branch :-)
>>>>>
>>>>>>> I haven't looked at Broadwell docs, so I'll let someone else take care of that.
>>>>>>>
>>>>>>> I don't know if I agree with Chris that we should call these in the workaround
>>>>>>> section, but whatever. init_clock_gating is equally sucky.
>>>>>>
>>>>>> init_clock_gating doesn't work.  The register writes don't stick and they have
>>>>>> no effect at all.  Setting them here makes them actually take effect in the
>>>>>> context.
>>>>>>
>>>>>> --Ken
>>>>>
>>>>> Separate thread now, but are you sure? We're setting at least two context
>>>>> specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
>>>>> important to performance).
>>>>
>>>> It looks like we're setting:
>>>> - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
>>>
>>> dword offset 0x1c in the context image
>>
>> power context, not logical context
>>
>>>> - [BDW, CHV] FF_THREAD_MODE - 0x20a0
>>>
>>> dword offset 0x2a in the context image
>>
>> Also power context
>>
>>>> - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
>>>
>>> Kinda surprised this one isn't there. I'm not sure how it can work correctly.
>>
>> We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
>> case it's a VCS register. Sadly I've not found any documentation for !RCS
>> power context, but I'm assuming every engine has a power context of some sort.
>>
> 
> Yeah, Ken and I resolved this offline. Any idea why the bits don't stick when
> written via MMIO?
> 
>> -- 
>> Ville Syrjälä
>> Intel OTC

Doesn't BSpec say writing via MMIO is unreliable if the engine is
running (for some definition of 'running')? Do they stick even
temporarily? If we read them back just after writing, do the MMIO
registers have the expected values? Or are they lost only after a
context switch (or some other event)?

.Dave.
Kenneth Graunke Jan. 12, 2015, 9:41 p.m. UTC | #12
On Monday, January 12, 2015 02:32:20 PM Ville Syrjälä wrote:
> On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > This is an important optimization for avoiding read-after-write (RAW)
> > stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > With this patch, they run at full speed even with HiZ.
> > 
> > Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > (Iris Pro 6200).
> > 
> > Thanks to Jesse Barnes for finding this missing bit!
> > Thanks to Chris Wilson for helping me find where to set it.
> > 
> > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > ---
> >  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > Here's an alternate patch which implements the workaround in the kernel
> > instead of Mesa.  It's probably better to do it there, since the kernel
> > does it on Haswell already.
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index dabc1d8..23020d6 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> >  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> >  
> > +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > +	 *  buffer."
> > +	 *
> > +	 * This optimization is off by default for Broadwell; turn it on.
> > +	 */
> > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > +
> >  	/* Wa4x4STCOptimizationDisable:bdw */
> >  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> >  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> >  			  HDC_FORCE_NON_COHERENT |
> >  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> >  
> > +	/* According to the CACHE_MODE_0 default value documentation, some
> > +	 * CHV platforms disable this optimization by default.  Turn it on.
> > +	 */
> > +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > +
> 
> I remember looking at this when the HSW version was done, and at the
> time the docs claimed that the default value on gen8 was already good.
> Looks like the docs have been updated since then. Also my BSW agrees
> that the disable bit is now set by default.
> 
> I suppose we could assume that since it works on HSW, it'll work on
> gen8. However, I find it a bit suspicious that the later steppings seem
> to have changed the default to disable the optimization. IVB suffered
> from real problems with the optimization enabled and hence the IVB patch
> was reverted. IIRC Chia-I wrote some kind of test to demonstrate the
> problem on IVB, so maybe you want to run it and make sure it still
> works correctly on gen8. Here's the test:
> http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/35399

I just ran Chia-I's program on BSW, and it appears to be working fine.
Classic swrast and i965/hardware both produced an identical image.

--Ken
Ben Widawsky Jan. 13, 2015, 2:07 a.m. UTC | #13
On Mon, Jan 12, 2015 at 06:09:12PM +0000, Dave Gordon wrote:
> On 12/01/15 18:02, Ben Widawsky wrote:
> > On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
> >> On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
> >>> On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
> >>>> On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
> >>>>> On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> >>>>>> On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> >>>>>>> On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> >>>>>>>> This is an important optimization for avoiding read-after-write (RAW)
> >>>>>>>> stalls in the HiZ buffer.  Certain workloads would run very slowly with
> >>>>>>>> HiZ enabled, but run much faster with the "hiz=false" driconf option.
> >>>>>>>> With this patch, they run at full speed even with HiZ.
> >>>>>>>>
> >>>>>>>> Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> >>>>>>>> (Iris Pro 6200).
> >>>>>>>>
> >>>>>>>> Thanks to Jesse Barnes for finding this missing bit!
> >>>>>>>> Thanks to Chris Wilson for helping me find where to set it.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> >>>>>>>> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> >>>>>>>> ---
> >>>>>>>>  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> >>>>>>>>  1 file changed, 15 insertions(+)
> >>>>>>>>
> >>>>>>>> Here's an alternate patch which implements the workaround in the kernel
> >>>>>>>> instead of Mesa.  It's probably better to do it there, since the kernel
> >>>>>>>> does it on Haswell already.
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>>>>>>> index dabc1d8..23020d6 100644
> >>>>>>>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>>>>>>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>>>>>>> @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> >>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> >>>>>>>>  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> >>>>>>>>  
> >>>>>>>> +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> >>>>>>>> +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> >>>>>>>> +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> >>>>>>>> +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> >>>>>>>> +	 *  buffer."
> >>>>>>>> +	 *
> >>>>>>>> +	 * This optimization is off by default for Broadwell; turn it on.
> >>>>>>>> +	 */
> >>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> >>>>>>>> +
> >>>>>>>>  	/* Wa4x4STCOptimizationDisable:bdw */
> >>>>>>>>  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> >>>>>>>>  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> >>>>>>>> @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> >>>>>>>>  			  HDC_FORCE_NON_COHERENT |
> >>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> >>>>>>>>  
> >>>>>>>> +	/* According to the CACHE_MODE_0 default value documentation, some
> >>>>>>>> +	 * CHV platforms disable this optimization by default.  Turn it on.
> >>>>>>>> +	 */
> >>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> >>>>>>>> +
> >>>>>>>>  	/* Improve HiZ throughput on CHV. */
> >>>>>>>>  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> >>>>>>>>  
> >>>>>>>
> >>>>>>> I think you should do this as two separate patches, 1 per platform. For the BSW
> >>>>>>> patch (given that I had the same functionality in the kernel patch I asked you
> >>>>>>> to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> >>>>>>> which we can use for the commit):
> >>>>>>> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> >>>>>>
> >>>>>> Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> >>>>>> and resubmit...
> >>>>>>
> >>>>>
> >>>>> It's not my call, it's just nice to have platform specific bisection. And the
> >>>>> patch wasn't on the list, it was the one I kept asking you to look at in my
> >>>>> branch :-)
> >>>>>
> >>>>>>> I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> >>>>>>>
> >>>>>>> I don't know if I agree with Chris that we should call these in the workaround
> >>>>>>> section, but whatever. init_clock_gating is equally sucky.
> >>>>>>
> >>>>>> init_clock_gating doesn't work.  The register writes don't stick and they have
> >>>>>> no effect at all.  Setting them here makes them actually take effect in the
> >>>>>> context.
> >>>>>>
> >>>>>> --Ken
> >>>>>
> >>>>> Separate thread now, but are you sure? We're setting at least two context
> >>>>> specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
> >>>>> important to performance).
> >>>>
> >>>> It looks like we're setting:
> >>>> - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
> >>>
> >>> dword offset 0x1c in the context image
> >>
> >> power context, not logical context
> >>
> >>>> - [BDW, CHV] FF_THREAD_MODE - 0x20a0
> >>>
> >>> dword offset 0x2a in the context image
> >>
> >> Also power context
> >>
> >>>> - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
> >>>
> >>> Kinda surprised this one isn't there. I'm not sure how it can work correctly.
> >>
> >> We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
> >> case it's a VCS register. Sadly I've not found any documentation for !RCS
> >> power context, but I'm assuming every engine has a power context of some sort.
> >>
> > 
> > Yeah, Ken and I resolved this offline. Any idea why the bits don't stick when
> > written via MMIO?
> > 
> >> -- 
> >> Ville Syrjälä
> >> Intel OTC
> 
> Doesn't BSpec say writing via MMIO is unreliable if the engine is
> running (for some definition of 'running')? Do they stick even
> temporarily? If we read them back just after writing, do the MMIO
> registers have the expected values? Or are they lost only after a
> context switch (or some other event)?
> 
> .Dave.
> 

I don't recall ever seeing it's saying. Anecdotally, I've heard a Windows driver
developer say that once. I'm not sure what Ken did to verify, my guess is LRI
after the driver was running. Surely certain registers have to be reliable or
things like tail/ELSP writes would be problematic.

I have questioned if we're doing this at all correctly for execlists in an
earlier mail:
> Thinking outloud - what's the default setting for execlists on BDW now?  For
> execlists my plan (when it was my plan to have) had always been to manually set
> the register in the context image before loading it. We don't do that with the
> existing code, we use the old ringbuffer style of, hope it preserves the
> contents. I wonder if that's the distinction between HSW.

If you do not do it this way, the first time you load the context you will wipe
out any of the context save/restored registers (of which CACHE_MODE_0 is one). I
didn't look at the code to see if we do the right thing. I hope we do.

Anyway, I think the decision still stands that we need to move any workaround
that is critical should be moved to an LRI. Unless someone can sort out
how/why/when this fails.
Ville Syrjälä Jan. 13, 2015, 8:03 p.m. UTC | #14
On Mon, Jan 12, 2015 at 06:07:26PM -0800, Ben Widawsky wrote:
> On Mon, Jan 12, 2015 at 06:09:12PM +0000, Dave Gordon wrote:
> > On 12/01/15 18:02, Ben Widawsky wrote:
> > > On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
> > >> On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
> > >>> On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
> > >>>> On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
> > >>>>> On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
> > >>>>>> On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
> > >>>>>>> On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
> > >>>>>>>> This is an important optimization for avoiding read-after-write (RAW)
> > >>>>>>>> stalls in the HiZ buffer.  Certain workloads would run very slowly with
> > >>>>>>>> HiZ enabled, but run much faster with the "hiz=false" driconf option.
> > >>>>>>>> With this patch, they run at full speed even with HiZ.
> > >>>>>>>>
> > >>>>>>>> Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
> > >>>>>>>> (Iris Pro 6200).
> > >>>>>>>>
> > >>>>>>>> Thanks to Jesse Barnes for finding this missing bit!
> > >>>>>>>> Thanks to Chris Wilson for helping me find where to set it.
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > >>>>>>>> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > >>>>>>>> ---
> > >>>>>>>>  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++++++++++++++
> > >>>>>>>>  1 file changed, 15 insertions(+)
> > >>>>>>>>
> > >>>>>>>> Here's an alternate patch which implements the workaround in the kernel
> > >>>>>>>> instead of Mesa.  It's probably better to do it there, since the kernel
> > >>>>>>>> does it on Haswell already.
> > >>>>>>>>
> > >>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > >>>>>>>> index dabc1d8..23020d6 100644
> > >>>>>>>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > >>>>>>>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > >>>>>>>> @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
> > >>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
> > >>>>>>>>  			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
> > >>>>>>>>  
> > >>>>>>>> +	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
> > >>>>>>>> +	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
> > >>>>>>>> +	 *  polygons in the same 8x4 pixel/sample area to be processed without
> > >>>>>>>> +	 *  stalling waiting for the earlier ones to write to Hierarchical Z
> > >>>>>>>> +	 *  buffer."
> > >>>>>>>> +	 *
> > >>>>>>>> +	 * This optimization is off by default for Broadwell; turn it on.
> > >>>>>>>> +	 */
> > >>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > >>>>>>>> +
> > >>>>>>>>  	/* Wa4x4STCOptimizationDisable:bdw */
> > >>>>>>>>  	WA_SET_BIT_MASKED(CACHE_MODE_1,
> > >>>>>>>>  			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
> > >>>>>>>> @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
> > >>>>>>>>  			  HDC_FORCE_NON_COHERENT |
> > >>>>>>>>  			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
> > >>>>>>>>  
> > >>>>>>>> +	/* According to the CACHE_MODE_0 default value documentation, some
> > >>>>>>>> +	 * CHV platforms disable this optimization by default.  Turn it on.
> > >>>>>>>> +	 */
> > >>>>>>>> +	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
> > >>>>>>>> +
> > >>>>>>>>  	/* Improve HiZ throughput on CHV. */
> > >>>>>>>>  	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
> > >>>>>>>>  
> > >>>>>>>
> > >>>>>>> I think you should do this as two separate patches, 1 per platform. For the BSW
> > >>>>>>> patch (given that I had the same functionality in the kernel patch I asked you
> > >>>>>>> to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
> > >>>>>>> which we can use for the commit):
> > >>>>>>> Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
> > >>>>>>
> > >>>>>> Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
> > >>>>>> and resubmit...
> > >>>>>>
> > >>>>>
> > >>>>> It's not my call, it's just nice to have platform specific bisection. And the
> > >>>>> patch wasn't on the list, it was the one I kept asking you to look at in my
> > >>>>> branch :-)
> > >>>>>
> > >>>>>>> I haven't looked at Broadwell docs, so I'll let someone else take care of that.
> > >>>>>>>
> > >>>>>>> I don't know if I agree with Chris that we should call these in the workaround
> > >>>>>>> section, but whatever. init_clock_gating is equally sucky.
> > >>>>>>
> > >>>>>> init_clock_gating doesn't work.  The register writes don't stick and they have
> > >>>>>> no effect at all.  Setting them here makes them actually take effect in the
> > >>>>>> context.
> > >>>>>>
> > >>>>>> --Ken
> > >>>>>
> > >>>>> Separate thread now, but are you sure? We're setting at least two context
> > >>>>> specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
> > >>>>> important to performance).
> > >>>>
> > >>>> It looks like we're setting:
> > >>>> - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
> > >>>
> > >>> dword offset 0x1c in the context image
> > >>
> > >> power context, not logical context
> > >>
> > >>>> - [BDW, CHV] FF_THREAD_MODE - 0x20a0
> > >>>
> > >>> dword offset 0x2a in the context image
> > >>
> > >> Also power context
> > >>
> > >>>> - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
> > >>>
> > >>> Kinda surprised this one isn't there. I'm not sure how it can work correctly.
> > >>
> > >> We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
> > >> case it's a VCS register. Sadly I've not found any documentation for !RCS
> > >> power context, but I'm assuming every engine has a power context of some sort.
> > >>
> > > 
> > > Yeah, Ken and I resolved this offline. Any idea why the bits don't stick when
> > > written via MMIO?
> > > 
> > >> -- 
> > >> Ville Syrjälä
> > >> Intel OTC
> > 
> > Doesn't BSpec say writing via MMIO is unreliable if the engine is
> > running (for some definition of 'running')? Do they stick even
> > temporarily? If we read them back just after writing, do the MMIO
> > registers have the expected values? Or are they lost only after a
> > context switch (or some other event)?
> > 
> > .Dave.
> > 
> 
> I don't recall ever seeing it's saying. Anecdotally, I've heard a Windows driver
> developer say that once. I'm not sure what Ken did to verify, my guess is LRI
> after the driver was running. Surely certain registers have to be reliable or
> things like tail/ELSP writes would be problematic.
> 
> I have questioned if we're doing this at all correctly for execlists in an
> earlier mail:
> > Thinking outloud - what's the default setting for execlists on BDW now?  For
> > execlists my plan (when it was my plan to have) had always been to manually set
> > the register in the context image before loading it. We don't do that with the
> > existing code, we use the old ringbuffer style of, hope it preserves the
> > contents. I wonder if that's the distinction between HSW.
> 
> If you do not do it this way, the first time you load the context you will wipe
> out any of the context save/restored registers (of which CACHE_MODE_0 is one). I
> didn't look at the code to see if we do the right thing. I hope we do.
> 
> Anyway, I think the decision still stands that we need to move any workaround
> that is critical should be moved to an LRI. Unless someone can sort out
> how/why/when this fails.

BSpec says you need to use LRI to make sure there's an active
context when the register write occurs. My interpretation is that the
context restore is asynchronous, and so might not have happened yet by
the time the GT wakes up from rc6. And Bspec also seems to say that
with execlist there isn't any active context unless one was actually
submitted to ELSP and is still running, so frobbing with the register
via mmio is generally a no-no.

There is a specific magic sequence to disable CS idle messaging if you
really want to use mmio for debug purposes though. Long ago I had a
patch which tried to implement some of that, and IIRC I even offered
the patch to you when you were looking at some missing register value
weirdness. I also tried to peddle it to Mika at some point. No one
took the bait so far, but I still think we should implement it for
gem_workarounds. Currently gem_workarounds just submits some work to
the GPU and waits for it to finish after grabbing forcewake to make
sure the context became active. I'm not at all sure that guarantees
that the context remains active, at least with execlists.

In any case I believe all of this only applies to the registers in
the logical context, and everything in the power context should be
fine as long as you remember to forcewake.
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index dabc1d8..23020d6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -796,6 +796,16 @@  static int bdw_init_workarounds(struct intel_engine_cs *ring)
 			  HDC_DONOT_FETCH_MEM_WHEN_MASKED |
 			  (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
 
+	/* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
+	 * "The Hierarchical Z RAW Stall Optimization allows non-overlapping
+	 *  polygons in the same 8x4 pixel/sample area to be processed without
+	 *  stalling waiting for the earlier ones to write to Hierarchical Z
+	 *  buffer."
+	 *
+	 * This optimization is off by default for Broadwell; turn it on.
+	 */
+	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
+
 	/* Wa4x4STCOptimizationDisable:bdw */
 	WA_SET_BIT_MASKED(CACHE_MODE_1,
 			  GEN8_4x4_STC_OPTIMIZATION_DISABLE);
@@ -836,6 +846,11 @@  static int chv_init_workarounds(struct intel_engine_cs *ring)
 			  HDC_FORCE_NON_COHERENT |
 			  HDC_DONOT_FETCH_MEM_WHEN_MASKED);
 
+	/* According to the CACHE_MODE_0 default value documentation, some
+	 * CHV platforms disable this optimization by default.  Turn it on.
+	 */
+	WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
+
 	/* Improve HiZ throughput on CHV. */
 	WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);