diff mbox

drm/i915: Make sample_c messages go faster on Haswell.

Message ID 1414620763-2841-1-git-send-email-kenneth@whitecape.org (mailing list archive)
State New, archived
Headers show

Commit Message

Kenneth Graunke Oct. 29, 2014, 10:12 p.m. UTC
Haswell significantly improved the performance of sampler_c messages,
but the optimization appears to be off by default.  Later platforms
remove this bit, and apparently always enable the optimization.

Improves performance in "Counter Strike: Global Offensive" by 18%
at default settings on Iris Pro.  No Piglit regressions.

Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
---
 drivers/gpu/drm/i915/i915_reg.h | 1 +
 drivers/gpu/drm/i915/intel_pm.c | 4 ++++
 2 files changed, 5 insertions(+)

Comments

Ville Syrjälä Oct. 30, 2014, 8:50 a.m. UTC | #1
On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> Haswell significantly improved the performance of sampler_c messages,
> but the optimization appears to be off by default.  Later platforms
> remove this bit, and apparently always enable the optimization.
> 
> Improves performance in "Counter Strike: Global Offensive" by 18%
> at default settings on Iris Pro.  No Piglit regressions.

Nice. We need more bits like this ;)

> 
> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> ---
>  drivers/gpu/drm/i915/i915_reg.h | 1 +
>  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 77fce96..340821a 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -5952,6 +5952,7 @@ enum punit_power_well {
>  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
>  
>  #define HALF_SLICE_CHICKEN3		0xe184
> +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
>  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
>  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
>  
> diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> index 7a69eba..50c72a7 100644
> --- a/drivers/gpu/drm/i915/intel_pm.c
> +++ b/drivers/gpu/drm/i915/intel_pm.c
> @@ -5736,6 +5736,10 @@ static void haswell_init_clock_gating(struct drm_device *dev)
>  	I915_WRITE(GEN7_GT_MODE,
>  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
>  
> +	/* Make sample_c messages faster. */

I found a name for it in the w/a database.

WaSampleCChickenBitEnable:hsw

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

> +	I915_WRITE(HALF_SLICE_CHICKEN3,
> +		   _MASKED_BIT_ENABLE(HSW_SAMPLE_C_PERFORMANCE));
> +
>  	/* WaSwitchSolVfFArbitrationPriority:hsw */
>  	I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) | HSW_ECOCHK_ARB_PRIO_SOL);
>  
> -- 
> 2.1.2
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Ville Syrjälä Oct. 30, 2014, 9 a.m. UTC | #2
On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
> On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> > Haswell significantly improved the performance of sampler_c messages,
> > but the optimization appears to be off by default.  Later platforms
> > remove this bit, and apparently always enable the optimization.
> > 
> > Improves performance in "Counter Strike: Global Offensive" by 18%
> > at default settings on Iris Pro.  No Piglit regressions.
> 
> Nice. We need more bits like this ;)
> 
> > 
> > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > ---
> >  drivers/gpu/drm/i915/i915_reg.h | 1 +
> >  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
> >  2 files changed, 5 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> > index 77fce96..340821a 100644
> > --- a/drivers/gpu/drm/i915/i915_reg.h
> > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > @@ -5952,6 +5952,7 @@ enum punit_power_well {
> >  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
> >  
> >  #define HALF_SLICE_CHICKEN3		0xe184
> > +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
> >  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
> >  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
> >  
> > diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
> > index 7a69eba..50c72a7 100644
> > --- a/drivers/gpu/drm/i915/intel_pm.c
> > +++ b/drivers/gpu/drm/i915/intel_pm.c
> > @@ -5736,6 +5736,10 @@ static void haswell_init_clock_gating(struct drm_device *dev)
> >  	I915_WRITE(GEN7_GT_MODE,
> >  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
> >  
> > +	/* Make sample_c messages faster. */
> 
> I found a name for it in the w/a database.
> 
> WaSampleCChickenBitEnable:hsw
> 
> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

Oh actually it says palette won't work when this bit is on. I'm assuming
that's the texture palette. Do we have any use of that anywhere?

> 
> > +	I915_WRITE(HALF_SLICE_CHICKEN3,
> > +		   _MASKED_BIT_ENABLE(HSW_SAMPLE_C_PERFORMANCE));
> > +
> >  	/* WaSwitchSolVfFArbitrationPriority:hsw */
> >  	I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) | HSW_ECOCHK_ARB_PRIO_SOL);
> >  
> > -- 
> > 2.1.2
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> -- 
> Ville Syrjälä
> Intel OTC
Kenneth Graunke Oct. 30, 2014, 9:32 a.m. UTC | #3
On Thursday, October 30, 2014 11:00:51 AM Ville Syrjälä wrote:
> On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
> > On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> > > Haswell significantly improved the performance of sampler_c messages,
> > > but the optimization appears to be off by default.  Later platforms
> > > remove this bit, and apparently always enable the optimization.
> > > 
> > > Improves performance in "Counter Strike: Global Offensive" by 18%
> > > at default settings on Iris Pro.  No Piglit regressions.
> > 
> > Nice. We need more bits like this ;)
> > 
> > > 
> > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > ---
> > >  drivers/gpu/drm/i915/i915_reg.h | 1 +
> > >  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
> > >  2 files changed, 5 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_reg.h 
b/drivers/gpu/drm/i915/i915_reg.h
> > > index 77fce96..340821a 100644
> > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > @@ -5952,6 +5952,7 @@ enum punit_power_well {
> > >  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
> > >  
> > >  #define HALF_SLICE_CHICKEN3		0xe184
> > > +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
> > >  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
> > >  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
> > >  
> > > diff --git a/drivers/gpu/drm/i915/intel_pm.c 
b/drivers/gpu/drm/i915/intel_pm.c
> > > index 7a69eba..50c72a7 100644
> > > --- a/drivers/gpu/drm/i915/intel_pm.c
> > > +++ b/drivers/gpu/drm/i915/intel_pm.c
> > > @@ -5736,6 +5736,10 @@ static void haswell_init_clock_gating(struct 
drm_device *dev)
> > >  	I915_WRITE(GEN7_GT_MODE,
> > >  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
> > >  
> > > +	/* Make sample_c messages faster. */
> > 
> > I found a name for it in the w/a database.
> > 
> > WaSampleCChickenBitEnable:hsw
> > 
> > Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> Oh actually it says palette won't work when this bit is on. I'm assuming
> that's the texture palette. Do we have any use of that anywhere?

That's a good point.  3DSTATE_SAMPLER_PALETTE_LOAD and the A8P8/indexed 
formats aren't used by Mesa or xf86-video-intel, but it looks like they might 
be used by libva.

Can someone confirm that libva does use the sampler palette?

If they do, what do we do about it?

--Ken
Ville Syrjälä Oct. 30, 2014, 11:01 a.m. UTC | #4
On Thu, Oct 30, 2014 at 02:32:40AM -0700, Kenneth Graunke wrote:
> On Thursday, October 30, 2014 11:00:51 AM Ville Syrjälä wrote:
> > On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
> > > On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> > > > Haswell significantly improved the performance of sampler_c messages,
> > > > but the optimization appears to be off by default.  Later platforms
> > > > remove this bit, and apparently always enable the optimization.
> > > > 
> > > > Improves performance in "Counter Strike: Global Offensive" by 18%
> > > > at default settings on Iris Pro.  No Piglit regressions.
> > > 
> > > Nice. We need more bits like this ;)
> > > 
> > > > 
> > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_reg.h | 1 +
> > > >  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
> > > >  2 files changed, 5 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_reg.h 
> b/drivers/gpu/drm/i915/i915_reg.h
> > > > index 77fce96..340821a 100644
> > > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > > @@ -5952,6 +5952,7 @@ enum punit_power_well {
> > > >  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
> > > >  
> > > >  #define HALF_SLICE_CHICKEN3		0xe184
> > > > +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
> > > >  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
> > > >  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/intel_pm.c 
> b/drivers/gpu/drm/i915/intel_pm.c
> > > > index 7a69eba..50c72a7 100644
> > > > --- a/drivers/gpu/drm/i915/intel_pm.c
> > > > +++ b/drivers/gpu/drm/i915/intel_pm.c
> > > > @@ -5736,6 +5736,10 @@ static void haswell_init_clock_gating(struct 
> drm_device *dev)
> > > >  	I915_WRITE(GEN7_GT_MODE,
> > > >  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
> > > >  
> > > > +	/* Make sample_c messages faster. */
> > > 
> > > I found a name for it in the w/a database.
> > > 
> > > WaSampleCChickenBitEnable:hsw
> > > 
> > > Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > Oh actually it says palette won't work when this bit is on. I'm assuming
> > that's the texture palette. Do we have any use of that anywhere?
> 
> That's a good point.  3DSTATE_SAMPLER_PALETTE_LOAD and the A8P8/indexed 
> formats aren't used by Mesa or xf86-video-intel, but it looks like they might 
> be used by libva.
> 
> Can someone confirm that libva does use the sampler palette?
> 
> If they do, what do we do about it?

I suppose the best option then would be to use an LRI from a batch,
which means the register would need to be added to the cmd parser
white list. This is one of the context saved registers so doing the
LRI just once per context should be enough.

Well that's assuming libva doesn't use the default context. I'm getting
another itch to drop the restore inhibit flag for default contexts.
That would actually make it possible to do these sort of things without
risking breakage to existing userspace. But I think Chris is going
scream unless the patch comes with performance data that shows it
doesn't hurt too much.
Kenneth Graunke Oct. 30, 2014, 5:32 p.m. UTC | #5
On Thursday, October 30, 2014 01:01:30 PM Ville Syrjälä wrote:
> On Thu, Oct 30, 2014 at 02:32:40AM -0700, Kenneth Graunke wrote:
> > On Thursday, October 30, 2014 11:00:51 AM Ville Syrjälä wrote:
> > > On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
> > > > On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> > > > > Haswell significantly improved the performance of sampler_c 
messages,
> > > > > but the optimization appears to be off by default.  Later platforms
> > > > > remove this bit, and apparently always enable the optimization.
> > > > > 
> > > > > Improves performance in "Counter Strike: Global Offensive" by 18%
> > > > > at default settings on Iris Pro.  No Piglit regressions.
> > > > 
> > > > Nice. We need more bits like this ;)
> > > > 
> > > > > 
> > > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/i915_reg.h | 1 +
> > > > >  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
> > > > >  2 files changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_reg.h 
> > b/drivers/gpu/drm/i915/i915_reg.h
> > > > > index 77fce96..340821a 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > > > @@ -5952,6 +5952,7 @@ enum punit_power_well {
> > > > >  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
> > > > >  
> > > > >  #define HALF_SLICE_CHICKEN3		0xe184
> > > > > +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
> > > > >  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
> > > > >  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
> > > > >  
> > > > > diff --git a/drivers/gpu/drm/i915/intel_pm.c 
> > b/drivers/gpu/drm/i915/intel_pm.c
> > > > > index 7a69eba..50c72a7 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_pm.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_pm.c
> > > > > @@ -5736,6 +5736,10 @@ static void haswell_init_clock_gating(struct 
> > drm_device *dev)
> > > > >  	I915_WRITE(GEN7_GT_MODE,
> > > > >  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
> > > > >  
> > > > > +	/* Make sample_c messages faster. */
> > > > 
> > > > I found a name for it in the w/a database.
> > > > 
> > > > WaSampleCChickenBitEnable:hsw
> > > > 
> > > > Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > 
> > > Oh actually it says palette won't work when this bit is on. I'm assuming
> > > that's the texture palette. Do we have any use of that anywhere?
> > 
> > That's a good point.  3DSTATE_SAMPLER_PALETTE_LOAD and the A8P8/indexed 
> > formats aren't used by Mesa or xf86-video-intel, but it looks like they 
might 
> > be used by libva.
> > 
> > Can someone confirm that libva does use the sampler palette?
> > 
> > If they do, what do we do about it?
> 
> I suppose the best option then would be to use an LRI from a batch,
> which means the register would need to be added to the cmd parser
> white list. This is one of the context saved registers so doing the
> LRI just once per context should be enough.

I don't like that solution.  For one, it's impossible - you can't LRI from 
userspace batches, even if you add it to the kernel command parser's 
whitelist, because the hardware scanner is still enabled.  Given that I've 
been waiting two years for this capability, I want to find a more immediate 
solution.

Another option is to have some sort of execbuf flag...maybe a 3D/Media "usage" 
flag.  If set to 3D, write 0x6000200...if media, write 0x6000000.  Or 
something specific.  I do hate adding more junk to the execbuf path, though.

Other ideas?

> Well that's assuming libva doesn't use the default context. I'm getting
> another itch to drop the restore inhibit flag for default contexts.
> That would actually make it possible to do these sort of things without
> risking breakage to existing userspace. But I think Chris is going
> scream unless the patch comes with performance data that shows it
> doesn't hurt too much.

I suppose it wouldn't affect Mesa much, since we never use the default context 
on Gen6+.  But otherwise I'd probably want to see the data, like Chris...
Ville Syrjälä Oct. 30, 2014, 7:26 p.m. UTC | #6
On Thu, Oct 30, 2014 at 10:32:38AM -0700, Kenneth Graunke wrote:
> On Thursday, October 30, 2014 01:01:30 PM Ville Syrjälä wrote:
> > On Thu, Oct 30, 2014 at 02:32:40AM -0700, Kenneth Graunke wrote:
> > > On Thursday, October 30, 2014 11:00:51 AM Ville Syrjälä wrote:
> > > > On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
> > > > > On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> > > > > > Haswell significantly improved the performance of sampler_c 
> messages,
> > > > > > but the optimization appears to be off by default.  Later platforms
> > > > > > remove this bit, and apparently always enable the optimization.
> > > > > > 
> > > > > > Improves performance in "Counter Strike: Global Offensive" by 18%
> > > > > > at default settings on Iris Pro.  No Piglit regressions.
> > > > > 
> > > > > Nice. We need more bits like this ;)
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/i915_reg.h | 1 +
> > > > > >  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
> > > > > >  2 files changed, 5 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_reg.h 
> > > b/drivers/gpu/drm/i915/i915_reg.h
> > > > > > index 77fce96..340821a 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > > > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > > > > @@ -5952,6 +5952,7 @@ enum punit_power_well {
> > > > > >  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
> > > > > >  
> > > > > >  #define HALF_SLICE_CHICKEN3		0xe184
> > > > > > +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
> > > > > >  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
> > > > > >  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
> > > > > >  
> > > > > > diff --git a/drivers/gpu/drm/i915/intel_pm.c 
> > > b/drivers/gpu/drm/i915/intel_pm.c
> > > > > > index 7a69eba..50c72a7 100644
> > > > > > --- a/drivers/gpu/drm/i915/intel_pm.c
> > > > > > +++ b/drivers/gpu/drm/i915/intel_pm.c
> > > > > > @@ -5736,6 +5736,10 @@ static void haswell_init_clock_gating(struct 
> > > drm_device *dev)
> > > > > >  	I915_WRITE(GEN7_GT_MODE,
> > > > > >  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
> > > > > >  
> > > > > > +	/* Make sample_c messages faster. */
> > > > > 
> > > > > I found a name for it in the w/a database.
> > > > > 
> > > > > WaSampleCChickenBitEnable:hsw
> > > > > 
> > > > > Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > 
> > > > Oh actually it says palette won't work when this bit is on. I'm assuming
> > > > that's the texture palette. Do we have any use of that anywhere?
> > > 
> > > That's a good point.  3DSTATE_SAMPLER_PALETTE_LOAD and the A8P8/indexed 
> > > formats aren't used by Mesa or xf86-video-intel, but it looks like they 
> might 
> > > be used by libva.
> > > 
> > > Can someone confirm that libva does use the sampler palette?
> > > 
> > > If they do, what do we do about it?
> > 
> > I suppose the best option then would be to use an LRI from a batch,
> > which means the register would need to be added to the cmd parser
> > white list. This is one of the context saved registers so doing the
> > LRI just once per context should be enough.
> 
> I don't like that solution.  For one, it's impossible - you can't LRI from 
> userspace batches, even if you add it to the kernel command parser's 
> whitelist, because the hardware scanner is still enabled.  Given that I've 
> been waiting two years for this capability, I want to find a more immediate 
> solution.

Ah. I've somehow convinced myself the cmd parser might actually be doing
something besides just eating CPU cycles these days. But I guess not.

> 
> Another option is to have some sort of execbuf flag...maybe a 3D/Media "usage" 
> flag.  If set to 3D, write 0x6000200...if media, write 0x6000000.  Or 
> something specific.  I do hate adding more junk to the execbuf path, though.
> 
> Other ideas?

Fast vs. slow flag? :)

More seriously, one somewhat crappy option would be to initialize that
bit to 1 for all explicit contexts, and then have the kernel always turn
it off before executing something with the default context. It's not
unlike how we imagined the RS stuff would work since old userspace
doesn't know to turn RS off when using the default context.

But if we would do something like this, I think it would be nice if we
could just add it as a temporary hack and potentially drop it if we
manage to lose the restore inhibit flag and someone finishes the
cmd parser. So maybe Mesa could also try to set it from a batch,
which currently would be a nop or rejected by the cmd parser, but
when the cmd parser starts to work it might actually do something?

But always initializing it to 1 for all explicit context in the kernel
obviously requires that libva doesn't use an explicit context itself.
In case it does, another idea would be to just add a hack to the cmd
parser to emit the LRI on the ring when it enconters this register.
Not very pretty, but should work somewhat. It wouldn't allow mid-batch
changes to the register value though, but enough for setting it once
for each context in Mesa. And of course we'd still need the kernel to
turn it off before any default context will see it.

> 
> > Well that's assuming libva doesn't use the default context. I'm getting
> > another itch to drop the restore inhibit flag for default contexts.
> > That would actually make it possible to do these sort of things without
> > risking breakage to existing userspace. But I think Chris is going
> > scream unless the patch comes with performance data that shows it
> > doesn't hurt too much.
> 
> I suppose it wouldn't affect Mesa much, since we never use the default context 
> on Gen6+.  But otherwise I'd probably want to see the data, like Chris...

Yeah. I just have this feeling we're going to have to go there
eventually since there are plenty of registers in the context. Some of
those registers might have other interesting bits that certain users
might want to frob from batches. But I suppose we could always have the
kernel reset those too once they get added to the cmd parser white
list. Except on new platforms where the hw scanner itself has a whitelist,
so we'd have to reset them all basically, unless we want the cmd parser
to still scan every batch on those platforms.
Kenneth Graunke Oct. 30, 2014, 7:57 p.m. UTC | #7
On Thursday, October 30, 2014 09:26:01 PM Ville Syrjälä wrote:
> On Thu, Oct 30, 2014 at 10:32:38AM -0700, Kenneth Graunke wrote:
> > On Thursday, October 30, 2014 01:01:30 PM Ville Syrjälä wrote:
> > > On Thu, Oct 30, 2014 at 02:32:40AM -0700, Kenneth Graunke wrote:
> > > > On Thursday, October 30, 2014 11:00:51 AM Ville Syrjälä wrote:
> > > > > On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
> > > > > > On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> > > > > > > Haswell significantly improved the performance of sampler_c 
> > messages,
> > > > > > > but the optimization appears to be off by default.  Later 
platforms
> > > > > > > remove this bit, and apparently always enable the optimization.
> > > > > > > 
> > > > > > > Improves performance in "Counter Strike: Global Offensive" by 
18%
> > > > > > > at default settings on Iris Pro.  No Piglit regressions.
> > > > > > 
> > > > > > Nice. We need more bits like this ;)
> > > > > > 
> > > > > > > 
> > > > > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/i915/i915_reg.h | 1 +
> > > > > > >  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
> > > > > > >  2 files changed, 5 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/i915/i915_reg.h 
> > > > b/drivers/gpu/drm/i915/i915_reg.h
> > > > > > > index 77fce96..340821a 100644
> > > > > > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > > > > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > > > > > @@ -5952,6 +5952,7 @@ enum punit_power_well {
> > > > > > >  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
> > > > > > >  
> > > > > > >  #define HALF_SLICE_CHICKEN3		0xe184
> > > > > > > +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
> > > > > > >  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
> > > > > > >  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
> > > > > > >  
> > > > > > > diff --git a/drivers/gpu/drm/i915/intel_pm.c 
> > > > b/drivers/gpu/drm/i915/intel_pm.c
> > > > > > > index 7a69eba..50c72a7 100644
> > > > > > > --- a/drivers/gpu/drm/i915/intel_pm.c
> > > > > > > +++ b/drivers/gpu/drm/i915/intel_pm.c
> > > > > > > @@ -5736,6 +5736,10 @@ static void 
haswell_init_clock_gating(struct 
> > > > drm_device *dev)
> > > > > > >  	I915_WRITE(GEN7_GT_MODE,
> > > > > > >  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
> > > > > > >  
> > > > > > > +	/* Make sample_c messages faster. */
> > > > > > 
> > > > > > I found a name for it in the w/a database.
> > > > > > 
> > > > > > WaSampleCChickenBitEnable:hsw
> > > > > > 
> > > > > > Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > 
> > > > > Oh actually it says palette won't work when this bit is on. I'm 
assuming
> > > > > that's the texture palette. Do we have any use of that anywhere?
> > > > 
> > > > That's a good point.  3DSTATE_SAMPLER_PALETTE_LOAD and the 
A8P8/indexed 
> > > > formats aren't used by Mesa or xf86-video-intel, but it looks like 
they 
> > might 
> > > > be used by libva.
> > > > 
> > > > Can someone confirm that libva does use the sampler palette?
> > > > 
> > > > If they do, what do we do about it?
> > > 
> > > I suppose the best option then would be to use an LRI from a batch,
> > > which means the register would need to be added to the cmd parser
> > > white list. This is one of the context saved registers so doing the
> > > LRI just once per context should be enough.
> > 
> > I don't like that solution.  For one, it's impossible - you can't LRI from 
> > userspace batches, even if you add it to the kernel command parser's 
> > whitelist, because the hardware scanner is still enabled.  Given that I've 
> > been waiting two years for this capability, I want to find a more 
immediate 
> > solution.
> 
> Ah. I've somehow convinced myself the cmd parser might actually be doing
> something besides just eating CPU cycles these days. But I guess not.
> 
> > 
> > Another option is to have some sort of execbuf flag...maybe a 3D/Media 
"usage" 
> > flag.  If set to 3D, write 0x6000200...if media, write 0x6000000.  Or 
> > something specific.  I do hate adding more junk to the execbuf path, 
though.
> > 
> > Other ideas?
> 
> Fast vs. slow flag? :)
> 
> More seriously, one somewhat crappy option would be to initialize that
> bit to 1 for all explicit contexts, and then have the kernel always turn
> it off before executing something with the default context. It's not
> unlike how we imagined the RS stuff would work since old userspace
> doesn't know to turn RS off when using the default context.

Interesting idea - that might work.  We don't need mid-batch changes either.

I don't think HALF_SLICE_CHICKEN3 is part of the logical context, FWIW.

Before we get too much further...we should check if libva is actually broken.  
I don't know if this means the sampler palette completely doesn't work, or if 
it just means sample_c doesn't work with the palette.  If it's the latter, 
we're probably fine, because I doubt libva uses sample_c.

--Ken
Ville Syrjälä Oct. 31, 2014, 9:27 a.m. UTC | #8
On Thu, Oct 30, 2014 at 12:57:04PM -0700, Kenneth Graunke wrote:
> On Thursday, October 30, 2014 09:26:01 PM Ville Syrjälä wrote:
> > On Thu, Oct 30, 2014 at 10:32:38AM -0700, Kenneth Graunke wrote:
> > > On Thursday, October 30, 2014 01:01:30 PM Ville Syrjälä wrote:
> > > > On Thu, Oct 30, 2014 at 02:32:40AM -0700, Kenneth Graunke wrote:
> > > > > On Thursday, October 30, 2014 11:00:51 AM Ville Syrjälä wrote:
> > > > > > On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
> > > > > > > On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
> > > > > > > > Haswell significantly improved the performance of sampler_c 
> > > messages,
> > > > > > > > but the optimization appears to be off by default.  Later 
> platforms
> > > > > > > > remove this bit, and apparently always enable the optimization.
> > > > > > > > 
> > > > > > > > Improves performance in "Counter Strike: Global Offensive" by 
> 18%
> > > > > > > > at default settings on Iris Pro.  No Piglit regressions.
> > > > > > > 
> > > > > > > Nice. We need more bits like this ;)
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
> > > > > > > > ---
> > > > > > > >  drivers/gpu/drm/i915/i915_reg.h | 1 +
> > > > > > > >  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
> > > > > > > >  2 files changed, 5 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_reg.h 
> > > > > b/drivers/gpu/drm/i915/i915_reg.h
> > > > > > > > index 77fce96..340821a 100644
> > > > > > > > --- a/drivers/gpu/drm/i915/i915_reg.h
> > > > > > > > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > > > > > > > @@ -5952,6 +5952,7 @@ enum punit_power_well {
> > > > > > > >  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
> > > > > > > >  
> > > > > > > >  #define HALF_SLICE_CHICKEN3		0xe184
> > > > > > > > +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
> > > > > > > >  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
> > > > > > > >  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
> > > > > > > >  
> > > > > > > > diff --git a/drivers/gpu/drm/i915/intel_pm.c 
> > > > > b/drivers/gpu/drm/i915/intel_pm.c
> > > > > > > > index 7a69eba..50c72a7 100644
> > > > > > > > --- a/drivers/gpu/drm/i915/intel_pm.c
> > > > > > > > +++ b/drivers/gpu/drm/i915/intel_pm.c
> > > > > > > > @@ -5736,6 +5736,10 @@ static void 
> haswell_init_clock_gating(struct 
> > > > > drm_device *dev)
> > > > > > > >  	I915_WRITE(GEN7_GT_MODE,
> > > > > > > >  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
> > > > > > > >  
> > > > > > > > +	/* Make sample_c messages faster. */
> > > > > > > 
> > > > > > > I found a name for it in the w/a database.
> > > > > > > 
> > > > > > > WaSampleCChickenBitEnable:hsw
> > > > > > > 
> > > > > > > Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > 
> > > > > > Oh actually it says palette won't work when this bit is on. I'm 
> assuming
> > > > > > that's the texture palette. Do we have any use of that anywhere?
> > > > > 
> > > > > That's a good point.  3DSTATE_SAMPLER_PALETTE_LOAD and the 
> A8P8/indexed 
> > > > > formats aren't used by Mesa or xf86-video-intel, but it looks like 
> they 
> > > might 
> > > > > be used by libva.
> > > > > 
> > > > > Can someone confirm that libva does use the sampler palette?
> > > > > 
> > > > > If they do, what do we do about it?
> > > > 
> > > > I suppose the best option then would be to use an LRI from a batch,
> > > > which means the register would need to be added to the cmd parser
> > > > white list. This is one of the context saved registers so doing the
> > > > LRI just once per context should be enough.
> > > 
> > > I don't like that solution.  For one, it's impossible - you can't LRI from 
> > > userspace batches, even if you add it to the kernel command parser's 
> > > whitelist, because the hardware scanner is still enabled.  Given that I've 
> > > been waiting two years for this capability, I want to find a more 
> immediate 
> > > solution.
> > 
> > Ah. I've somehow convinced myself the cmd parser might actually be doing
> > something besides just eating CPU cycles these days. But I guess not.
> > 
> > > 
> > > Another option is to have some sort of execbuf flag...maybe a 3D/Media 
> "usage" 
> > > flag.  If set to 3D, write 0x6000200...if media, write 0x6000000.  Or 
> > > something specific.  I do hate adding more junk to the execbuf path, 
> though.
> > > 
> > > Other ideas?
> > 
> > Fast vs. slow flag? :)
> > 
> > More seriously, one somewhat crappy option would be to initialize that
> > bit to 1 for all explicit contexts, and then have the kernel always turn
> > it off before executing something with the default context. It's not
> > unlike how we imagined the RS stuff would work since old userspace
> > doesn't know to turn RS off when using the default context.
> 
> Interesting idea - that might work.  We don't need mid-batch changes either.
> 
> I don't think HALF_SLICE_CHICKEN3 is part of the logical context, FWIW.

Oh but it is. Lots of chickens like to nest in the context.

> 
> Before we get too much further...we should check if libva is actually broken.  
> I don't know if this means the sampler palette completely doesn't work, or if 
> it just means sample_c doesn't work with the palette.  If it's the latter, 
> we're probably fine, because I doubt libva uses sample_c.

Yeah if we wouldn't break any existing userspace I guess we could just
flip the switch in the kernel. If anyone later wants to start doing
something that no longer works they'd have to deal with disabling the
bit using an LRI.
Jani Nikula Oct. 31, 2014, 9:43 a.m. UTC | #9
[off-topic]

On Fri, 31 Oct 2014, Ville Syrjälä <ville.syrjala@linux.intel.com> wrote:
> On Thu, Oct 30, 2014 at 12:57:04PM -0700, Kenneth Graunke wrote:
>> I don't think HALF_SLICE_CHICKEN3 is part of the logical context, FWIW.
>
> Oh but it is. Lots of chickens like to nest in the context.

In the context of lunch, HALF_SLICE_CHICKEN3 is indeed part of the
context. But it has certainly ceased to nest.


BR,
Jani.
Daniel Vetter Nov. 3, 2014, 12:48 p.m. UTC | #10
On Fri, Oct 31, 2014 at 11:27:33AM +0200, Ville Syrjälä wrote:
> On Thu, Oct 30, 2014 at 12:57:04PM -0700, Kenneth Graunke wrote:
> > Before we get too much further...we should check if libva is actually broken.
> > I don't know if this means the sampler palette completely doesn't work, or if
> > it just means sample_c doesn't work with the palette.  If it's the latter,
> > we're probably fine, because I doubt libva uses sample_c.
>
> Yeah if we wouldn't break any existing userspace I guess we could just
> flip the switch in the kernel. If anyone later wants to start doing
> something that no longer works they'd have to deal with disabling the
> bit using an LRI.

It very much looks like libva uses palettes, since it supports C8 and C4
image formats (well some crazy fourcc nonsense, but meh). And it does so
on all generations support by the libva driver, i.e. including hsw afaict

Cc'ing people and lists with more clue who should be able to tell whether
its not just there but actually works ...
-Daniel
Dave Gordon Nov. 3, 2014, 4:15 p.m. UTC | #11
On 30/10/14 19:26, Ville Syrjälä wrote:
> On Thu, Oct 30, 2014 at 10:32:38AM -0700, Kenneth Graunke wrote:
>> On Thursday, October 30, 2014 01:01:30 PM Ville Syrjälä wrote:
>>> On Thu, Oct 30, 2014 at 02:32:40AM -0700, Kenneth Graunke wrote:
>>>> On Thursday, October 30, 2014 11:00:51 AM Ville Syrjälä wrote:
>>>>> On Thu, Oct 30, 2014 at 10:50:03AM +0200, Ville Syrjälä wrote:
>>>>>> On Wed, Oct 29, 2014 at 03:12:43PM -0700, Kenneth Graunke wrote:
>>>>>>> Haswell significantly improved the performance of sampler_c 
>> messages,
>>>>>>> but the optimization appears to be off by default.  Later platforms
>>>>>>> remove this bit, and apparently always enable the optimization.
>>>>>>>
>>>>>>> Improves performance in "Counter Strike: Global Offensive" by 18%
>>>>>>> at default settings on Iris Pro.  No Piglit regressions.
>>>>>>
>>>>>> Nice. We need more bits like this ;)
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/i915/i915_reg.h | 1 +
>>>>>>>  drivers/gpu/drm/i915/intel_pm.c | 4 ++++
>>>>>>>  2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_reg.h 
>>>> b/drivers/gpu/drm/i915/i915_reg.h
>>>>>>> index 77fce96..340821a 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_reg.h
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_reg.h
>>>>>>> @@ -5952,6 +5952,7 @@ enum punit_power_well {
>>>>>>>  #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
>>>>>>>  
>>>>>>>  #define HALF_SLICE_CHICKEN3		0xe184
>>>>>>> +#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
>>>>>>>  #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
>>>>>>>  #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
>>>>>>>  
>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_pm.c 
>>>> b/drivers/gpu/drm/i915/intel_pm.c
>>>>>>> index 7a69eba..50c72a7 100644
>>>>>>> --- a/drivers/gpu/drm/i915/intel_pm.c
>>>>>>> +++ b/drivers/gpu/drm/i915/intel_pm.c
>>>>>>> @@ -5736,6 +5736,10 @@ static void haswell_init_clock_gating(struct 
>>>> drm_device *dev)
>>>>>>>  	I915_WRITE(GEN7_GT_MODE,
>>>>>>>  		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
>>>>>>>  
>>>>>>> +	/* Make sample_c messages faster. */
>>>>>>
>>>>>> I found a name for it in the w/a database.
>>>>>>
>>>>>> WaSampleCChickenBitEnable:hsw
>>>>>>
>>>>>> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>>>
>>>>> Oh actually it says palette won't work when this bit is on. I'm assuming
>>>>> that's the texture palette. Do we have any use of that anywhere?
>>>>
>>>> That's a good point.  3DSTATE_SAMPLER_PALETTE_LOAD and the A8P8/indexed 
>>>> formats aren't used by Mesa or xf86-video-intel, but it looks like they
>>>> might be used by libva.
>>>>
>>>> Can someone confirm that libva does use the sampler palette?
>>>>
>>>> If they do, what do we do about it?
>>>
>>> I suppose the best option then would be to use an LRI from a batch,
>>> which means the register would need to be added to the cmd parser
>>> white list. This is one of the context saved registers so doing the
>>> LRI just once per context should be enough.
>>
>> I don't like that solution.  For one, it's impossible - you can't LRI from 
>> userspace batches, even if you add it to the kernel command parser's 
>> whitelist, because the hardware scanner is still enabled.  Given that I've 
>> been waiting two years for this capability, I want to find a more immediate 
>> solution.
> 
> Ah. I've somehow convinced myself the cmd parser might actually be doing
> something besides just eating CPU cycles these days. But I guess not.
> 
>>
>> Another option is to have some sort of execbuf flag...maybe a 3D/Media "usage" 
>> flag.  If set to 3D, write 0x6000200...if media, write 0x6000000.  Or 
>> something specific.  I do hate adding more junk to the execbuf path, though.
>>
>> Other ideas?
> 
> Fast vs. slow flag? :)
> 
> More seriously, one somewhat crappy option would be to initialize that
> bit to 1 for all explicit contexts, and then have the kernel always turn
> it off before executing something with the default context. It's not
> unlike how we imagined the RS stuff would work since old userspace
> doesn't know to turn RS off when using the default context.
> 
> But if we would do something like this, I think it would be nice if we
> could just add it as a temporary hack and potentially drop it if we
> manage to lose the restore inhibit flag and someone finishes the
> cmd parser. So maybe Mesa could also try to set it from a batch,
> which currently would be a nop or rejected by the cmd parser, but
> when the cmd parser starts to work it might actually do something?
> 
> But always initializing it to 1 for all explicit context in the kernel
> obviously requires that libva doesn't use an explicit context itself.
> In case it does, another idea would be to just add a hack to the cmd
> parser to emit the LRI on the ring when it enconters this register.
> Not very pretty, but should work somewhat. It wouldn't allow mid-batch
> changes to the register value though, but enough for setting it once
> for each context in Mesa. And of course we'd still need the kernel to
> turn it off before any default context will see it.

This looks like a use for the proposed DRM_I915_GEM_CONTEXT_CREATE2
ioctl, which (so far) allows setting of scheduling priority for batches
submitted in the new context, and/or declaring that the context will be
used for GPGPU batches. We could add some more bits so that the creator
could declare the context as being specifically for 3d, or for media,
so that the driver can issue the relevant LRIs for such contexts
without affecting users of the default (or any other) context.

.Dave.

>>> Well that's assuming libva doesn't use the default context. I'm getting
>>> another itch to drop the restore inhibit flag for default contexts.
>>> That would actually make it possible to do these sort of things without
>>> risking breakage to existing userspace. But I think Chris is going
>>> scream unless the patch comes with performance data that shows it
>>> doesn't hurt too much.
>>
>> I suppose it wouldn't affect Mesa much, since we never use the default context 
>> on Gen6+.  But otherwise I'd probably want to see the data, like Chris...
> 
> Yeah. I just have this feeling we're going to have to go there
> eventually since there are plenty of registers in the context. Some of
> those registers might have other interesting bits that certain users
> might want to frob from batches. But I suppose we could always have the
> kernel reset those too once they get added to the cmd parser white
> list. Except on new platforms where the hw scanner itself has a whitelist,
> so we'd have to reset them all basically, unless we want the cmd parser
> to still scan every batch on those platforms.
Xiang, Haihao Nov. 4, 2014, 8:48 a.m. UTC | #12
On Mon, 2014-11-03 at 13:48 +0100, Daniel Vetter wrote:
> On Fri, Oct 31, 2014 at 11:27:33AM +0200, Ville Syrjälä wrote:
> > On Thu, Oct 30, 2014 at 12:57:04PM -0700, Kenneth Graunke wrote:
> > > Before we get too much further...we should check if libva is actually broken.
> > > I don't know if this means the sampler palette completely doesn't work, or if
> > > it just means sample_c doesn't work with the palette.  If it's the latter,
> > > we're probably fine, because I doubt libva uses sample_c.
> >
> > Yeah if we wouldn't break any existing userspace I guess we could just
> > flip the switch in the kernel. If anyone later wants to start doing
> > something that no longer works they'd have to deal with disabling the
> > bit using an LRI.
> 
> It very much looks like libva uses palettes, since it supports C8 and C4
> image formats (well some crazy fourcc nonsense, but meh). And it does so
> on all generations support by the libva driver, i.e. including hsw afaict
> 
> Cc'ing people and lists with more clue who should be able to tell whether
> its not just there but actually works ...

Yes, libva driver uses sample (00000) and palette on HSW

Thanks
Haihao


> -Daniel
Matt Turner Nov. 7, 2014, 6:46 p.m. UTC | #13
On Wed, Oct 29, 2014 at 2:12 PM, Kenneth Graunke <kenneth@whitecape.org> wrote:
> Haswell significantly improved the performance of sampler_c messages,
> but the optimization appears to be off by default.  Later platforms
> remove this bit, and apparently always enable the optimization.
>
> Improves performance in "Counter Strike: Global Offensive" by 18%
> at default settings on Iris Pro.  No Piglit regressions.

Discussion seems to have ended, but Mesa really needs this.

So, Kernel Team, how are we going to make sure Mesa gets this 18%
performance improvement?
Daniel Vetter Nov. 11, 2014, 10:16 a.m. UTC | #14
On Fri, Nov 07, 2014 at 10:46:31AM -0800, Matt Turner wrote:
> On Wed, Oct 29, 2014 at 2:12 PM, Kenneth Graunke <kenneth@whitecape.org> wrote:
> > Haswell significantly improved the performance of sampler_c messages,
> > but the optimization appears to be off by default.  Later platforms
> > remove this bit, and apparently always enable the optimization.
> >
> > Improves performance in "Counter Strike: Global Offensive" by 18%
> > at default settings on Iris Pro.  No Piglit regressions.
> 
> Discussion seems to have ended, but Mesa really needs this.
> 
> So, Kernel Team, how are we going to make sure Mesa gets this 18%
> performance improvement?

Someone writes the patches+tests and everything, gets it all reviewed and
I'll merge it.

I'd prefer if we could do this with the cmd parser, but that thing seems
to be eternally stuck. So a new execbuf flag is imo ok too. execbuf
already tracks a few other lonesome bits and makes sure userspace actually
gets the setting it requests, so should work out.
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 77fce96..340821a 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -5952,6 +5952,7 @@  enum punit_power_well {
 #define  HSW_ROW_CHICKEN3_L3_GLOBAL_ATOMICS_DISABLE    (1 << 6)
 
 #define HALF_SLICE_CHICKEN3		0xe184
+#define   HSW_SAMPLE_C_PERFORMANCE	(1<<9)
 #define   GEN8_CENTROID_PIXEL_OPT_DIS	(1<<8)
 #define   GEN8_SAMPLER_POWER_BYPASS_DIS	(1<<1)
 
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 7a69eba..50c72a7 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -5736,6 +5736,10 @@  static void haswell_init_clock_gating(struct drm_device *dev)
 	I915_WRITE(GEN7_GT_MODE,
 		   GEN6_WIZ_HASHING_MASK | GEN6_WIZ_HASHING_16x4);
 
+	/* Make sample_c messages faster. */
+	I915_WRITE(HALF_SLICE_CHICKEN3,
+		   _MASKED_BIT_ENABLE(HSW_SAMPLE_C_PERFORMANCE));
+
 	/* WaSwitchSolVfFArbitrationPriority:hsw */
 	I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) | HSW_ECOCHK_ARB_PRIO_SOL);