Message ID | 1434735435-14728-2-git-send-email-arun.siluvery@linux.intel.com (mailing list archive)
---|---
State | New, archived
On Fri, Jun 19, 2015 at 06:37:10PM +0100, Arun Siluvery wrote:
> Some of the WA are to be applied during context save but before restore, and
> some at the end of context save/restore but before executing the instructions
> in the ring. WA batch buffers are created for this purpose, and these WA cannot
> be applied using normal means. Each context has two registers to load the
> offsets of these batch buffers. If they are non-zero, the HW understands that it
> needs to execute these batches.
>
> v1: In this version two separate ring_buffer objects were used to load WA
> instructions for the indirect and per-context batch buffers, and they were part
> of every context.
>
> v2: Chris suggested including an additional page in the context and using it to
> load these WA instead of creating separate objects. This simplifies a lot of
> things as we need not explicitly pin/unpin them. Thomas Daniel further pointed
> out that the GuC is planning to use a similar setup to share data between the
> GuC and the driver, and WA batch buffers could probably share that page. However,
> after discussions with Dave, who is implementing the GuC changes, he suggested
> using an independent page, for these reasons: the GuC area might grow, and these
> WA are initialized only once and are not changed afterwards, so we can share
> them across all contexts.
>
> The page is updated with WA during render ring init. This has the advantage of
> not adding more special cases to default_context.
>
> We don't know upfront the number of WA we will be applying using these batch
> buffers. For this reason the size was fixed earlier, but that is not a good
> idea. To fix this, the functions that load instructions are modified to report
> the number of commands inserted, and the size is now calculated after the batch
> is updated. A macro is introduced to add commands to these batch buffers; it
> also checks for overflow and returns an error.
> We have a full page dedicated to these WA, so that should be sufficient for a
> good number of WA; anything more means we have major issues.
> The list for Gen8 is small, same for Gen9; maybe a few more get added going
> forward, but nothing close to filling the entire page. Chris suggested a
> two-pass approach, but we agreed to go with the single-page setup as it is a
> one-off routine and simpler code wins.
>
> One additional option is the offset field, which is helpful if we would like to
> have multiple batches at different offsets within the page and select them
> based on some criteria. This is not a requirement at this point but could
> help in future (Dave).
>
> Chris provided some helpful macros and suggestions which further simplified
> the code; they will also help in reducing code duplication when WA for
> other Gens are added. Add detailed comments explaining the restrictions.
>
> (Many thanks to Chris, Dave and Thomas for their reviews and inputs)
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Signed-off-by: Rafael Barbalho <rafael.barbalho@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>

Sigh, after all that, I found one minor thing, but nevertheless
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

> +#define wa_ctx_emit(batch, cmd) { \
> +	if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) { \
> +		return -ENOSPC; \
> +	} \
> +	batch[index++] = (cmd); \
> +	}

We should have wrapped this in do { } while (0) - think of all those
trailing semicolons we have in the code! Fortunately we haven't used
this in an if (foo) wa_ctx_emit(bar); else wa_ctx_emit(baz); yet.
-Chris
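[Chris's if/else remark is worth spelling out: because the macro expands to a
bare brace block, the semicolon after an invocation becomes an extra empty
statement that detaches a following else. A minimal sketch of the failure mode,
with hypothetical code, not taken from the driver:

	/*
	 * wa_ctx_emit(batch, bar); expands to "{ ... };" - the stray ';'
	 * is an empty statement that terminates the if, so the compiler
	 * rejects the now-orphaned else.
	 */
	if (foo)
		wa_ctx_emit(batch, bar);	/* becomes { ... }; */
	else					/* error: 'else' without a previous 'if' */
		wa_ctx_emit(batch, baz);
]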
On Fri, Jun 19, 2015 at 06:50:36PM +0100, Chris Wilson wrote:
> On Fri, Jun 19, 2015 at 06:37:10PM +0100, Arun Siluvery wrote:
> > Some of the WA are to be applied during context save but before restore, and
> > some at the end of context save/restore but before executing the instructions
> > in the ring. WA batch buffers are created for this purpose, and these WA cannot
> > be applied using normal means.
> > [snip]
> > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>
> Sigh, after all that, I found one minor thing, but nevertheless
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> > +#define wa_ctx_emit(batch, cmd) { \
> > +	if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) { \
> > +		return -ENOSPC; \
> > +	} \
> > +	batch[index++] = (cmd); \
> > +	}
>
> We should have wrapped this in do { } while (0) - think of all those
> trailing semicolons we have in the code! Fortunately we haven't used
> this in an if (foo) wa_ctx_emit(bar); else wa_ctx_emit(baz); yet.

Uh yes, this is a critical one. Arun, can you please do a follow-up patch
to wrap your macro in a do { } while (0) like Chris suggested? I'll apply
the patches meanwhile.

Thanks, Daniel
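[For reference, the follow-up Daniel asks for would wrap the macro body in
do { } while (0), so that an invocation plus its trailing semicolon parses as
exactly one statement and is safe inside if/else chains. A sketch of the
expected shape - an assumption about the follow-up patch, not a quote of it:

	/* Sketch only: the do { } while (0) wrapper absorbs the trailing ';'. */
	#define wa_ctx_emit(batch, cmd) do { \
		if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) { \
			return -ENOSPC; \
		} \
		batch[index++] = (cmd); \
	} while (0)
]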
On 22/06/2015 16:36, Daniel Vetter wrote:
> On Fri, Jun 19, 2015 at 06:50:36PM +0100, Chris Wilson wrote:
>> On Fri, Jun 19, 2015 at 06:37:10PM +0100, Arun Siluvery wrote:
>>> Some of the WA are to be applied during context save but before restore, and
>>> some at the end of context save/restore but before executing the instructions
>>> in the ring. WA batch buffers are created for this purpose, and these WA cannot
>>> be applied using normal means.
>>> [snip]
>>
>> We should have wrapped this in do { } while (0) - think of all those
>> trailing semicolons we have in the code! Fortunately we haven't used
>> this in an if (foo) wa_ctx_emit(bar); else wa_ctx_emit(baz); yet.
>
> Uh yes, this is a critical one. Arun, can you please do a follow-up patch
> to wrap your macro in a do { } while (0) like Chris suggested? I'll apply
> the patches meanwhile.

Hi Daniel,

Already sent the updated patch. I think I got the message-id wrong; the
updated patch that I sent is showing up as the last message in this series.

regards
Arun

> Thanks, Daniel
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0413b8f..c9255fc 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -211,6 +211,7 @@ enum {
 	FAULT_AND_CONTINUE /* Unsupported */
 };
 #define GEN8_CTX_ID_SHIFT 32
+#define CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT  0x17
 
 static int intel_lr_context_pin(struct intel_engine_cs *ring,
 		struct intel_context *ctx);
@@ -1077,6 +1078,190 @@ static int intel_logical_ring_workarounds_emit(struct intel_engine_cs *ring,
 	return 0;
 }
 
+#define wa_ctx_emit(batch, cmd) { \
+	if (WARN_ON(index >= (PAGE_SIZE / sizeof(uint32_t)))) { \
+		return -ENOSPC; \
+	} \
+	batch[index++] = (cmd); \
+	}
+
+static inline uint32_t wa_ctx_start(struct i915_wa_ctx_bb *wa_ctx,
+				    uint32_t offset,
+				    uint32_t start_alignment)
+{
+	return wa_ctx->offset = ALIGN(offset, start_alignment);
+}
+
+static inline int wa_ctx_end(struct i915_wa_ctx_bb *wa_ctx,
+			     uint32_t offset,
+			     uint32_t size_alignment)
+{
+	wa_ctx->size = offset - wa_ctx->offset;
+
+	WARN(wa_ctx->size % size_alignment,
+	     "wa_ctx_bb failed sanity checks: size %d is not aligned to %d\n",
+	     wa_ctx->size, size_alignment);
+	return 0;
+}
+
+/**
+ * gen8_init_indirectctx_bb() - initialize indirect ctx batch with WA
+ *
+ * @ring: only applicable for RCS
+ * @wa_ctx: structure representing wa_ctx
+ *    offset: specifies start of the batch, should be cache-aligned. This is
+ *            updated with the offset value received as input.
+ *    size: size of the batch in DWORDS but HW expects in terms of cachelines
+ * @batch: page in which WA are loaded
+ * @offset: This field specifies the start of the batch, it should be
+ *          cache-aligned otherwise it is adjusted accordingly.
+ *          Typically we only have one indirect_ctx and per_ctx batch buffer
+ *          which are initialized at the beginning and shared across all
+ *          contexts but this field helps us to have multiple batches at
+ *          different offsets and select them based on a criteria. At the
+ *          moment this batch always starts at the beginning of the page and
+ *          at this point we don't have multiple wa_ctx batch buffers.
+ *
+ *          The number of WA applied is not known at the beginning; we use
+ *          this field to return the number of DWORDS written.
+ *
+ *          It is to be noted that this batch does not contain
+ *          MI_BATCH_BUFFER_END so it adds NOOPs as padding to make it
+ *          cacheline aligned. MI_BATCH_BUFFER_END will be added to the perctx
+ *          batch and both of them together make a complete batch buffer.
+ *
+ * Return: non-zero if we exceed the PAGE_SIZE limit.
+ */
+static int gen8_init_indirectctx_bb(struct intel_engine_cs *ring,
+				    struct i915_wa_ctx_bb *wa_ctx,
+				    uint32_t *const batch,
+				    uint32_t *offset)
+{
+	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
+
+	/* FIXME: Replace me with WA */
+	wa_ctx_emit(batch, MI_NOOP);
+
+	/* Pad to end of cacheline */
+	while (index % CACHELINE_DWORDS)
+		wa_ctx_emit(batch, MI_NOOP);
+
+	/*
+	 * MI_BATCH_BUFFER_END is not required in Indirect ctx BB because
+	 * execution depends on the length specified in terms of cache lines
+	 * in the register CTX_RCS_INDIRECT_CTX
+	 */
+
+	return wa_ctx_end(wa_ctx, *offset = index, CACHELINE_DWORDS);
+}
+
+/**
+ * gen8_init_perctx_bb() - initialize per ctx batch with WA
+ *
+ * @ring: only applicable for RCS
+ * @wa_ctx: structure representing wa_ctx
+ *    offset: specifies start of the batch, should be cache-aligned.
+ *    size: size of the batch in DWORDS but HW expects in terms of cachelines
+ * @offset: This field specifies the start of this batch.
+ *          This batch is started immediately after the indirect_ctx batch.
+ *          Since we ensure that indirect_ctx ends on a cacheline this batch
+ *          is aligned automatically.
+ *
+ *          The number of DWORDS written is returned using this field.
+ *
+ * This batch is terminated with MI_BATCH_BUFFER_END so we need not add
+ * padding to align it with cachelines, as padding after MI_BATCH_BUFFER_END
+ * is redundant.
+ */
+static int gen8_init_perctx_bb(struct intel_engine_cs *ring,
+			       struct i915_wa_ctx_bb *wa_ctx,
+			       uint32_t *const batch,
+			       uint32_t *offset)
+{
+	uint32_t index = wa_ctx_start(wa_ctx, *offset, CACHELINE_DWORDS);
+
+	wa_ctx_emit(batch, MI_BATCH_BUFFER_END);
+
+	return wa_ctx_end(wa_ctx, *offset = index, 1);
+}
+
+static int lrc_setup_wa_ctx_obj(struct intel_engine_cs *ring, u32 size)
+{
+	int ret;
+
+	ring->wa_ctx.obj = i915_gem_alloc_object(ring->dev, PAGE_ALIGN(size));
+	if (!ring->wa_ctx.obj) {
+		DRM_DEBUG_DRIVER("alloc LRC WA ctx backing obj failed.\n");
+		return -ENOMEM;
+	}
+
+	ret = i915_gem_obj_ggtt_pin(ring->wa_ctx.obj, PAGE_SIZE, 0);
+	if (ret) {
+		DRM_DEBUG_DRIVER("pin LRC WA ctx backing obj failed: %d\n",
+				 ret);
+		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void lrc_destroy_wa_ctx_obj(struct intel_engine_cs *ring)
+{
+	if (ring->wa_ctx.obj) {
+		i915_gem_object_ggtt_unpin(ring->wa_ctx.obj);
+		drm_gem_object_unreference(&ring->wa_ctx.obj->base);
+		ring->wa_ctx.obj = NULL;
+	}
+}
+
+static int intel_init_workaround_bb(struct intel_engine_cs *ring)
+{
+	int ret;
+	uint32_t *batch;
+	uint32_t offset;
+	struct page *page;
+	struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
+
+	WARN_ON(ring->id != RCS);
+
+	ret = lrc_setup_wa_ctx_obj(ring, PAGE_SIZE);
+	if (ret) {
+		DRM_DEBUG_DRIVER("Failed to setup context WA page: %d\n", ret);
+		return ret;
+	}
+
+	page = i915_gem_object_get_page(wa_ctx->obj, 0);
+	batch = kmap_atomic(page);
+	offset = 0;
+
+	if (INTEL_INFO(ring->dev)->gen == 8) {
+		ret = gen8_init_indirectctx_bb(ring,
+					       &wa_ctx->indirect_ctx,
+					       batch,
+					       &offset);
+		if (ret)
+			goto out;
+
+		ret = gen8_init_perctx_bb(ring,
+					  &wa_ctx->per_ctx,
+					  batch,
+					  &offset);
+		if (ret)
+			goto out;
+	} else {
+		WARN(INTEL_INFO(ring->dev)->gen >= 8,
+		     "WA batch buffer is not initialized for Gen%d\n",
+		     INTEL_INFO(ring->dev)->gen);
+		lrc_destroy_wa_ctx_obj(ring);
+	}
+
+out:
+	kunmap_atomic(batch);
+	if (ret)
+		lrc_destroy_wa_ctx_obj(ring);
+
+	return ret;
+}
+
 static int gen8_init_common_ring(struct intel_engine_cs *ring)
 {
 	struct drm_device *dev = ring->dev;
@@ -1411,6 +1596,8 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *ring)
 		kunmap(sg_page(ring->status_page.obj->pages->sgl));
 		ring->status_page.obj = NULL;
 	}
+
+	lrc_destroy_wa_ctx_obj(ring);
 }
 
 static int logical_ring_init(struct drm_device *dev, struct intel_engine_cs *ring)
@@ -1474,7 +1661,22 @@ static int logical_render_ring_init(struct drm_device *dev)
 	if (ret)
 		return ret;
 
-	return intel_init_pipe_control(ring);
+	ret = intel_init_workaround_bb(ring);
+	if (ret) {
+		/*
+		 * We continue even if we fail to initialize WA batch
+		 * because we only expect rare glitches but nothing
+		 * critical to prevent us from using GPU
+		 */
+		DRM_ERROR("WA batch buffer initialization failed: %d\n",
+			  ret);
+	}
+
+	ret = intel_init_pipe_control(ring);
+	if (ret)
+		lrc_destroy_wa_ctx_obj(ring);
+
+	return ret;
 }
 
 static int logical_bsd_ring_init(struct drm_device *dev)
@@ -1754,15 +1956,27 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o
 	reg_state[CTX_SECOND_BB_STATE] = ring->mmio_base + 0x118;
 	reg_state[CTX_SECOND_BB_STATE+1] = 0;
 	if (ring->id == RCS) {
-		/* TODO: according to BSpec, the register state context
-		 * for CHV does not have these. OTOH, these registers do
-		 * exist in CHV. I'm waiting for a clarification */
 		reg_state[CTX_BB_PER_CTX_PTR] = ring->mmio_base + 0x1c0;
 		reg_state[CTX_BB_PER_CTX_PTR+1] = 0;
 		reg_state[CTX_RCS_INDIRECT_CTX] = ring->mmio_base + 0x1c4;
 		reg_state[CTX_RCS_INDIRECT_CTX+1] = 0;
 		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET] = ring->mmio_base + 0x1c8;
 		reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] = 0;
+		if (ring->wa_ctx.obj) {
+			struct i915_ctx_workarounds *wa_ctx = &ring->wa_ctx;
+			uint32_t ggtt_offset = i915_gem_obj_ggtt_offset(wa_ctx->obj);
+
+			reg_state[CTX_RCS_INDIRECT_CTX+1] =
+				(ggtt_offset + wa_ctx->indirect_ctx.offset * sizeof(uint32_t)) |
+				(wa_ctx->indirect_ctx.size / CACHELINE_DWORDS);
+
+			reg_state[CTX_RCS_INDIRECT_CTX_OFFSET+1] =
+				CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT << 6;
+
+			reg_state[CTX_BB_PER_CTX_PTR+1] =
+				(ggtt_offset + wa_ctx->per_ctx.offset * sizeof(uint32_t)) |
+				0x01;
+		}
 	}
 	reg_state[CTX_LRI_HEADER_1] = MI_LOAD_REGISTER_IMM(9);
 	reg_state[CTX_LRI_HEADER_1] |= MI_LRI_FORCE_POSTED;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 39f6dfc..06467c6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -12,6 +12,7 @@
  * workarounds!
  */
 #define CACHELINE_BYTES 64
+#define CACHELINE_DWORDS (CACHELINE_BYTES / sizeof(uint32_t))
 
 /*
  * Gen2 BSpec "1. Programming Environment" / 1.4.4.6 "Ring Buffer Use"
@@ -119,6 +120,25 @@ struct intel_ringbuffer {
 
 struct intel_context;
 
+/*
+ * we use a single page to load ctx workarounds so all of these
+ * values are referred in terms of dwords
+ *
+ * struct i915_wa_ctx_bb:
+ *  offset: specifies batch starting position, also helpful in case
+ *    if we want to have multiple batches at different offsets based on
+ *    some criteria. It is not a requirement at the moment but provides
+ *    an option for future use.
+ *  size: size of the batch in DWORDS
+ */
+struct i915_ctx_workarounds {
+	struct i915_wa_ctx_bb {
+		u32 offset;
+		u32 size;
+	} indirect_ctx, per_ctx;
+	struct drm_i915_gem_object *obj;
+};
+
 struct intel_engine_cs {
 	const char	*name;
 	enum intel_ring_id {
@@ -142,6 +162,7 @@ struct intel_engine_cs {
 	struct i915_gem_batch_pool batch_pool;
 
 	struct intel_hw_status_page status_page;
+	struct i915_ctx_workarounds wa_ctx;
 
 	unsigned irq_refcount; /* protected by dev_priv->irq_lock */
 	u32		irq_enable_mask;	/* bitmask to enable ring interrupt */
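[To illustrate the register setup in the populate_lr_context() hunk above:
both dwords pack a GGTT address together with low-order control bits, which
works because the batches are cacheline (64-byte) aligned, leaving the low
bits of the address free. A worked example with hypothetical values; the GGTT
offset and batch sizes below are assumptions for illustration only:

	/* Hypothetical numbers mirroring the populate_lr_context() code above. */
	uint32_t ggtt_offset = 0x00a00000;	/* assumed GGTT address of the WA page */
	uint32_t indirect_offset = 0;		/* indirect_ctx starts at the page start (in dwords) */
	uint32_t indirect_size = 32;		/* 32 dwords = 2 cachelines (CACHELINE_DWORDS is 16) */
	uint32_t per_ctx_offset = indirect_size; /* per_ctx batch follows immediately (in dwords) */

	/* INDIRECT_CTX: batch address in the upper bits, size in cachelines in the low bits */
	uint32_t indirect_ctx_reg = (ggtt_offset + indirect_offset * sizeof(uint32_t)) |
				    (indirect_size / CACHELINE_DWORDS);	/* = 0x00a00002 */

	/* BB_PER_CTX_PTR: batch address with bit 0 set as the valid bit */
	uint32_t per_ctx_reg = (ggtt_offset + per_ctx_offset * sizeof(uint32_t)) | 0x01;
						/* = 0x00a00081 */
]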