Message ID | 20210826032327.18078-12-matthew.brost@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Clean up GuC CI failures, simplify locking, and kernel DOC | expand |
On 8/25/2021 8:23 PM, Matthew Brost wrote: > When the GuC does a media reset, it copies a golden context state back > into the corrupted context's state. The address of the golden context > and the size of the engine state restore are passed in via the GuC ADS. > The i915 had a bug where it passed in the whole size of the golden > context, not the size of the engine state to restore resulting in a > memory corruption. > > Also copy the entire golden context on init rather than just the engine > state that is restored. > > Fixes: 481d458caede ("drm/i915/guc: Add golden context to GuC ADS") > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c | 28 +++++++++++++++++----- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c > index 6926919bcac6..df2734bfe078 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c > @@ -358,6 +358,11 @@ static int guc_prep_golden_context(struct intel_guc *guc, > u8 engine_class, guc_class; > struct guc_gt_system_info *info, local_info; > > + /* Skip execlist and PPGTT registers + HWSP */ > + const u32 lr_hw_context_size = 80 * sizeof(u32); > + const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + > + lr_hw_context_size; > + > /* > * Reserve the memory for the golden contexts and point GuC at it but > * leave it empty for now. The context data will be filled in later > @@ -396,7 +401,18 @@ static int guc_prep_golden_context(struct intel_guc *guc, > if (!blob) > continue; > > - blob->ads.eng_state_size[guc_class] = real_size; > + /* > + * This interface is slightly confusing. We need to pass the > + * base address of the golden context and the engine state size > + * which is not the size of the whole golden context, it is a > + * subset that the GuC uses when doing a watchdog reset. The > + * engine state size must match the size of the golden context > + * minus the first part of the golden context that the GuC does > + * not retore during reset. Currently no real way to verify this > + * other than reading the GuC spec / code and ensuring the > + * 'skip_size' below matches the value used in the GuC code. > + */ This last statement is incorrect. The skipped size is the PPHWSP + the execlists context. The size of the execlists context is defined in the specs (as part of the full context layout) and it is therefore not a magic number only available in the GuC code. With the comment fixed: Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Daniele > + blob->ads.eng_state_size[guc_class] = real_size - skip_size; > blob->ads.golden_context_lrca[guc_class] = addr_ggtt; > addr_ggtt += alloc_size; > } > @@ -437,8 +453,8 @@ static void guc_init_golden_context(struct intel_guc *guc) > u8 *ptr; > > /* Skip execlist and PPGTT registers + HWSP */ > - const u32 lr_hw_context_size = 80 * sizeof(u32); > - const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + > + __maybe_unused const u32 lr_hw_context_size = 80 * sizeof(u32); > + __maybe_unused const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + > lr_hw_context_size; > > if (!intel_uc_uses_guc_submission(>->uc)) > @@ -476,12 +492,12 @@ static void guc_init_golden_context(struct intel_guc *guc) > continue; > } > > - GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != real_size); > + GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != > + real_size - skip_size); > GEM_BUG_ON(blob->ads.golden_context_lrca[guc_class] != addr_ggtt); > addr_ggtt += alloc_size; > > - shmem_read(engine->default_state, skip_size, ptr + skip_size, > - real_size - skip_size); > + shmem_read(engine->default_state, 0, ptr, real_size); > ptr += alloc_size; > } >
On 8/25/2021 20:23, Matthew Brost wrote: > When the GuC does a media reset, it copies a golden context state back > into the corrupted context's state. The address of the golden context > and the size of the engine state restore are passed in via the GuC ADS. > The i915 had a bug where it passed in the whole size of the golden > context, not the size of the engine state to restore resulting in a > memory corruption. > > Also copy the entire golden context on init rather than just the engine > state that is restored. > > Fixes: 481d458caede ("drm/i915/guc: Add golden context to GuC ADS") > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c | 28 +++++++++++++++++----- > 1 file changed, 22 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c > index 6926919bcac6..df2734bfe078 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c > @@ -358,6 +358,11 @@ static int guc_prep_golden_context(struct intel_guc *guc, > u8 engine_class, guc_class; > struct guc_gt_system_info *info, local_info; > > + /* Skip execlist and PPGTT registers + HWSP */ > + const u32 lr_hw_context_size = 80 * sizeof(u32); > + const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + > + lr_hw_context_size; > + > /* > * Reserve the memory for the golden contexts and point GuC at it but > * leave it empty for now. The context data will be filled in later > @@ -396,7 +401,18 @@ static int guc_prep_golden_context(struct intel_guc *guc, > if (!blob) > continue; > > - blob->ads.eng_state_size[guc_class] = real_size; > + /* > + * This interface is slightly confusing. We need to pass the > + * base address of the golden context and the engine state size > + * which is not the size of the whole golden context, it is a > + * subset that the GuC uses when doing a watchdog reset. The > + * engine state size must match the size of the golden context > + * minus the first part of the golden context that the GuC does > + * not retore during reset. Currently no real way to verify this > + * other than reading the GuC spec / code and ensuring the > + * 'skip_size' below matches the value used in the GuC code. > + */ > + blob->ads.eng_state_size[guc_class] = real_size - skip_size; > blob->ads.golden_context_lrca[guc_class] = addr_ggtt; > addr_ggtt += alloc_size; > } > @@ -437,8 +453,8 @@ static void guc_init_golden_context(struct intel_guc *guc) > u8 *ptr; > > /* Skip execlist and PPGTT registers + HWSP */ > - const u32 lr_hw_context_size = 80 * sizeof(u32); > - const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + > + __maybe_unused const u32 lr_hw_context_size = 80 * sizeof(u32); > + __maybe_unused const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + > lr_hw_context_size; Not sure why the 'maybe unused'? The values are not only used in BUG_ONs or such that could vanish. More importantly, you now have two sets of definitions for these magic numbers. That seems like a very bad idea. They should be moved into a helper function rather than repeated. John. > > if (!intel_uc_uses_guc_submission(>->uc)) > @@ -476,12 +492,12 @@ static void guc_init_golden_context(struct intel_guc *guc) > continue; > } > > - GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != real_size); > + GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != > + real_size - skip_size); > GEM_BUG_ON(blob->ads.golden_context_lrca[guc_class] != addr_ggtt); > addr_ggtt += alloc_size; > > - shmem_read(engine->default_state, skip_size, ptr + skip_size, > - real_size - skip_size); > + shmem_read(engine->default_state, 0, ptr, real_size); > ptr += alloc_size; > } >
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c index 6926919bcac6..df2734bfe078 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c @@ -358,6 +358,11 @@ static int guc_prep_golden_context(struct intel_guc *guc, u8 engine_class, guc_class; struct guc_gt_system_info *info, local_info; + /* Skip execlist and PPGTT registers + HWSP */ + const u32 lr_hw_context_size = 80 * sizeof(u32); + const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + + lr_hw_context_size; + /* * Reserve the memory for the golden contexts and point GuC at it but * leave it empty for now. The context data will be filled in later @@ -396,7 +401,18 @@ static int guc_prep_golden_context(struct intel_guc *guc, if (!blob) continue; - blob->ads.eng_state_size[guc_class] = real_size; + /* + * This interface is slightly confusing. We need to pass the + * base address of the golden context and the engine state size + * which is not the size of the whole golden context, it is a + * subset that the GuC uses when doing a watchdog reset. The + * engine state size must match the size of the golden context + * minus the first part of the golden context that the GuC does + * not retore during reset. Currently no real way to verify this + * other than reading the GuC spec / code and ensuring the + * 'skip_size' below matches the value used in the GuC code. + */ + blob->ads.eng_state_size[guc_class] = real_size - skip_size; blob->ads.golden_context_lrca[guc_class] = addr_ggtt; addr_ggtt += alloc_size; } @@ -437,8 +453,8 @@ static void guc_init_golden_context(struct intel_guc *guc) u8 *ptr; /* Skip execlist and PPGTT registers + HWSP */ - const u32 lr_hw_context_size = 80 * sizeof(u32); - const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + + __maybe_unused const u32 lr_hw_context_size = 80 * sizeof(u32); + __maybe_unused const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE + lr_hw_context_size; if (!intel_uc_uses_guc_submission(>->uc)) @@ -476,12 +492,12 @@ static void guc_init_golden_context(struct intel_guc *guc) continue; } - GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != real_size); + GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != + real_size - skip_size); GEM_BUG_ON(blob->ads.golden_context_lrca[guc_class] != addr_ggtt); addr_ggtt += alloc_size; - shmem_read(engine->default_state, skip_size, ptr + skip_size, - real_size - skip_size); + shmem_read(engine->default_state, 0, ptr, real_size); ptr += alloc_size; }
When the GuC does a media reset, it copies a golden context state back into the corrupted context's state. The address of the golden context and the size of the engine state restore are passed in via the GuC ADS. The i915 had a bug where it passed in the whole size of the golden context, not the size of the engine state to restore resulting in a memory corruption. Also copy the entire golden context on init rather than just the engine state that is restored. Fixes: 481d458caede ("drm/i915/guc: Add golden context to GuC ADS") Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c | 28 +++++++++++++++++----- 1 file changed, 22 insertions(+), 6 deletions(-)