Message ID | 20170223231137.7682-1-michel.thierry@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Reviewed-by: Sagar Arun Kamble <sagar.a.kamble@intel.com> On 2/24/2017 4:41 AM, Michel Thierry wrote: > There was no way to check if the platform is running the latest firmware. > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com> > Signed-off-by: Michel Thierry <michel.thierry@intel.com> > --- > drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index 2b1d15668192..e022187916ee 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, > CSR_VERSION_MINOR(csr->version)); > } > > + if (HAS_GUC_UCODE(dev_priv)) { > + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; > + > + err_printf(m, "GuC loaded: %s\n", > + yesno(guc_fw->load_status == > + INTEL_UC_FIRMWARE_SUCCESS)); > + err_printf(m, "GuC fw version: %d.%d\n", > + guc_fw->major_ver_found, guc_fw->minor_ver_found); > + } > + > err_printf(m, "EIR: 0x%08x\n", error->eir); > err_printf(m, "IER: 0x%08x\n", error->ier); > for (i = 0; i < error->ngtier; i++)
On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote: > Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com> > > On 2/24/2017 4:41 AM, Michel Thierry wrote: > > There was no way to check if the platform is running the latest firmware. > > Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com> > Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com> > Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com> > --- > drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index 2b1d15668192..e022187916ee 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, > CSR_VERSION_MINOR(csr->version)); > } > > + if (HAS_GUC_UCODE(dev_priv)) { > + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; > + > + err_printf(m, "GuC loaded: %s\n", > + yesno(guc_fw->load_status == > + INTEL_UC_FIRMWARE_SUCCESS)); > + err_printf(m, "GuC fw version: %d.%d\n", > + guc_fw->major_ver_found, guc_fw->minor_ver_found); > + } > + Hmm. The firmware may change between the hang and cat /sys/class/drm/card0/error (as it will be reloaded after the reset). -Chris
On Thu, Feb 23, 2017 at 03:11:37PM -0800, Michel Thierry wrote: > There was no way to check if the platform is running the latest firmware. Can we also add similar patch for the HuC ? > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com> > Signed-off-by: Michel Thierry <michel.thierry@intel.com> > --- > drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index 2b1d15668192..e022187916ee 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, > CSR_VERSION_MINOR(csr->version)); > } > > + if (HAS_GUC_UCODE(dev_priv)) { > + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; I would preffer to use HAS_GUC and intel_guc* here. > + > + err_printf(m, "GuC loaded: %s\n", > + yesno(guc_fw->load_status == > + INTEL_UC_FIRMWARE_SUCCESS)); Hmm, as we do have more detailed load status, why limiting it to yes/no only? -Michal > + err_printf(m, "GuC fw version: %d.%d\n", > + guc_fw->major_ver_found, guc_fw->minor_ver_found); > + } > + > err_printf(m, "EIR: 0x%08x\n", error->eir); > err_printf(m, "IER: 0x%08x\n", error->ier); > for (i = 0; i < error->ngtier; i++) > -- > 2.11.0 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote: > On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote: > > Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com> > > > > On 2/24/2017 4:41 AM, Michel Thierry wrote: > > > > There was no way to check if the platform is running the latest firmware. > > > > Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com> > > Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com> > > Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com> > > --- > > drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > > index 2b1d15668192..e022187916ee 100644 > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > > @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, > > CSR_VERSION_MINOR(csr->version)); > > } > > > > + if (HAS_GUC_UCODE(dev_priv)) { > > + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; > > + > > + err_printf(m, "GuC loaded: %s\n", > > + yesno(guc_fw->load_status == > > + INTEL_UC_FIRMWARE_SUCCESS)); > > + err_printf(m, "GuC fw version: %d.%d\n", > > + guc_fw->major_ver_found, guc_fw->minor_ver_found); > > + } > > + > > Hmm. The firmware may change between the hang and cat > /sys/class/drm/card0/error (as it will be reloaded after the reset). Btw, maybe we should add counter that will be incremented on each fw reload and reported here ? -Michal
On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote: > On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote: > > On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote: > > > Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com> > > > > > > On 2/24/2017 4:41 AM, Michel Thierry wrote: > > > > > > There was no way to check if the platform is running the latest firmware. > > > > > > Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com> > > > Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com> > > > Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com> > > > --- > > > drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ > > > 1 file changed, 10 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > > > index 2b1d15668192..e022187916ee 100644 > > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > > > @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, > > > CSR_VERSION_MINOR(csr->version)); > > > } > > > > > > + if (HAS_GUC_UCODE(dev_priv)) { > > > + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; > > > + > > > + err_printf(m, "GuC loaded: %s\n", > > > + yesno(guc_fw->load_status == > > > + INTEL_UC_FIRMWARE_SUCCESS)); > > > + err_printf(m, "GuC fw version: %d.%d\n", > > > + guc_fw->major_ver_found, guc_fw->minor_ver_found); > > > + } > > > + > > > > Hmm. The firmware may change between the hang and cat > > /sys/class/drm/card0/error (as it will be reloaded after the reset). > > Btw, maybe we should add counter that will be incremented on each fw reload > and reported here ? If it occurs to you that we need it for post-mortem debugging and having it is worth more than any potential confusion.... I can see the need for knowing what guc/huc/dmc/etc was running at the time of a hang - I just hope that what was previously running before an earlier reset doesn't contribute. But that's why we focus on the first error in a system... -Chris
On 2/24/2017 4:19 PM, Chris Wilson wrote: > On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote: >> On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote: >>> On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote: >>>> Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com> >>>> >>>> On 2/24/2017 4:41 AM, Michel Thierry wrote: >>>> >>>> There was no way to check if the platform is running the latest firmware. >>>> >>>> Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com> >>>> Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com> >>>> Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com> >>>> --- >>>> drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ >>>> 1 file changed, 10 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >>>> index 2b1d15668192..e022187916ee 100644 >>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >>>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, >>>> CSR_VERSION_MINOR(csr->version)); >>>> } >>>> >>>> + if (HAS_GUC_UCODE(dev_priv)) { >>>> + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; >>>> + >>>> + err_printf(m, "GuC loaded: %s\n", >>>> + yesno(guc_fw->load_status == >>>> + INTEL_UC_FIRMWARE_SUCCESS)); >>>> + err_printf(m, "GuC fw version: %d.%d\n", >>>> + guc_fw->major_ver_found, guc_fw->minor_ver_found); >>>> + } >>>> + >>> Hmm. The firmware may change between the hang and cat >>> /sys/class/drm/card0/error (as it will be reloaded after the reset). >> Btw, maybe we should add counter that will be incremented on each fw reload >> and reported here ? > If it occurs to you that we need it for post-mortem debugging and having > it is worth more than any potential confusion.... > > I can see the need for knowing what guc/huc/dmc/etc was running at the > time of a hang - I just hope that what was previously running before an > earlier reset doesn't contribute. But that's why we focus on the first > error in a system... > -Chris > GT reset count is present already in error state. GuC kernel parameters are present and this change will help us identify which firmware issue was encountered. So I feel printing ver_found should be enough.
On 2/24/2017 2:40 AM, Michal Wajdeczko wrote: > On Thu, Feb 23, 2017 at 03:11:37PM -0800, Michel Thierry wrote: >> There was no way to check if the platform is running the latest firmware. > > Can we also add similar patch for the HuC ? > Please don't tell me the HuC can hang the gpu too. >> >> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >> Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com> >> Signed-off-by: Michel Thierry <michel.thierry@intel.com> >> --- >> drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >> index 2b1d15668192..e022187916ee 100644 >> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, >> CSR_VERSION_MINOR(csr->version)); >> } >> >> + if (HAS_GUC_UCODE(dev_priv)) { >> + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; > > I would preffer to use HAS_GUC and intel_guc* here. > > >> + >> + err_printf(m, "GuC loaded: %s\n", >> + yesno(guc_fw->load_status == >> + INTEL_UC_FIRMWARE_SUCCESS)); > > Hmm, as we do have more detailed load status, why limiting it to yes/no only? > > > -Michal > >> + err_printf(m, "GuC fw version: %d.%d\n", >> + guc_fw->major_ver_found, guc_fw->minor_ver_found); >> + } >> + >> err_printf(m, "EIR: 0x%08x\n", error->eir); >> err_printf(m, "IER: 0x%08x\n", error->ier); >> for (i = 0; i < error->ngtier; i++) >> -- >> 2.11.0 >> >> _______________________________________________ >> Intel-gfx mailing list >> Intel-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
On 2/24/2017 8:15 AM, Michel Thierry wrote: > > > On 2/24/2017 2:40 AM, Michal Wajdeczko wrote: >> On Thu, Feb 23, 2017 at 03:11:37PM -0800, Michel Thierry wrote: >>> There was no way to check if the platform is running the latest >>> firmware. >> >> Can we also add similar patch for the HuC ? >> > > Please don't tell me the HuC can hang the gpu too. > >>> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> >>> Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com> >>> Signed-off-by: Michel Thierry <michel.thierry@intel.com> >>> --- >>> drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ >>> 1 file changed, 10 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c >>> b/drivers/gpu/drm/i915/i915_gpu_error.c >>> index 2b1d15668192..e022187916ee 100644 >>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct >>> drm_i915_error_state_buf *m, >>> CSR_VERSION_MINOR(csr->version)); >>> } >>> >>> + if (HAS_GUC_UCODE(dev_priv)) { >>> + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; >> >> I would preffer to use HAS_GUC and intel_guc* here. >> >> >>> + >>> + err_printf(m, "GuC loaded: %s\n", >>> + yesno(guc_fw->load_status == >>> + INTEL_UC_FIRMWARE_SUCCESS)); >> >> Hmm, as we do have more detailed load status, why limiting it to >> yes/no only? Will it help in the post-mortem debug? My idea was, if the fw didn't load, we can take it completely out of the picture. >> -Michal >> >>> + err_printf(m, "GuC fw version: %d.%d\n", >>> + guc_fw->major_ver_found, guc_fw->minor_ver_found); >>> + } >>> + >>> err_printf(m, "EIR: 0x%08x\n", error->eir); >>> err_printf(m, "IER: 0x%08x\n", error->ier); >>> for (i = 0; i < error->ngtier; i++) >>> -- >>> 2.11.0 >>> >>> _______________________________________________ >>> Intel-gfx mailing list >>> Intel-gfx@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
On 2/24/2017 2:49 AM, Chris Wilson wrote: > On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote: >> On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote: >>> On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote: >>>> Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com> >>>> >>>> On 2/24/2017 4:41 AM, Michel Thierry wrote: >>>> >>>> There was no way to check if the platform is running the latest firmware. >>>> >>>> Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com> >>>> Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com> >>>> Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com> >>>> --- >>>> drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ >>>> 1 file changed, 10 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >>>> index 2b1d15668192..e022187916ee 100644 >>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >>>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, >>>> CSR_VERSION_MINOR(csr->version)); >>>> } >>>> >>>> + if (HAS_GUC_UCODE(dev_priv)) { >>>> + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; >>>> + >>>> + err_printf(m, "GuC loaded: %s\n", >>>> + yesno(guc_fw->load_status == >>>> + INTEL_UC_FIRMWARE_SUCCESS)); >>>> + err_printf(m, "GuC fw version: %d.%d\n", >>>> + guc_fw->major_ver_found, guc_fw->minor_ver_found); >>>> + } >>>> + >>> >>> Hmm. The firmware may change between the hang and cat >>> /sys/class/drm/card0/error (as it will be reloaded after the reset). >> >> Btw, maybe we should add counter that will be incremented on each fw reload >> and reported here ? > > If it occurs to you that we need it for post-mortem debugging and having > it is worth more than any potential confusion.... > > I can see the need for knowing what guc/huc/dmc/etc was running at the > time of a hang - I just hope that what was previously running before an > earlier reset doesn't contribute. But that's why we focus on the first > error in a system... Can the firmware change? Last time I checked the filename was hard-coded in the driver. It's true that the load process could fail and then the information be incorrect.
On Fri, Feb 24, 2017 at 08:30:43AM -0800, Michel Thierry wrote: > On 2/24/2017 2:49 AM, Chris Wilson wrote: > >On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote: > >>On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote: > >>>On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote: > >>>> Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com> > >>>> > >>>> On 2/24/2017 4:41 AM, Michel Thierry wrote: > >>>> > >>>> There was no way to check if the platform is running the latest firmware. > >>>> > >>>> Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com> > >>>> Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com> > >>>> Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com> > >>>> --- > >>>> drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ > >>>> 1 file changed, 10 insertions(+) > >>>> > >>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > >>>> index 2b1d15668192..e022187916ee 100644 > >>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c > >>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > >>>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, > >>>> CSR_VERSION_MINOR(csr->version)); > >>>> } > >>>> > >>>> + if (HAS_GUC_UCODE(dev_priv)) { > >>>> + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; > >>>> + > >>>> + err_printf(m, "GuC loaded: %s\n", > >>>> + yesno(guc_fw->load_status == > >>>> + INTEL_UC_FIRMWARE_SUCCESS)); > >>>> + err_printf(m, "GuC fw version: %d.%d\n", > >>>> + guc_fw->major_ver_found, guc_fw->minor_ver_found); > >>>> + } > >>>> + > >>> > >>>Hmm. The firmware may change between the hang and cat > >>>/sys/class/drm/card0/error (as it will be reloaded after the reset). > >> > >>Btw, maybe we should add counter that will be incremented on each fw reload > >>and reported here ? > > > >If it occurs to you that we need it for post-mortem debugging and having > >it is worth more than any potential confusion.... > > > >I can see the need for knowing what guc/huc/dmc/etc was running at the > >time of a hang - I just hope that what was previously running before an > >earlier reset doesn't contribute. But that's why we focus on the first > >error in a system... > > Can the firmware change? > Last time I checked the filename was hard-coded in the driver. It's > true that the load process could fail and then the information be > incorrect. Assume it won't be hardcoded for ever (or at least no more than a week)... And yes, the filesystem state may have changed since the previous load. -Chris
On 2/24/2017 9:15 AM, Chris Wilson wrote: > On Fri, Feb 24, 2017 at 08:30:43AM -0800, Michel Thierry wrote: >> On 2/24/2017 2:49 AM, Chris Wilson wrote: >>> On Fri, Feb 24, 2017 at 11:43:32AM +0100, Michal Wajdeczko wrote: >>>> On Fri, Feb 24, 2017 at 09:13:29AM +0000, Chris Wilson wrote: >>>>> On Fri, Feb 24, 2017 at 09:13:05AM +0530, Kamble, Sagar A wrote: >>>>>> Reviewed-by: Sagar Arun Kamble [1]<sagar.a.kamble@intel.com> >>>>>> >>>>>> On 2/24/2017 4:41 AM, Michel Thierry wrote: >>>>>> >>>>>> There was no way to check if the platform is running the latest firmware. >>>>>> >>>>>> Cc: Tvrtko Ursulin [2]<tvrtko.ursulin@intel.com> >>>>>> Cc: Arkadiusz Hiler [3]<arkadiusz.hiler@intel.com> >>>>>> Signed-off-by: Michel Thierry [4]<michel.thierry@intel.com> >>>>>> --- >>>>>> drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ >>>>>> 1 file changed, 10 insertions(+) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c >>>>>> index 2b1d15668192..e022187916ee 100644 >>>>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c >>>>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c >>>>>> @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, >>>>>> CSR_VERSION_MINOR(csr->version)); >>>>>> } >>>>>> >>>>>> + if (HAS_GUC_UCODE(dev_priv)) { >>>>>> + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; >>>>>> + >>>>>> + err_printf(m, "GuC loaded: %s\n", >>>>>> + yesno(guc_fw->load_status == >>>>>> + INTEL_UC_FIRMWARE_SUCCESS)); >>>>>> + err_printf(m, "GuC fw version: %d.%d\n", >>>>>> + guc_fw->major_ver_found, guc_fw->minor_ver_found); >>>>>> + } >>>>>> + >>>>> >>>>> Hmm. The firmware may change between the hang and cat >>>>> /sys/class/drm/card0/error (as it will be reloaded after the reset). >>>> >>>> Btw, maybe we should add counter that will be incremented on each fw reload >>>> and reported here ? >>> >>> If it occurs to you that we need it for post-mortem debugging and having >>> it is worth more than any potential confusion.... >>> >>> I can see the need for knowing what guc/huc/dmc/etc was running at the >>> time of a hang - I just hope that what was previously running before an >>> earlier reset doesn't contribute. But that's why we focus on the first >>> error in a system... >> >> Can the firmware change? >> Last time I checked the filename was hard-coded in the driver. It's >> true that the load process could fail and then the information be >> incorrect. > > Assume it won't be hardcoded for ever (or at least no more than a week)... > And yes, the filesystem state may have changed since the previous load. ok, I'll add an i915_capture_fw_state to collect the information before the reset (for dmc/guc/huc).
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 2b1d15668192..e022187916ee 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -632,6 +632,16 @@ int i915_error_state_to_str(struct drm_i915_error_state_buf *m, CSR_VERSION_MINOR(csr->version)); } + if (HAS_GUC_UCODE(dev_priv)) { + struct intel_uc_fw *guc_fw = &dev_priv->guc.fw; + + err_printf(m, "GuC loaded: %s\n", + yesno(guc_fw->load_status == + INTEL_UC_FIRMWARE_SUCCESS)); + err_printf(m, "GuC fw version: %d.%d\n", + guc_fw->major_ver_found, guc_fw->minor_ver_found); + } + err_printf(m, "EIR: 0x%08x\n", error->eir); err_printf(m, "IER: 0x%08x\n", error->ier); for (i = 0; i < error->ngtier; i++)
There was no way to check if the platform is running the latest firmware. Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com> Signed-off-by: Michel Thierry <michel.thierry@intel.com> --- drivers/gpu/drm/i915/i915_gpu_error.c | 10 ++++++++++ 1 file changed, 10 insertions(+)