Message ID | 20230222210120.407780-1-alan.previn.teres.alexis@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/i915/gsc: Fix the Driver-FLR completion | expand |
On 2/22/2023 1:01 PM, Alan Previn wrote: > The Driver-FLR flow may inadvertently exit early before the full > completion of the re-init of the internal HW state if we only poll > GU_DEBUG Bit31 (polling for it to toggle from 0 -> 1). Instead > we need a two-step completion wait-for-completion flow that also > involves GU_CNTL. See the patch and new code comments for detail. > This is new direction from HW architecture folks. > > v2: - Add error message for the teardown timeout (Anshuman) > - Don't duplicate code in comments (Jani) LGTM, Tested-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> > Fixes: 5a44fcd73498 ("drm/i915/gsc: Do a driver-FLR on unload if GSC was loaded") > --- > drivers/gpu/drm/i915/intel_uncore.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c > index f018da7ebaac..f3c46352db89 100644 > --- a/drivers/gpu/drm/i915/intel_uncore.c > +++ b/drivers/gpu/drm/i915/intel_uncore.c > @@ -2749,14 +2749,25 @@ static void driver_initiated_flr(struct intel_uncore *uncore) > /* Trigger the actual Driver-FLR */ > intel_uncore_rmw_fw(uncore, GU_CNTL, 0, DRIVERFLR); > > + /* Wait for hardware teardown to complete */ > + ret = intel_wait_for_register_fw(uncore, GU_CNTL, > + DRIVERFLR_STATUS, 0, > + flr_timeout_ms); > + if (ret) { > + drm_err(&i915->drm, "Driver-FLR-teardown wait completion failed! %d\n", ret); > + return; > + } > + > + /* Wait for hardware/firmware re-init to complete */ > ret = intel_wait_for_register_fw(uncore, GU_DEBUG, > DRIVERFLR_STATUS, DRIVERFLR_STATUS, > flr_timeout_ms); > if (ret) { > - drm_err(&i915->drm, "wait for Driver-FLR completion failed! %d\n", ret); > + drm_err(&i915->drm, "Driver-FLR-reinit wait completion failed! %d\n", ret); > return; > } > > + /* Clear sticky completion status */ > intel_uncore_write_fw(uncore, GU_DEBUG, DRIVERFLR_STATUS); > } >
On Wed, 2023-02-22 at 13:01 -0800, Teres Alexis, Alan Previn wrote: > The Driver-FLR flow may inadvertently exit early before the full > completion of the re-init of the internal HW state if we only poll > GU_DEBUG Bit31 (polling for it to toggle from 0 -> 1). Instead > we need a two-step completion wait-for-completion flow that also > involves GU_CNTL. See the patch and new code comments for detail. > This is new direction from HW architecture folks. > > v2: - Add error message for the teardown timeout (Anshuman) > - Don't duplicate code in comments (Jani) > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> > Fixes: 5a44fcd73498 ("drm/i915/gsc: Do a driver-FLR on unload if GSC was loaded") > --- > drivers/gpu/drm/i915/intel_uncore.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c > index f018da7ebaac..f3c46352db89 100644 > --- a/drivers/gpu/drm/i915/intel_uncore.c > +++ b/drivers/gpu/drm/i915/intel_uncore.c > @@ -2749,14 +2749,25 @@ static void driver_initiated_flr(struct intel_uncore *uncore) > /* Trigger the actual Driver-FLR */ So i got offline feedback from Daniele during internal reviews before this went upstream that a runtime-pm ought to be taken, although not required functionally speaking during unload, should be there so we don't get complains from uncore when hitting up those registers. I'll recheck with Daniele. alan:snip
On 2/22/2023 1:01 PM, Alan Previn wrote: > The Driver-FLR flow may inadvertently exit early before the full > completion of the re-init of the internal HW state if we only poll > GU_DEBUG Bit31 (polling for it to toggle from 0 -> 1). Instead > we need a two-step completion wait-for-completion flow that also > involves GU_CNTL. See the patch and new code comments for detail. > This is new direction from HW architecture folks. > > v2: - Add error message for the teardown timeout (Anshuman) > - Don't duplicate code in comments (Jani) > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> > Fixes: 5a44fcd73498 ("drm/i915/gsc: Do a driver-FLR on unload if GSC was loaded") I'm not sure if we need a fixes tag, given that this is MTL specific code and that's still under force probe. > --- > drivers/gpu/drm/i915/intel_uncore.c | 13 ++++++++++++- > 1 file changed, 12 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c > index f018da7ebaac..f3c46352db89 100644 > --- a/drivers/gpu/drm/i915/intel_uncore.c > +++ b/drivers/gpu/drm/i915/intel_uncore.c > @@ -2749,14 +2749,25 @@ static void driver_initiated_flr(struct intel_uncore *uncore) > /* Trigger the actual Driver-FLR */ > intel_uncore_rmw_fw(uncore, GU_CNTL, 0, DRIVERFLR); > > + /* Wait for hardware teardown to complete */ > + ret = intel_wait_for_register_fw(uncore, GU_CNTL, > + DRIVERFLR_STATUS, 0, shouldn't this bit be DRIVERFLR instead of DRIVERFLR_STATUS ? I know they're both BIT(31), but DRIVERFLR_STATUS is the definition for the GU_DEBUG bit, while this wait is on GU_CNTL. > + flr_timeout_ms); > + if (ret) { > + drm_err(&i915->drm, "Driver-FLR-teardown wait completion failed! %d\n", ret); > + return; > + } > + > + /* Wait for hardware/firmware re-init to complete */ > ret = intel_wait_for_register_fw(uncore, GU_DEBUG, > DRIVERFLR_STATUS, DRIVERFLR_STATUS, > flr_timeout_ms); I was wondering if we could reduce the timing here to avoid 2 waits of 3 seconds, as the 3 seconds should be for the full process. However, the specs don't say how much each step can take, so I agree that to be safe is better to have both timeouts at 3 seconds. If the FLR fails the HW is toast anyway, so waiting a few seconds more to detect it on driver unload is not going to have additional consequences that we wouldn't already have. With the bit in the wait above fixed: Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Daniele > if (ret) { > - drm_err(&i915->drm, "wait for Driver-FLR completion failed! %d\n", ret); > + drm_err(&i915->drm, "Driver-FLR-reinit wait completion failed! %d\n", ret); > return; > } > > + /* Clear sticky completion status */ > intel_uncore_write_fw(uncore, GU_DEBUG, DRIVERFLR_STATUS); > } >
Thanks Daniele, you are right about the fixes tag - i totally forgot that MTL is still force-probe. Will respin with the bit definition fix, remove the fixes-tag and leave out the get/put runtime-pm from rev3 (as per your comment on rev3). Rev4 coming right up. ...alan P.S. I had the same thought process about the dual-poll-waiting timeout - but based on my discussions with the hw folk and what is the now-updated hw specs, we start to understand that depending on the SOC implementation fabric (outside the GPU IP), we will most likely have the longer polling one either one of those bits but not both. But since that's outside the GPU-IP, its just more scalable to keep both these polls. On Thu, 2023-02-23 at 15:49 -0800, Ceraolo Spurio, Daniele wrote: > > On 2/22/2023 1:01 PM, Alan Previn wrote: > > alan:snip
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index f018da7ebaac..f3c46352db89 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -2749,14 +2749,25 @@ static void driver_initiated_flr(struct intel_uncore *uncore) /* Trigger the actual Driver-FLR */ intel_uncore_rmw_fw(uncore, GU_CNTL, 0, DRIVERFLR); + /* Wait for hardware teardown to complete */ + ret = intel_wait_for_register_fw(uncore, GU_CNTL, + DRIVERFLR_STATUS, 0, + flr_timeout_ms); + if (ret) { + drm_err(&i915->drm, "Driver-FLR-teardown wait completion failed! %d\n", ret); + return; + } + + /* Wait for hardware/firmware re-init to complete */ ret = intel_wait_for_register_fw(uncore, GU_DEBUG, DRIVERFLR_STATUS, DRIVERFLR_STATUS, flr_timeout_ms); if (ret) { - drm_err(&i915->drm, "wait for Driver-FLR completion failed! %d\n", ret); + drm_err(&i915->drm, "Driver-FLR-reinit wait completion failed! %d\n", ret); return; } + /* Clear sticky completion status */ intel_uncore_write_fw(uncore, GU_DEBUG, DRIVERFLR_STATUS); }
The Driver-FLR flow may inadvertently exit early before the full completion of the re-init of the internal HW state if we only poll GU_DEBUG Bit31 (polling for it to toggle from 0 -> 1). Instead we need a two-step completion wait-for-completion flow that also involves GU_CNTL. See the patch and new code comments for detail. This is new direction from HW architecture folks. v2: - Add error message for the teardown timeout (Anshuman) - Don't duplicate code in comments (Jani) Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com> Fixes: 5a44fcd73498 ("drm/i915/gsc: Do a driver-FLR on unload if GSC was loaded") --- drivers/gpu/drm/i915/intel_uncore.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-)