Message ID | 20230223172120.3304293-2-daniele.ceraolospurio@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fix a couple of issues with the GSC worker timing | expand |
On Thu, 2023-02-23 at 09:21 -0800, Ceraolo Spurio, Daniele wrote: > If we unload the driver and wedge before the GSC worker is complete, > the worker will hit an error on its submission to the GSC engine and > then exit. This is hard to hit for a user, but it is reproducible > with skipping selftests. The error is handled gracefully by the > worker, so there are no functional issues, but we still end up with > an error message in dmesg, which is something we want to avoid as > this is a supported scenario. We could modify the worker to better > handle a wedging occurring during its execution, but that gets > complicated for a couple of reasons: > - We do want the error on runtime wedging, because there are > implications for subsystems outside of GT (i.e., PXP, HDCP), it's > only the error on driver unload that we want to silence. > - The worker is responsible for multiple submissions (GSC FW load, > HuC auth, SW proxy), so all of those will have to be adapted to > handle the wedged_on_fini scenario. > Therefore, it's much simpler to just wait for the worker to be done > before wedging on driver removal, also considering that the worker > will likely already be idle in the great majority of non-selftest > scenarios. > > Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > alan:snip > > --- a/drivers/gpu/drm/i915/gt/intel_gt.c > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c > @@ -784,6 +784,29 @@ void intel_gt_driver_unregister(struct intel_gt *gt) > intel_rps_driver_unregister(>->rps); > intel_gsc_fini(>->gsc); > > + /* > + * If we unload the driver and wedge before the GSC worker is complete, alan:snip > + * scenarios. > + */ > + intel_gsc_uc_flush_work(>->uc.gsc); alan: nit: from a code readiblity, having intel_gsc_fini before intel_gsc_uc_flush_work almost reads as if code should have been inverted eventhough we know that the former is for the old mei-comp based framework and the latter is based on the new gsc-cs based framework. i cant think of how else to resolve this other unintrusively other than adding a comment. Other than that, also considering we've had offline testing already verify this and the next patch too, LGTM: Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index f7f271708fc7..89ccb95e146c 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -784,6 +784,29 @@ void intel_gt_driver_unregister(struct intel_gt *gt) intel_rps_driver_unregister(>->rps); intel_gsc_fini(>->gsc); + /* + * If we unload the driver and wedge before the GSC worker is complete, + * the worker will hit an error on its submission to the GSC engine and + * then exit. This is hard to hit for a user, but it is reproducible + * with skipping selftests. The error is handled gracefully by the + * worker, so there are no functional issues, but we still end up with + * an error message in dmesg, which is something we want to avoid as + * this is a supported scenario. We could modify the worker to better + * handle a wedging occurring during its execution, but that gets + * complicated for a couple of reasons: + * - We do want the error on runtime wedging, because there are + * implications for subsystems outside of GT (i.e., PXP, HDCP), it's + * only the error on driver unload that we want to silence. + * - The worker is responsible for multiple submissions (GSC FW load, + * HuC auth, SW proxy), so all of those will have to be adapted to + * handle the wedged_on_fini scenario. + * Therefore, it's much simpler to just wait for the worker to be done + * before wedging on driver removal, also considering that the worker + * will likely already be idle in the great majority of non-selftest + * scenarios. + */ + intel_gsc_uc_flush_work(>->uc.gsc); + /* * Upon unregistering the device to prevent any new users, cancel * all in-flight requests so that we can quickly unbind the active diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c index 8afd42cbded9..92e1571fdc46 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c @@ -116,7 +116,7 @@ void intel_gsc_uc_fini(struct intel_gsc_uc *gsc) intel_uc_fw_fini(&gsc->fw); } -void intel_gsc_uc_suspend(struct intel_gsc_uc *gsc) +void intel_gsc_uc_flush_work(struct intel_gsc_uc *gsc) { if (!intel_uc_fw_is_loadable(&gsc->fw)) return; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h index 03fd0a8e8db1..c8b025343ea6 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h @@ -26,6 +26,7 @@ void intel_gsc_uc_init_early(struct intel_gsc_uc *gsc); int intel_gsc_uc_init(struct intel_gsc_uc *gsc); void intel_gsc_uc_fini(struct intel_gsc_uc *gsc); void intel_gsc_uc_suspend(struct intel_gsc_uc *gsc); +void intel_gsc_uc_flush_work(struct intel_gsc_uc *gsc); void intel_gsc_uc_load_start(struct intel_gsc_uc *gsc); static inline bool intel_gsc_uc_is_supported(struct intel_gsc_uc *gsc) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c index 6648691bd645..5fa5c0999212 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c @@ -672,7 +672,7 @@ void intel_uc_suspend(struct intel_uc *uc) int err; /* flush the GSC worker */ - intel_gsc_uc_suspend(&uc->gsc); + intel_gsc_uc_flush_work(&uc->gsc); if (!intel_guc_is_ready(guc)) { guc->interrupts.enabled = false;
If we unload the driver and wedge before the GSC worker is complete, the worker will hit an error on its submission to the GSC engine and then exit. This is hard to hit for a user, but it is reproducible with skipping selftests. The error is handled gracefully by the worker, so there are no functional issues, but we still end up with an error message in dmesg, which is something we want to avoid as this is a supported scenario. We could modify the worker to better handle a wedging occurring during its execution, but that gets complicated for a couple of reasons: - We do want the error on runtime wedging, because there are implications for subsystems outside of GT (i.e., PXP, HDCP), it's only the error on driver unload that we want to silence. - The worker is responsible for multiple submissions (GSC FW load, HuC auth, SW proxy), so all of those will have to be adapted to handle the wedged_on_fini scenario. Therefore, it's much simpler to just wait for the worker to be done before wedging on driver removal, also considering that the worker will likely already be idle in the great majority of non-selftest scenarios. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> --- drivers/gpu/drm/i915/gt/intel_gt.c | 23 +++++++++++++++++++++++ drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h | 1 + drivers/gpu/drm/i915/gt/uc/intel_uc.c | 2 +- 4 files changed, 26 insertions(+), 2 deletions(-)