Message ID | 20241030223846.2272374-1-zhanjun.dong@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] drm/i915/guc: Flush ct receive tasklet during reset preparation | expand |
Just some minor nits on header. Otherwise, LGTM: Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> On Wed, 2024-10-30 at 15:38 -0700, Zhanjun Dong wrote: > GuC to host communication is interrupt driven, the handling has 3 > parts: interrupt context, tasklet and request queue worker. > During GuC reset prepare, interrupt is disabled before destroy > contexts steps start. The IRQ and worker flushed to finish alan: "and worker are flushed to finish" > in progress message handling if there are. The tasklet flush is alan: "any outstanding in-progress message handling. But, the tasklet flush..." > missing, it might causes 2 race conditions: > 1. Tasklet runs after IRQ flushed, add request to queue after worker > flush started, causes unexpected G2H message request processing, > meanwhile, reset prepare code already get the context destroyed. > This will causes error reported about bad context state. > 2. Tasklet runs after intel_guc_submission_reset_prepare, > ct_try_receive_message start to run, while intel_uc_reset_prepare > already finished guc sanitize and set ct->enable to false. This will > causes warning on incorrect ct->enable state. > > Add the missing tasklet flush to flush all 3 parts. > > Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> > Cc: John Harrison <John.C.Harrison@Intel.com> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > --- > drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > index 9ede6f240d79..353a9167c9a4 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > @@ -1688,6 +1688,10 @@ void intel_guc_submission_reset_prepare(struct > intel_guc *guc) alan: i still feel like we should be just killing off the guc at this point (via GT_RESTT) before any of the following reset prep sequences. But as per offline conversation, we agreed that might be too intrusive a change for i915 while new design ideas are being concentrated on Xe. > spin_lock_irq(guc_to_gt(guc)->irq_lock); > spin_unlock_irq(guc_to_gt(guc)->irq_lock); > > + /* Flush tasklet */ > + tasklet_disable(&guc->ct.receive_tasklet); > + tasklet_enable(&guc->ct.receive_tasklet); > + > guc_flush_submissions(guc); > guc_flush_destroyed_contexts(guc); > flush_work(&guc->ct.requests.worker);
On 10/30/2024 3:38 PM, Zhanjun Dong wrote: > GuC to host communication is interrupt driven, the handling has 3 > parts: interrupt context, tasklet and request queue worker. > During GuC reset prepare, interrupt is disabled before destroy > contexts steps start. The IRQ and worker flushed to finish > in progress message handling if there are. The tasklet flush is > missing, it might causes 2 race conditions: > 1. Tasklet runs after IRQ flushed, add request to queue after worker > flush started, causes unexpected G2H message request processing, > meanwhile, reset prepare code already get the context destroyed. > This will causes error reported about bad context state. > 2. Tasklet runs after intel_guc_submission_reset_prepare, > ct_try_receive_message start to run, while intel_uc_reset_prepare > already finished guc sanitize and set ct->enable to false. This will > causes warning on incorrect ct->enable state. > > Add the missing tasklet flush to flush all 3 parts. > > Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> > Cc: John Harrison <John.C.Harrison@Intel.com> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > --- > drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > index 9ede6f240d79..353a9167c9a4 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > @@ -1688,6 +1688,10 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc) > spin_lock_irq(guc_to_gt(guc)->irq_lock); > spin_unlock_irq(guc_to_gt(guc)->irq_lock); > > + /* Flush tasklet */ > + tasklet_disable(&guc->ct.receive_tasklet); > + tasklet_enable(&guc->ct.receive_tasklet); > + It looks like we might have the same problem around suspend/resume, because AFAICS the tasklet is never stopped anywhere except driver unload. Maybe it's worth adding the tasklet disabling/enabling to the interrupt disabling/enabling functions, i.e. guc->interrupts.disable/enable(), so it's automatically called any time we want to disable GuC interrupts? not a blocker. Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Daniele > guc_flush_submissions(guc); > guc_flush_destroyed_contexts(guc); > flush_work(&guc->ct.requests.worker);
On 2024-11-04 6:20 p.m., Daniele Ceraolo Spurio wrote: > > > > On 10/30/2024 3:38 PM, Zhanjun Dong wrote: >> GuC to host communication is interrupt driven, the handling has 3 >> parts: interrupt context, tasklet and request queue worker. >> During GuC reset prepare, interrupt is disabled before destroy >> contexts steps start. The IRQ and worker flushed to finish >> in progress message handling if there are. The tasklet flush is >> missing, it might causes 2 race conditions: >> 1. Tasklet runs after IRQ flushed, add request to queue after worker >> flush started, causes unexpected G2H message request processing, >> meanwhile, reset prepare code already get the context destroyed. >> This will causes error reported about bad context state. >> 2. Tasklet runs after intel_guc_submission_reset_prepare, >> ct_try_receive_message start to run, while intel_uc_reset_prepare >> already finished guc sanitize and set ct->enable to false. This will >> causes warning on incorrect ct->enable state. >> >> Add the missing tasklet flush to flush all 3 parts. >> >> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> >> Cc: John Harrison <John.C.Harrison@Intel.com> >> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> >> --- >> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/ >> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c >> index 9ede6f240d79..353a9167c9a4 100644 >> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c >> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c >> @@ -1688,6 +1688,10 @@ void intel_guc_submission_reset_prepare(struct >> intel_guc *guc) >> spin_lock_irq(guc_to_gt(guc)->irq_lock); >> spin_unlock_irq(guc_to_gt(guc)->irq_lock); >> + /* Flush tasklet */ >> + tasklet_disable(&guc->ct.receive_tasklet); >> + tasklet_enable(&guc->ct.receive_tasklet); >> + > > It looks like we might have the same problem around suspend/resume, > because AFAICS the tasklet is never stopped anywhere except driver > unload. Maybe it's worth adding the tasklet disabling/enabling to the > interrupt disabling/enabling functions, i.e. guc->interrupts.disable/ > enable(), so it's automatically called any time we want to disable GuC > interrupts? not a blocker. > > Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > > Daniele > Thanks Daniele for review. I like the idea to put tasklet disabling/enabling to the > interrupt disabling/enabling functions. Let me do some investigation on suspend/resume workflow and run some test first. It might take some time. This patch might fix multiple issues, I would like to get it merged after we got positive CI.Full result. Regards, Zhanjun Dong >> guc_flush_submissions(guc); >> guc_flush_destroyed_contexts(guc); >> flush_work(&guc->ct.requests.worker); >
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 9ede6f240d79..353a9167c9a4 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1688,6 +1688,10 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc) spin_lock_irq(guc_to_gt(guc)->irq_lock); spin_unlock_irq(guc_to_gt(guc)->irq_lock); + /* Flush tasklet */ + tasklet_disable(&guc->ct.receive_tasklet); + tasklet_enable(&guc->ct.receive_tasklet); + guc_flush_submissions(guc); guc_flush_destroyed_contexts(guc); flush_work(&guc->ct.requests.worker);
GuC to host communication is interrupt driven, the handling has 3 parts: interrupt context, tasklet and request queue worker. During GuC reset prepare, interrupt is disabled before destroy contexts steps start. The IRQ and worker flushed to finish in progress message handling if there are. The tasklet flush is missing, it might causes 2 race conditions: 1. Tasklet runs after IRQ flushed, add request to queue after worker flush started, causes unexpected G2H message request processing, meanwhile, reset prepare code already get the context destroyed. This will causes error reported about bad context state. 2. Tasklet runs after intel_guc_submission_reset_prepare, ct_try_receive_message start to run, while intel_uc_reset_prepare already finished guc sanitize and set ct->enable to false. This will causes warning on incorrect ct->enable state. Add the missing tasklet flush to flush all 3 parts. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ 1 file changed, 4 insertions(+)