Message ID | 20240110210216.4125092-1-John.C.Harrison@Intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/i915/gt: Restart the heartbeat timer when forcing a pulse | expand |
Hi John, On Wednesday, 10 January 2024 22:02:16 CET John.C.Harrison@Intel.com wrote: > From: John Harrison <John.C.Harrison@Intel.com> > > The context persistence code does things like send super high priority > heartbeat pulses to ensure any leaked context can still be pre-empted > and thus isn't a total denial of service but only a minor denial of > service. Unfortunately, it wasn't bothering to restart the heatbeat > worker with a fresh timeout. Thus, if a persistent context happened to > be closed just before the heartbeat was going to go ping anyway then > the forced pulse would get a negligble execution time. And as the > forced pulse is super high priority, the worker thread's next step is > a reset. Which means a potentially innocent system randomly goes boom > when attempting to close a context. So, force a re-schedule of the > worker thread with the appropriate timeout. I haven't looked too much in heartbeat pulses code before, but I think I can understand your change. I've also got a positive opinion from Chris on it. I can provide my Ack, assuming the pre-merge failure reported by CI is not related, but could you please comment that failure first and/or ask BUG Filing to handle it so we have it cleaned up? Thanks, Janusz > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com> > --- > drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c > index 1a8e2b7db0138..4ae2fa0b61dd4 100644 > --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c > +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c > @@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine) > heartbeat_commit(rq, &attr); > GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER); > > + /* Ensure the forced pulse gets a full period to execute */ > + next_heartbeat(engine); > + > return 0; > } > >
On 1/31/2024 10:48, Janusz Krzysztofik wrote: > Hi John, > > On Wednesday, 10 January 2024 22:02:16 CET John.C.Harrison@Intel.com wrote: >> From: John Harrison <John.C.Harrison@Intel.com> >> >> The context persistence code does things like send super high priority >> heartbeat pulses to ensure any leaked context can still be pre-empted >> and thus isn't a total denial of service but only a minor denial of >> service. Unfortunately, it wasn't bothering to restart the heatbeat >> worker with a fresh timeout. Thus, if a persistent context happened to >> be closed just before the heartbeat was going to go ping anyway then >> the forced pulse would get a negligble execution time. And as the >> forced pulse is super high priority, the worker thread's next step is >> a reset. Which means a potentially innocent system randomly goes boom >> when attempting to close a context. So, force a re-schedule of the >> worker thread with the appropriate timeout. > I haven't looked too much in heartbeat pulses code before, but I think I can > understand your change. I've also got a positive opinion from Chris on it. > I can provide my Ack, assuming the pre-merge failure reported by CI is not > related, but could you please comment that failure first and/or ask BUG Filing > to handle it so we have it cleaned up? Pretty confident the CI failure is unrelated. Not seeing how a change to the heartbeat timing of persistence context clean up could cause a HDMI test to fail to complete. However, I was really hoping for a full code review by someone who understands this code and would be able to say whether there could be unexpected side effects of the change. Or even if there is a better / safer way to fix the problem. @Andi Shyti, you were fingered as being someone who might have such knowledge. Can you comment? Thanks, John. > Thanks, > Janusz > > >> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> >> --- >> drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c >> index 1a8e2b7db0138..4ae2fa0b61dd4 100644 >> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c >> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c >> @@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine) >> heartbeat_commit(rq, &attr); >> GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER); >> >> + /* Ensure the forced pulse gets a full period to execute */ >> + next_heartbeat(engine); >> + >> return 0; >> } >> >> > > >
Hi John, On Wed, Jan 10, 2024 at 01:02:16PM -0800, John.C.Harrison@Intel.com wrote: > From: John Harrison <John.C.Harrison@Intel.com> > > The context persistence code does things like send super high priority > heartbeat pulses to ensure any leaked context can still be pre-empted > and thus isn't a total denial of service but only a minor denial of > service. Unfortunately, it wasn't bothering to restart the heatbeat /heatbeat/heartbeat/ > worker with a fresh timeout. Thus, if a persistent context happened to > be closed just before the heartbeat was going to go ping anyway then > the forced pulse would get a negligble execution time. And as the > forced pulse is super high priority, the worker thread's next step is > a reset. Which means a potentially innocent system randomly goes boom > when attempting to close a context. So, force a re-schedule of the > worker thread with the appropriate timeout. > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com> > --- > drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c > index 1a8e2b7db0138..4ae2fa0b61dd4 100644 > --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c > +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c > @@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine) > heartbeat_commit(rq, &attr); > GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER); > > + /* Ensure the forced pulse gets a full period to execute */ > + next_heartbeat(engine); I think it makes sense to have this extra heardbeat here and, as I've been mulling over it, I don't any harm. The failure doesn't look related, either. Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Thanks, Andi > return 0; > } > > -- > 2.43.0
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c index 1a8e2b7db0138..4ae2fa0b61dd4 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c @@ -290,6 +290,9 @@ static int __intel_engine_pulse(struct intel_engine_cs *engine) heartbeat_commit(rq, &attr); GEM_BUG_ON(rq->sched.attr.priority < I915_PRIORITY_BARRIER); + /* Ensure the forced pulse gets a full period to execute */ + next_heartbeat(engine); + return 0; }