[11/20] drm/i915: Fake lost context event interrupts through forced CSB checking.

From: Tomas Elf <tomas.elf@intel.com>

*** General ***
A recurring issue during long-duration operations testing of concurrent
rendering tasks with intermittent hangs is that context completion interrupts
following engine resets are sometimes lost. This is a real problem: after a
per-engine hang recovery the hardware may complete a previously hung context
and then go idle without ever sending the interrupt that tells the driver
about it. The driver is then stuck waiting for context completion, believing
that the context is still active, while the hardware sits idle waiting for
more work. The periodic hang checker detects both hangs caused by stuck
workloads in the GPU and software hangs caused by these inconsistencies. The
difference lies in how the two types of hangs are handled.

*** Rectification ***
The way hangs caused by inconsistent context submission states are resolved is
by checking the context submission state consistency as a pre-stage to the
engine recovery path. If the state is not consistent at that point then the
normal form of engine recovery is not attempted. Instead, an attempt to rectify
the inconsistency is made by faking the presumed lost context event interrupt -
or more specifically by calling the context event interrupt handler manually
from the hang recovery path outside of the normal interrupt execution context.
This works because the hardware always updates the CSB (context status buffer)
during context state transitions, whether or not an IRQ goes missing. In the
case of a missing IRQ there are therefore outstanding CSB events waiting to be
processed, one of which might be the context completion event belonging to the
context currently blocking the work submission flow in one of the execlist
queues. The faked context event interrupt ends up in the interrupt handler,
which processes the outstanding events and purges the stuck context.

If this rectification attempt fails (because there are no outstanding CSB
events, or at least none that could account for the inconsistency) then the
engine recovery fails and the error handler falls back to legacy full GPU
reset mode. Assuming the full GPU reset is successful, this form of recovery
always restores consistency: the GPU is reset and forced into an idle state
and all pending driver work is discarded, so the driver state matches the idle
GPU hardware state.

If the rectification attempt succeeds, meaning that unprocessed CSB events were
found and acted upon, which led to old contexts being purged from the execlist
queue and new work being submitted to hardware, then the inconsistency
rectification is considered to have resolved the detected hang that brought on
the hang recovery. The engine recovery therefore ends early at that point: no
further attempts at resolving the hang are made and the hang detection is
cleared, allowing the driver to resume executing.
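
The three outcomes described above can be sketched as a small user-space
model. This is only an illustration under stated assumptions, not the code in
this patch: struct engine_model, fake_context_event_irq(), recover_engine(),
the queue and CSB sizes and the use of context ID 0 for idle hardware are all
hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_LEN   4	/* assumed sizes, for this model only */
#define CSB_ENTRIES 6

struct csb_event {
	uint32_t ctx_id;	/* context the event refers to */
	bool completed;		/* true for a context completion event */
};

/* Hypothetical per-engine state; not the driver's real structures. */
struct engine_model {
	uint32_t queue[QUEUE_LEN];	/* execlist queue of context IDs */
	size_t queue_head, queue_tail;	/* head == tail means empty */
	uint32_t hw_ctx_id;		/* ID the hardware reports as running */
	struct csb_event csb[CSB_ENTRIES];
	size_t csb_read, csb_write;	/* unprocessed events: read..write */
};

enum recovery_action {
	ACTION_ENGINE_RESET,	/* state consistent: normal engine recovery */
	ACTION_RESUME,		/* faked interrupt purged the stuck context */
	ACTION_FULL_GPU_RESET,	/* rectification failed: legacy fallback */
};

static bool state_consistent(const struct engine_model *e)
{
	if (e->queue_head == e->queue_tail)
		return true;	/* assumption: nothing queued, nothing to compare */
	return e->queue[e->queue_head] == e->hw_ctx_id;
}

/* Process outstanding CSB events as the interrupt handler would. */
static unsigned int fake_context_event_irq(struct engine_model *e)
{
	unsigned int purged = 0;

	while (e->csb_read != e->csb_write) {
		const struct csb_event *ev = &e->csb[e->csb_read % CSB_ENTRIES];

		e->csb_read++;
		if (ev->completed && e->queue_head != e->queue_tail &&
		    e->queue[e->queue_head] == ev->ctx_id) {
			e->queue_head++;	/* purge the completed context */
			purged++;
		}
	}
	/* Model "new work being submitted": the next request starts running. */
	if (purged && e->queue_head != e->queue_tail)
		e->hw_ctx_id = e->queue[e->queue_head];
	return purged;
}

static enum recovery_action recover_engine(struct engine_model *e)
{
	assert(e != NULL);
	if (state_consistent(e))
		return ACTION_ENGINE_RESET;
	if (fake_context_event_irq(e) > 0)
		return ACTION_RESUME;
	return ACTION_FULL_GPU_RESET;
}
```

In this model the decision to end recovery early depends only on whether the
faked interrupt purged at least one context; the real driver may apply
stricter checks before clearing the hang.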

*** Detection ***
In principle a context submission status inconsistency is detected by comparing
the ID of the context in the head request of an execlist queue with the context
ID currently in the EXECLIST_STATUS register of the same engine (the latter
denoting the ID of the context currently running on the hardware). If the two
do not match it is assumed that an interrupt was missed and that the driver is
now stuck in an inconsistent state. Of course, the driver and hardware can
go in and out of consistency momentarily many times per second as contexts
start and complete in the driver independently from the actual GPU
hardware. An inconsistency detection can therefore only be trusted once the
detected state is known to be stable: either the periodic hang checker has
observed sustained, initial signs of a hang, or the hang recovery path has
been entered, at which point it has already been decided that execution is
hung and that the driver is stable in that state.
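
A minimal sketch of the detection rule, assuming a simplified snapshot of one
engine. struct engine_snapshot, its fields, both helper functions and the
HANG_STABLE_THRESHOLD value are hypothetical and only model the comparison
described above; they are not the driver's actual interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed number of hang checker observations that counts as "sustained". */
#define HANG_STABLE_THRESHOLD 2u

/* Hypothetical snapshot of one engine, taken by the hang checker. */
struct engine_snapshot {
	bool queue_empty;		/* execlist queue holds no requests */
	uint32_t head_ctx_id;		/* context ID of the head request */
	uint32_t hw_ctx_id;		/* context ID from EXECLIST_STATUS */
	unsigned int hangcheck_score;	/* sustained signs of a hang so far */
};

/* Raw comparison: may be momentarily true during normal operation. */
static bool ctx_ids_mismatch(const struct engine_snapshot *s)
{
	assert(s != NULL);
	if (s->queue_empty)
		return false;	/* assumption: nothing queued, nothing to compare */
	return s->head_ctx_id != s->hw_ctx_id;
}

/*
 * A mismatch is only trusted once the state is known to be stable, i.e.
 * the hang checker has seen sustained signs of a hang or hang recovery
 * has already started.
 */
static bool submission_state_inconsistent(const struct engine_snapshot *s,
					   bool in_recovery)
{
	bool stable = in_recovery ||
		      s->hangcheck_score >= HANG_STABLE_THRESHOLD;

	return stable && ctx_ids_mismatch(s);
}
```

Separating the raw comparison from the stability check keeps transient
mismatches, which occur many times per second, from being reported as lost
interrupts.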

*** WARNING ***
In time-constrained scenarios waiting until the onset of hang recovery before
detecting and potentially rectifying context submission state inconsistencies
might cause problematic side-effects. For example, in Android the
SurfaceFlinger/HWC compositor has a hard time limit of 3 seconds after which
any unresolved hangs might cause display freezes (due to dropped display flip
requests), which can only be resolved by a reboot. If hang detection and hang
recovery take upwards of 3 seconds then there is a distinct risk that handling
inconsistencies this late might cause issues. Whether or not this will become a
problem remains to be shown in practice. So far no issues have been spotted in
other environments such as X but it is worth being aware of.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c      | 182 +++++++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_irq.c      |  24 +----
 drivers/gpu/drm/i915/intel_lrc.c     |  83 +++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h     |   2 +-
 drivers/gpu/drm/i915/intel_lrc_tdr.h |   3 +
 5 files changed, 228 insertions(+), 66 deletions(-)
