[v3,0/6] Default request/fence expiry + watchdog

Message ID: 20210318170419.2107512-1-tvrtko.ursulin@linux.intel.com

Message

Tvrtko Ursulin March 18, 2021, 5:04 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

"Watchdog" aka "restoring hangcheck" aka default request/fence expiry - second
post of a somewhat controversial feature, now upgraded to patch status.

I put "watchdog" in quotes because, in the classical sense, a watchdog
would allow userspace to ping it and so remain alive.

I quote "restoring hangcheck" because this series, contrary to the old
hangcheck, is not looking at whether the workload is making any progress from
the kernel side either. (Although disclaimer my memory may be leaky - Daniel
suspects old hangcheck had some stricter, more indiscriminatory, angles to it.
But apart from being prone to both false negatives and false positives I can't
remember that myself.)

Short version - the ask is to fail any user submission after a set time
period. In this version that time is twenty seconds (bumped from the
initial twelve; see the v3 changelog below).

Time counts from the moment a user submission becomes "runnable"
(implicit and explicit dependencies have been cleared) and keeps
counting regardless of any GPU contention caused by other users of the
system.

So semantics are really a bit weak, but again, I understand this is really
really wanted by the DRM core even if I am not convinced it is a good idea.
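
To make the timing semantics above concrete, here is a minimal sketch of
the idea (not the actual patches - the rq->watchdog.timer field and the
i915_request_cancel() signature are assumptions, loosely based on the
patch titles below; the real code would also likely defer the heavy
lifting from timer context to a worker):

  /* Sketch only; would live alongside i915_request.c (linux/timer.h). */

  static void watchdog_expired(struct timer_list *t)
  {
          struct i915_request *rq = from_timer(rq, t, watchdog.timer);

          /* Deadline passed - fail the submission with -ETIME. */
          i915_request_cancel(rq, -ETIME);
  }

  static void watchdog_arm(struct i915_request *rq, unsigned int timeout_ms)
  {
          /* The clock starts only once the request is "runnable". */
          timer_setup(&rq->watchdog.timer, watchdog_expired, 0);
          mod_timer(&rq->watchdog.timer,
                    jiffies + msecs_to_jiffies(timeout_ms));
  }

  static void watchdog_cancel(struct i915_request *rq)
  {
          /* Completed in time - disarm. */
          del_timer_sync(&rq->watchdog.timer);
  }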

There are some dangers with doing this - text borrowed from a patch in the
series:

  This can have the effect that workloads which used to work fine will
  suddenly start failing. Even workloads comprised of short batches, but
  sitting in long dependency chains, can be terminated.

  And because of the lack of agreement on the usefulness and safety of
  fence error propagation, this partial execution can be invisible to
  userspace even if it is "listening" to the returned fence status.

  Another interaction is with the hangcheck, where care needs to be
  taken that the timeout is not set lower than, or close to, three times
  the heartbeat interval. Otherwise a hang in any application can cause
  complete termination of all submissions from unrelated clients. Any
  users modifying the per engine heartbeat intervals therefore need to
  be aware of this potential denial of service, to avoid inadvertently
  enabling it.

  Given all this I am personally not convinced the scheme is a good
  idea. Intuitively it feels like object importers would be better
  positioned to enforce the time they are willing to wait for something
  to complete.
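
(For concreteness: assuming the default 2500 ms heartbeat interval,
three times that is 7.5 s, so the 20 s default expiry keeps a
comfortable margin; lowering the expiry below roughly 7.5 s, or raising
the heartbeat intervals accordingly, would re-open the above denial of
service window.)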

v2:
 * Dropped context param.
 * Improved commit messages and Kconfig text.

v3:
 * Log timeouts.
 * Bump timeout to 20s to see if it helps Tigerlake.
 * Fix sentinel assert.

Test-with: 20210318162400.2065097-1-tvrtko.ursulin@linux.intel.com
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

Chris Wilson (1):
  drm/i915: Individual request cancellation

Tvrtko Ursulin (5):
  drm/i915: Restrict sentinel requests further
  drm/i915: Handle async cancellation in sentinel assert
  drm/i915: Request watchdog infrastructure
  drm/i915: Fail too long user submissions by default
  drm/i915: Allow configuring default request expiry via modparam
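
The last patch exposes the expiry as a module parameter, roughly along
these lines (a sketch only - the exact parameter name, default and
permissions are not lifted from the patch and may differ):

  static unsigned int request_timeout_ms = 20000; /* 20s default expiry */
  module_param(request_timeout_ms, uint, 0600);
  MODULE_PARM_DESC(request_timeout_ms,
                   "Default request/fence expiry in milliseconds (0 = never)");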

 drivers/gpu/drm/i915/Kconfig.profile          |  14 ++
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  39 ++++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |   4 +
 drivers/gpu/drm/i915/gt/intel_context_param.h |  11 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 .../drm/i915/gt/intel_execlists_submission.c  |  18 +-
 .../drm/i915/gt/intel_execlists_submission.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
 drivers/gpu/drm/i915/gt/intel_gt.h            |   2 +
 drivers/gpu/drm/i915/gt/intel_gt_requests.c   |  26 +++
 drivers/gpu/drm/i915/gt/intel_gt_types.h      |   7 +
 drivers/gpu/drm/i915/i915_params.c            |   5 +
 drivers/gpu/drm/i915/i915_params.h            |   1 +
 drivers/gpu/drm/i915/i915_request.c           | 108 +++++++++-
 drivers/gpu/drm/i915/i915_request.h           |  12 +-
 drivers/gpu/drm/i915/selftests/i915_request.c | 201 ++++++++++++++++++
 17 files changed, 450 insertions(+), 8 deletions(-)

Comments

Tvrtko Ursulin March 22, 2021, 1:37 p.m. UTC | #1
On 19/03/2021 01:17, Patchwork wrote:

Okay, with the 20s default expiration the hangcheck tests on Tigerlake
pass and we are left with these failures:

>       IGT changes
> 
> 
>         Possible regressions
> 
>   *
> 
>     igt@gem_ctx_ringsize@idle@bcs0:
> 
>       o shard-skl: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9870/shard-skl10/igt@gem_ctx_ringsize@idle@bcs0.html>
>         -> INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19806/shard-skl7/igt@gem_ctx_ringsize@idle@bcs0.html>

Too many runnable requests on a slow Skylake SKU with command parsing
active - too many to finish within the 20s default expiration, that is.
This is actually the same root cause as the tests below try to
explicitly demonstrate:

>   *
> 
>     {igt@gem_watchdog@far-fence@bcs0} (NEW):
> 
>       o shard-glk: NOTRUN -> FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19806/shard-glk7/igt@gem_watchdog@far-fence@bcs0.html>
>   *
> 
>     {igt@gem_watchdog@far-fence@vcs0} (NEW):
> 
>       o shard-apl: NOTRUN -> FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19806/shard-apl1/igt@gem_watchdog@far-fence@vcs0.html>
>         +2 similar issues
>   *
> 
>     {igt@gem_watchdog@far-fence@vecs0} (NEW):
> 
>       o shard-kbl: NOTRUN -> FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_19806/shard-kbl7/igt@gem_watchdog@far-fence@vecs0.html>
>         +2 similar issues

The vulnerability which default expiration adds, compared to the current
state, applies to heavily loaded systems where the GPU is shared between
multiple clients.

Otherwise the series seems to work and failing tests can be blacklisted
going forward. Ack to merge; the merging itself, after review, I leave
to the maintainers since personally I am not supportive of this
mechanism.

Regards,

Tvrtko
Daniel Vetter March 22, 2021, 1:41 p.m. UTC | #2
On Mon, Mar 22, 2021 at 01:37:58PM +0000, Tvrtko Ursulin wrote:
> [...]
>
> Otherwise the series seems to work and failing tests can be blacklisted
> going forward. Ack to merge; the merging itself, after review, I leave
> to the maintainers since personally I am not supportive of this
> mechanism.

Yeah, I think we have some leftovers to look at on the igt side after
this has landed: with 20s we're rather long on the timeout side, and
some of the tests need to be resurrected with the preempt-ctx execbuf
mode.
-Daniel