[00/21] Clean up GuC CI failures, simplify locking, and kernel DOC

Message ID 20210815201559.1150-1-matthew.brost@intel.com (mailing list archive)

Message

Matthew Brost Aug. 15, 2021, 8:15 p.m. UTC
Daniel Vetter pointed out that locking in the GuC submission code was
overly complicated. Let's clean this up a bit before introducing more
features in the GuC submission backend.

Also fix some CI failures, port fixes from our internal tree, and add a
few more selftests for coverage.

Lastly, add some kernel DOC explaining how the GuC submission backend
works.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Matthew Brost (21):
  drm/i915/guc: Fix blocked context accounting
  drm/i915/guc: outstanding G2H accounting
  drm/i915/guc: Unwind context requests in reverse order
  drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
  drm/i915/guc: Workaround reset G2H is received after schedule done G2H
  drm/i915/selftests: Add a cancel request selftest that triggers a
    reset
  drm/i915/guc: Don't enable scheduling on a banned context, guc_id
    invalid, not registered
  drm/i915/selftests: Fix memory corruption in live_lrc_isolation
  drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
  drm/i915/guc: Take context ref when cancelling request
  drm/i915/guc: Don't touch guc_state.sched_state without a lock
  drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
  drm/i915: Allocate error capture in atomic context
  drm/i915/guc: Flush G2H work queue during reset
  drm/i915/guc: Release submit fence from an IRQ
  drm/i915/guc: Move guc_blocked fence to struct guc_state
  drm/i915/guc: Rework and simplify locking
  drm/i915/guc: Proper xarray usage for contexts_lookup
  drm/i915/guc: Drop pin count check trick between sched_disable and
    re-pin
  drm/i915/guc: Move GuC priority fields in context under guc_active
  drm/i915/guc: Add GuC kernel doc

 drivers/gpu/drm/i915/gt/intel_context.c       |   5 +-
 drivers/gpu/drm/i915/gt/intel_context_types.h |  68 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c        |  29 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  19 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 688 +++++++++++-------
 drivers/gpu/drm/i915/gt/uc/selftest_guc.c     | 126 ++++
 drivers/gpu/drm/i915/i915_gpu_error.c         |  37 +-
 drivers/gpu/drm/i915/i915_request.h           |  23 +-
 drivers/gpu/drm/i915/i915_trace.h             |   8 +-
 .../drm/i915/selftests/i915_live_selftests.h  |   1 +
 drivers/gpu/drm/i915/selftests/i915_request.c |  94 +++
 .../i915/selftests/intel_scheduler_helpers.c  |  12 +
 .../i915/selftests/intel_scheduler_helpers.h  |   2 +
 13 files changed, 805 insertions(+), 307 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/uc/selftest_guc.c

Comments

Matthew Brost Aug. 15, 2021, 9:54 p.m. UTC | #1
On Sun, Aug 15, 2021 at 09:15:31PM +0000, Patchwork wrote:
> Patch Details
> 
> Series:  Clean up GuC CI failures, simplify locking, and kernel DOC
> URL:     https://patchwork.freedesktop.org/series/93704/
> State:   failure
> Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20826/index.html
> 
> CI Bug Log - changes from CI_DRM_10484 -> Patchwork_20826
> 
> Summary
> 
> FAILURE
> 
> Serious unknown changes coming with Patchwork_20826 absolutely need to be
> verified manually.
> 
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_20826, please notify your bug team to allow them
> to document this new failure mode, which will reduce false positives in CI.
> 
> External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_20826/index.html
> 
> Possible new issues
> 
> Here are the unknown changes that may have been introduced in Patchwork_20826:
> 
> IGT changes
> 
> Possible regressions
> 
>   • igt@i915_selftest@live@requests:
> 
>       □ fi-cfl-guc: PASS -> DMESG-FAIL
> 

New selftest, __cancel_reset, is exposing a bug (or at minimum different
behavior) in the execlists implementation for canceling requests. Will
fix in new rev.

Matt

>       □ fi-kbl-soraka: PASS -> DMESG-FAIL
> 
>       □ fi-bxt-dsi: PASS -> DMESG-FAIL
> 
>       □ fi-tgl-1115g4: PASS -> DMESG-FAIL
> 
>       □ fi-cml-u2: PASS -> DMESG-FAIL
> 
>       □ fi-kbl-8809g: PASS -> DMESG-FAIL
> 
>       □ fi-cfl-8700k: PASS -> DMESG-FAIL
> 
>       □ fi-cfl-8109u: PASS -> DMESG-FAIL
> 
>       □ fi-icl-u2: PASS -> DMESG-FAIL
> 
>       □ fi-kbl-7500u: PASS -> DMESG-FAIL
> 
>       □ fi-bsw-nick: PASS -> DMESG-FAIL
> 
>       □ fi-icl-y: PASS -> DMESG-FAIL
> 
>       □ fi-kbl-guc: PASS -> DMESG-FAIL
> 
>       □ fi-kbl-7567u: PASS -> DMESG-FAIL
> 
>       □ fi-skl-guc: PASS -> DMESG-FAIL
> 
>       □ fi-bdw-5557u: PASS -> DMESG-FAIL
> 
>       □ fi-glk-dsi: PASS -> DMESG-FAIL
> 
>       □ fi-bsw-kefka: PASS -> DMESG-FAIL
> 
>       □ fi-skl-6700k2: PASS -> DMESG-FAIL
> 
> Warnings
> 
>   • igt@i915_selftest@live@workarounds:
>       □ fi-rkl-guc: DMESG-FAIL (i915#3928) -> INCOMPLETE
> 
> Suppressed
> 
> The following results come from untrusted machines, tests, or statuses.
> They do not affect the overall result.
> 
>   • igt@i915_selftest@live@requests:
> 
>       □ {fi-tgl-dsi}: PASS -> DMESG-FAIL
> 
>       □ {fi-jsl-1}: PASS -> DMESG-FAIL
> 
>       □ {fi-ehl-2}: PASS -> DMESG-FAIL
> 
> New tests
> 
> New tests have been introduced between CI_DRM_10484 and Patchwork_20826:
> 
> New IGT tests (1)
> 
>   • igt@i915_selftest@live@guc:
>       □ Statuses : 30 pass(s)
>       □ Exec time: [0.42, 5.06] s
> 
> Known issues
> 
> Here are the changes found in Patchwork_20826 that come from known issues:
> 
> IGT changes
> 
> Issues hit
> 
>   • igt@i915_module_load@reload:
> 
>       □ fi-kbl-soraka: PASS -> DMESG-WARN (i915#1982)
>   • igt@i915_selftest@live@execlists:
> 
>       □ fi-icl-y: PASS -> DMESG-FAIL (i915#1993)
> 
> Possible fixes
> 
>   • igt@i915_module_load@reload:
>       □ {fi-tgl-dsi}: DMESG-WARN (i915#1982 / k.org#205379) -> PASS
> 
> {name}: This element is suppressed. This means it is ignored when computing
> the status of the difference (SUCCESS, WARNING, or FAILURE).
> 
> Participating hosts (37 -> 34)
> 
> Missing (3): fi-bdw-samus fi-bsw-cyan bat-jsl-1
> 
> Build changes
> 
>   • Linux: CI_DRM_10484 -> Patchwork_20826
> 
> CI-20190529: 20190529
> CI_DRM_10484: 7de02d5cb1f35bd3f068237444063844dea47ddc @ git://anongit.freedesktop.org/gfx-ci/linux
> IGT_6175: c91f99c74b966f635d7e2eb898bf0f78383d281b @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
> Patchwork_20826: f7ff315bfe3a76713c1f0a16cd92b0908d28e4c6 @ git://anongit.freedesktop.org/gfx-ci/linux
> 
> == Linux commits ==
> 
> f7ff315bfe3a drm/i915/guc: Add GuC kernel doc
> af14e3698d19 drm/i915/guc: Move GuC priority fields in context under guc_active
> eb8a352e7c1f drm/i915/guc: Drop pin count check trick between sched_disable and re-pin
> 7057a0daff8c drm/i915/guc: Proper xarray usage for contexts_lookup
> d97ab34c8bac drm/i915/guc: Rework and simplify locking
> 4c980575f7af drm/i915/guc: Move guc_blocked fence to struct guc_state
> 1c2b4c0ac62a drm/i915/guc: Release submit fence from an IRQ
> 14dc302536e1 drm/i915/guc: Flush G2H work queue during reset
> c0ad63d810e6 drm/i915: Allocate error capture in atomic context
> 6c1c488a3654 drm/i915/guc: Reset LRC descriptor if register returns -ENODEV
> 025e88fa74d3 drm/i915/guc: Don't touch guc_state.sched_state without a lock
> b929abcf3b59 drm/i915/guc: Take context ref when cancelling request
> b5e8c08dff35 drm/i915/selftests: Add initial GuC selftest for scrubbing lost G2H
> cddf94c9bda0 drm/i915/selftests: Fix memory corruption in live_lrc_isolation
> c18da32e671c drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered
> 0c0928ba1ba8 drm/i915/selftests: Add a cancel request selftest that triggers a reset
> 8ec967ce47da drm/i915/guc: Workaround reset G2H is received after schedule done G2H
> 2739b8f8966f drm/i915/guc: Don't drop ce->guc_active.lock when unwinding context
> 1d414637d838 drm/i915/guc: Unwind context requests in reverse order
> 7fe738f56abc drm/i915/guc: outstanding G2H accounting
> c9f3859e5dce drm/i915/guc: Fix blocked context accounting