[v2,0/2] drm/i915: Failsafe migration blits

Message ID 20211101183851.291015-1-thomas.hellstrom@linux.intel.com (mailing list archive)

Message

Thomas Hellstrom Nov. 1, 2021, 6:38 p.m. UTC
This patch series introduces failsafe migration blits.
The reason for this seemingly strange concept is that if the initial
clearing or readback of LMEM fails for some reason[1], and we then set up
either GPU or CPU PTEs pointing to the allocated LMEM, we can expose old
contents from other clients.

So after each migration blit to LMEM, attach a dma-fence callback that
checks the migration fence error value and, if it indicates an error,
performs a memcpy blit instead.
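
As a rough illustration of the callback idea described above, here is a
minimal userspace sketch. It is not the code from this series and does not
use the kernel's dma-fence API: struct mock_fence, struct mock_copy,
failsafe_cb() and mock_fence_signal() are all hypothetical stand-ins, and
the "GPU blit" itself is not modelled, only its error value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a fence; NOT the kernel's struct dma_fence. */
struct mock_fence {
	int error;                               /* 0 on success, negative errno on failure */
	void (*cb)(struct mock_fence *, void *); /* runs when the fence signals */
	void *cb_data;
};

/* Hypothetical description of one migration copy. */
struct mock_copy {
	void *dst;
	const void *src;
	size_t len;
	int used_fallback;                       /* set when the CPU copy was needed */
};

/* Callback: if the (unmodelled) GPU blit failed, redo the copy on the CPU. */
static void failsafe_cb(struct mock_fence *f, void *data)
{
	struct mock_copy *c = data;

	assert(c && c->dst && c->src);
	if (f->error) {
		memcpy(c->dst, c->src, c->len);
		c->used_fallback = 1;
	}
}

/* Signal the fence with a result and run the attached callback, if any. */
static void mock_fence_signal(struct mock_fence *f, int error)
{
	f->error = error;
	if (f->cb)
		f->cb(f, f->cb_data);
}
```

Signalling the mock fence with -EIO makes the callback fall back to
memcpy(), so the destination ends up initialized even though the blit
failed; signalling it with 0 leaves the fallback unused.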

Patch 1 splits out the TTM move code into separate files
Patch 2 implements the failsafe blits and related self-tests

[1] There are at least two ways we could trigger exposure of uninitialized
LMEM, assuming the migration blits themselves never trigger a GPU hang.

a) A GPU operation preceding a pipelined eviction blit is reset and its
fence error is set to -EIO; the error is propagated across the TTM manager
to the clear / swapin blit of a newly allocated TTM resource. That blit
aborts and leaves the memory uninitialized.

b) Something wedges the GT while a migration blit is submitted. The blit
is never executed, and TTM can fault user-space CPU PTEs into
uninitialized memory.
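
To make the two scenarios above concrete, the following userspace sketch
models a clear blit that is pipelined behind a dependency. Everything in it
is an assumption for illustration: struct mock_fence and mock_clear_blit()
are hypothetical names, and a wedged GT is modelled simply as the blit
returning -EIO without touching the buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a fence; only the error value is modelled. */
struct mock_fence {
	int error;
};

/*
 * Hypothetical clear blit pipelined behind a dependency fence.
 * On success it zeroes buf and returns 0. If the dependency carries an
 * error (scenario a) or the engine is wedged (scenario b, modelled here
 * as an -EIO return), it returns a negative errno and leaves buf untouched.
 */
static int mock_clear_blit(const struct mock_fence *dep, int wedged,
			   void *buf, size_t len)
{
	assert(buf);
	if (wedged)
		return -EIO;
	if (dep && dep->error)
		return dep->error;
	memset(buf, 0, len);
	return 0;
}
```

In both failing cases the buffer keeps whatever a previous client left in
it, which is the exposure the failsafe path is meant to close.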

Thomas Hellström (2):
  drm/i915/ttm: Reorganize the ttm move code
  drm/i915/ttm: Failsafe migration blits

 drivers/gpu/drm/i915/Makefile                 |   1 +
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c       | 328 ++---------
 drivers/gpu/drm/i915/gem/i915_gem_ttm.h       |  35 ++
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c  | 520 ++++++++++++++++++
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.h  |  43 ++
 .../drm/i915/gem/selftests/i915_gem_migrate.c |  24 +-
 6 files changed, 670 insertions(+), 281 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
 create mode 100644 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.h

Comments

Thomas Hellstrom Nov. 2, 2021, 8:18 a.m. UTC | #1
On 11/2/21 08:47, Patchwork wrote:
> Patch Details
> Series:  drm/i915: Failsafe migration blits (rev3)
> URL:     https://patchwork.freedesktop.org/series/95617/
> State:   failure
> Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21501/index.html
>
> CI Bug Log - changes from CI_DRM_10826_full -> Patchwork_21501_full
>
> Summary
>
> FAILURE
>
> Serious unknown changes coming with Patchwork_21501_full absolutely
> need to be verified manually.
>
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_21501_full, please notify your bug team to
> allow them to document this new failure mode, which will reduce false
> positives in CI.
>
> Participating hosts (10 -> 10)
>
> No changes in participating hosts
>
> Possible new issues
>
> Here are the unknown changes that may have been introduced in
> Patchwork_21501_full:
>
> IGT changes
>
> Possible regressions
>
>   * igt@gem_workarounds@suspend-resume-fd:
>       o shard-snb: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10826/shard-snb2/igt@gem_workarounds@suspend-resume-fd.html>
>         -> TIMEOUT
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21501/shard-snb2/igt@gem_workarounds@suspend-resume-fd.html>
>
Lakshmi,

This failure is unrelated.

Thanks,
Thomas
Vudum, Lakshminarayana Nov. 2, 2021, 3:41 p.m. UTC | #2
Filed the bug below and re-reported.
https://gitlab.freedesktop.org/drm/intel/-/issues/4420
igt@gem_workarounds@suspend-resume-fd - timeout - Received signal SIGQUIT. Per-test timeout exceeded. Killing the current test with SIGQUIT.

Thanks,
Lakshmi.
From: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Sent: Tuesday, November 2, 2021 2:19 AM
To: intel-gfx@lists.freedesktop.org; Vudum, Lakshminarayana <lakshminarayana.vudum@intel.com>
Subject: Re: ✗ Fi.CI.IGT: failure for drm/i915: Failsafe migration blits (rev3)