mbox series

[0/6] drm/i915: Failsafe migration blits

Message ID 20211008133530.664509-1-thomas.hellstrom@linux.intel.com (mailing list archive)
Headers show
Series drm/i915: Failsafe migration blits | expand

Message

Thomas Hellstrom Oct. 8, 2021, 1:35 p.m. UTC
This patch series introduces failsafe migration blits.
The reason for this seemingly strange concept is that if the initial
clearing or readback of LMEM fails for some reason, and we then set up
either GPU- or CPU ptes to the allocated LMEM, we can expose old
contents from other clients.

So after each migration blit we attach a struct dma-fence-work that checks
the error value and if it's an error, perform a memcpy blit, instead.

This comes with some needed infrastructure updates:

Patch 1, updates dma_fence_work to do the work even if there is an error.
The work callback needs to check for error and take action accordingly.
Patch 2, Introduces refcounted sg-tables. The sg-tables are needed async for
the memcpy.
Patch 3, Introduces the failsafe migration blits and selftests.
Patch 4, Adds the possibility to attach the struct dma_fence_work to a timeline.
Patch 5, Attached the migration fence to a timeline since TTM requires that
for upcoming async eviction.
Patch 6 Adds an optimization for coalescing-only struct dma_fence_work.

Worth to consider during review: Patch 4-6 are probably better done in the
context of struct dma_fence_array. Both since we perhaps shouldn't add
irq work to yet another fence data structure and also because the
i915 command submission can individualize struct dma_fence_arrays.

Also the memcpy solution here isn't a final one as it only works if the
aperture covers all of lmem. We probably need to work on a solution where
we intercept move_fence errors and refuse GPU- and CPU mappings.

Thomas Hellström (6):
  drm/i915: Update dma_fence_work
  drm/i915: Introduce refcounted sg-tables
  drm/i915/ttm: Failsafe migration blits
  drm/i915: Add a struct dma_fence_work timeline
  drm/i915/ttm: Attach the migration fence to a region timeline on
    eviction
  drm/i915: Use irq work for coalescing-only dma-fence-work

 drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   5 +
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |   3 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c       | 467 ++++++++++++++----
 drivers/gpu/drm/i915/gem/i915_gem_ttm.h       |   4 +
 .../drm/i915/gem/selftests/i915_gem_migrate.c |  24 +-
 drivers/gpu/drm/i915/i915_scatterlist.c       |  62 ++-
 drivers/gpu/drm/i915/i915_scatterlist.h       |  76 ++-
 drivers/gpu/drm/i915/i915_sw_fence_work.c     | 145 +++++-
 drivers/gpu/drm/i915/i915_sw_fence_work.h     |  61 +++
 drivers/gpu/drm/i915/i915_vma.c               |  12 +-
 drivers/gpu/drm/i915/intel_memory_region.c    |  43 ++
 drivers/gpu/drm/i915/intel_memory_region.h    |   7 +
 drivers/gpu/drm/i915/intel_region_ttm.c       |  15 +-
 drivers/gpu/drm/i915/intel_region_ttm.h       |   5 +-
 drivers/gpu/drm/i915/selftests/mock_region.c  |  12 +-
 15 files changed, 776 insertions(+), 165 deletions(-)

Comments

Dave Airlie Oct. 14, 2021, 1:50 a.m. UTC | #1
On Fri, 8 Oct 2021 at 23:36, Thomas Hellström
<thomas.hellstrom@linux.intel.com> wrote:
>
> This patch series introduces failsafe migration blits.
> The reason for this seemingly strange concept is that if the initial
> clearing or readback of LMEM fails for some reason, and we then set up
> either GPU- or CPU ptes to the allocated LMEM, we can expose old
> contents from other clients.

Can we enumerate "for some reason" here?

This feels like "security" with no defined threat model. Maybe if the
cover letter contains more details on the threat model it would make
more sense.

Dave.
Thomas Hellstrom Oct. 14, 2021, 7:29 a.m. UTC | #2
Hi, Dave,

On 10/14/21 03:50, Dave Airlie wrote:
> On Fri, 8 Oct 2021 at 23:36, Thomas Hellström
> <thomas.hellstrom@linux.intel.com> wrote:
>> This patch series introduces failsafe migration blits.
>> The reason for this seemingly strange concept is that if the initial
>> clearing or readback of LMEM fails for some reason, and we then set up
>> either GPU- or CPU ptes to the allocated LMEM, we can expose old
>> contents from other clients.
> Can we enumerate "for some reason" here?
>
> This feels like "security" with no defined threat model. Maybe if the
> cover letter contains more details on the threat model it would make
> more sense.

TBH, I'd be quite happy if we could find a way to skip this series (or 
even a reworked version) completely.

Assuming that the migration request setup code is bug-free enough to not 
never cause an engine reset, there are at least two ways I can see the 
migration fail:

1) The migration fence we will be depending on when fully async 
(ttm->moving) may signal with error after the following:
malicious_batchbuffer_causing_reset -> async eviction -> allocation -> 
async clearing

2) malicious_batchbuffers_causing_gt_wedge submitted to copy engine -> 
migration_blit submitted to  copy_engine. If wedging the gt, the 
migration blit will never be executed, fence->error will end up with 
-EIO but TTM will happily fault the pages to user-space.

Now we had other versions around looking at the ttm_bo->moving errors at 
vma binding and cpu faulting, but this was the direction chosen after 
discussions with our arch team. Either way we'd probably want to block 
the error propagation after async_eviction.

I can of course add 1) and 2) above to the cover-letter, but if you have 
any additional input on the best way to handle this, that'd be appreciated.

Thanks,

Thomas





> Dave.