diff mbox series

[RFC,160/162] drm/i915/dg1: Fix GPU hang due to shmemfs page drop

Message ID 20201127120718.454037-161-matthew.auld@intel.com (mailing list archive)
State New, archived
Headers show
Series DG1 + LMEM enabling | expand

Commit Message

Matthew Auld Nov. 27, 2020, 12:07 p.m. UTC
From: Venkata Ramana Nayana <venkata.ramana.nayana@intel.com>

This is to fix a bug in upstream
commit a6326a4f8ffb ("drm/i915/gt: Keep a no-frills swappable copy of the default context state")

We allocate context state obj ce->state from lmem, so in __engines_record_defaults(),
we call shmem_create_from_object(). Because it is lmem object, this call will
create a new shmemfs file, copy the contents into it, and return the file
pointer and assign to engine->default_state. Of course ce->state lmem object
is freed at the end of function __engines_record_redefaults().

Because a new shmemfs file is create for engine->default_state,
and more importantly, we DON'T mark the pages dirty after we write into it,
the OS page cache eviction will drop these pages.

Now with the test move forward, it will create new request/context, and will
copy the saved engine->default_state into ce->state. If the default_state
pages are dropped during page cache eviction, the copying will get new pages,
and copy garbage from the new pages. Next, ce->state will have wrong
instruction and causes GPU to hang.

The fixing is very simple, we just mark the shmemfs pages to be dirty when
writing into it, and also mark the pages to accessed when read/write to them.

Fixes: a6326a4f8ffb("drm/i915/gt: Keep a no-frills swappable copy of the default context state")
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Ramalingam C <ramalingam.c@intel.com>
Cc: Chris Wilson <chris@intel.com>
Signed-off-by: CQ Tang <cq.tang@intel.com>
Signed-off-by: Venkata Ramana Nayana <venkata.ramana.nayana@intel.com>
---
 drivers/gpu/drm/i915/gt/shmem_utils.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Chris Wilson Nov. 27, 2020, 2:44 p.m. UTC | #1
Quoting Matthew Auld (2020-11-27 12:07:16)
> From: Venkata Ramana Nayana <venkata.ramana.nayana@intel.com>
> 
> This is to fix a bug in upstream
> commit a6326a4f8ffb ("drm/i915/gt: Keep a no-frills swappable copy of the default context state")
> 
> We allocate context state obj ce->state from lmem, so in __engines_record_defaults(),
> we call shmem_create_from_object(). Because it is lmem object, this call will
> create a new shmemfs file, copy the contents into it, and return the file
> pointer and assign to engine->default_state. Of course ce->state lmem object
> is freed at the end of function __engines_record_redefaults().
> 
> Because a new shmemfs file is create for engine->default_state,
> and more importantly, we DON'T mark the pages dirty after we write into it,
> the OS page cache eviction will drop these pages.
> 
> Now with the test move forward, it will create new request/context, and will
> copy the saved engine->default_state into ce->state. If the default_state
> pages are dropped during page cache eviction, the copying will get new pages,
> and copy garbage from the new pages. Next, ce->state will have wrong
> instruction and causes GPU to hang.
> 
> The fixing is very simple, we just mark the shmemfs pages to be dirty when
> writing into it, and also mark the pages to accessed when read/write to them.
> 
> Fixes: a6326a4f8ffb("drm/i915/gt: Keep a no-frills swappable copy of the default context state")

A bug fix, send it. But please write a concise changelog first.

I missed setting the dirty bit, and so the contents were not being saved
on swap out as expected. Impact is severe; any context created after
resume may be gibberish.
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c b/drivers/gpu/drm/i915/gt/shmem_utils.c
index 1fbc070a4651..e24c2c2342bb 100644
--- a/drivers/gpu/drm/i915/gt/shmem_utils.c
+++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
@@ -105,10 +105,13 @@  static int __shmem_rw(struct file *file, loff_t off,
 			return PTR_ERR(page);
 
 		vaddr = kmap(page);
-		if (write)
+		if (write) {
 			memcpy(vaddr + offset_in_page(off), ptr, this);
-		else
+			set_page_dirty(page);
+		} else {
 			memcpy(ptr, vaddr + offset_in_page(off), this);
+		}
+		mark_page_accessed(page);
 		kunmap(page);
 		put_page(page);