Message ID | 20210917233818.33659-1-matthew.brost@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/i915: fix blank screen booting crashes |
On 18/09/2021 00:38, Matthew Brost wrote:
> From: Hugh Dickins <hughd@google.com>
>
> 5.15-rc1 crashes with blank screen when booting up on two ThinkPads
> using i915. Bisections converge convincingly, but arrive at different
> and surprising "culprits", none of them the actual culprit.

It is certainly surprising this patch crashed SNB and KBL.

How feasible would it be to make this code just not run when GuC is not
used? Given the field it adds is called ce->guc_blocked it sounds like a
natural and preferable thing to do... if possible.

> netconsole (with init_netconsole() hacked to call i915_init() when
> logging has started, instead of by module_init()) tells the story:
>
> kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
> with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
> I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
> function needs to be 4-byte aligned.
>
> v2:
>  (Jani Nikula)
>   - Change BUG_ON to WARN_ON

However in this case the code would then go on and call into a wrong
function offset which may be worse than a BUG_ON, no?

>
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_context.c | 1 +
>  drivers/gpu/drm/i915/i915_sw_fence.c    | 4 +++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index ff637147b1a9..f02c2202da9d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -362,6 +362,7 @@ static int __intel_context_active(struct i915_active *active)
>  	return 0;
>  }
>
> +__aligned(4) /* Respect the I915_SW_FENCE_MASK */

Hugh suggested __i915_sw_fence_call which I think would be the right
thing to do.

Regards,

Tvrtko

>  static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>  				  enum i915_sw_fence_notify state)
>  {
> diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
> index c589a681da77..1217b124c1d0 100644
> --- a/drivers/gpu/drm/i915/i915_sw_fence.c
> +++ b/drivers/gpu/drm/i915/i915_sw_fence.c
> @@ -14,8 +14,10 @@
>
>  #if IS_ENABLED(CONFIG_DRM_I915_DEBUG)
>  #define I915_SW_FENCE_BUG_ON(expr) BUG_ON(expr)
> +#define I915_SW_FENCE_WARN_ON(expr) WARN_ON(expr)
>  #else
>  #define I915_SW_FENCE_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr)
> +#define I915_SW_FENCE_WARN_ON(expr) BUILD_BUG_ON_INVALID(expr)
>  #endif
>
>  static DEFINE_SPINLOCK(i915_sw_fence_lock);
> @@ -242,7 +244,7 @@ void __i915_sw_fence_init(struct i915_sw_fence *fence,
>  			  const char *name,
>  			  struct lock_class_key *key)
>  {
> -	BUG_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
> +	I915_SW_FENCE_WARN_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
>
>  	__init_waitqueue_head(&fence->wait, name, key);
>  	fence->flags = (unsigned long)fn;
>
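Editorial sketch of the alignment requirement discussed in the message above. The patch stores the notify function's address in fence->flags and checks it against I915_SW_FENCE_MASK, so the low bits of the address must be clear. The mask value is not shown in this thread; the sketch assumes the two low bits are reserved, which matches the "4-byte aligned" wording in the report. All SKETCH_/sketch_ names are hypothetical and this is userspace C, not the driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the two low bits of the stored word are reserved for
 * flags, so the valid pointer bits are everything except bits 0-1. */
#define SKETCH_FENCE_MASK (~UINT64_C(3))

/* Returns 1 when addr leaves the reserved low bits clear, else 0. */
static int sketch_fits_mask(uint64_t addr)
{
	return (addr & ~SKETCH_FENCE_MASK) == 0;
}

/* Pack a function address and up to two flag bits into one word. */
static uint64_t sketch_pack(uint64_t fn_addr, unsigned int flag_bits)
{
	assert(sketch_fits_mask(fn_addr)); /* same idea as the check in the patch */
	return fn_addr | (flag_bits & 3u);
}

/* Recover the function address by masking the flag bits off again. */
static uint64_t sketch_unpack_fn(uint64_t word)
{
	return word & SKETCH_FENCE_MASK;
}
```

Under that assumption, the RSI value from the report, ffffffff814d408b, ends in binary ...1011, so both reserved bits are set and sketch_fits_mask() rejects it. An address ending in ...8c would pass.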
On Mon, 20 Sep 2021, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> On 18/09/2021 00:38, Matthew Brost wrote:
>> From: Hugh Dickins <hughd@google.com>
>>
>> 5.15-rc1 crashes with blank screen when booting up on two ThinkPads
>> using i915. Bisections converge convincingly, but arrive at different
>> and surprising "culprits", none of them the actual culprit.
>
> It is certainly surprising this patch crashed SNB and KBL.
>
> How feasible would it be to make this code just not run when GuC is not
> used? Given the field it adds is called ce->guc_blocked it sounds like a
> natural and preferable thing to do... if possible.
>
>> netconsole (with init_netconsole() hacked to call i915_init() when
>> logging has started, instead of by module_init()) tells the story:
>>
>> kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
>> with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
>> I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
>> function needs to be 4-byte aligned.
>>
>> v2:
>>  (Jani Nikula)
>>   - Change BUG_ON to WARN_ON
>
> However in this case the code would then go on and call into a wrong
> function offset which may be worse than a BUG_ON, no?

So how about just

	if (WARN_ON(...))
		return;

or whatever is needed to give both the user and the CI a better
opportunity to see the error.

BR,
Jani

>
>>
>> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_context.c | 1 +
>>  drivers/gpu/drm/i915/i915_sw_fence.c    | 4 +++-
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>> index ff637147b1a9..f02c2202da9d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>> @@ -362,6 +362,7 @@ static int __intel_context_active(struct i915_active *active)
>>  	return 0;
>>  }
>>
>> +__aligned(4) /* Respect the I915_SW_FENCE_MASK */
>
> Hugh suggested __i915_sw_fence_call which I think would be the right
> thing to do.
>
> Regards,
>
> Tvrtko
>
>>  static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>>  				  enum i915_sw_fence_notify state)
>>  {
>> diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
>> index c589a681da77..1217b124c1d0 100644
>> --- a/drivers/gpu/drm/i915/i915_sw_fence.c
>> +++ b/drivers/gpu/drm/i915/i915_sw_fence.c
>> @@ -14,8 +14,10 @@
>>
>>  #if IS_ENABLED(CONFIG_DRM_I915_DEBUG)
>>  #define I915_SW_FENCE_BUG_ON(expr) BUG_ON(expr)
>> +#define I915_SW_FENCE_WARN_ON(expr) WARN_ON(expr)
>>  #else
>>  #define I915_SW_FENCE_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr)
>> +#define I915_SW_FENCE_WARN_ON(expr) BUILD_BUG_ON_INVALID(expr)
>>  #endif
>>
>>  static DEFINE_SPINLOCK(i915_sw_fence_lock);
>> @@ -242,7 +244,7 @@ void __i915_sw_fence_init(struct i915_sw_fence *fence,
>>  			  const char *name,
>>  			  struct lock_class_key *key)
>>  {
>> -	BUG_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
>> +	I915_SW_FENCE_WARN_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
>>
>>  	__init_waitqueue_head(&fence->wait, name, key);
>>  	fence->flags = (unsigned long)fn;
>>
On 20/09/2021 08:38, Jani Nikula wrote:
> On Mon, 20 Sep 2021, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
>> On 18/09/2021 00:38, Matthew Brost wrote:
>>> From: Hugh Dickins <hughd@google.com>
>>>
>>> 5.15-rc1 crashes with blank screen when booting up on two ThinkPads
>>> using i915. Bisections converge convincingly, but arrive at different
>>> and surprising "culprits", none of them the actual culprit.
>>
>> It is certainly surprising this patch crashed SNB and KBL.
>>
>> How feasible would it be to make this code just not run when GuC is not
>> used? Given the field it adds is called ce->guc_blocked it sounds like a
>> natural and preferable thing to do... if possible.
>>
>>> netconsole (with init_netconsole() hacked to call i915_init() when
>>> logging has started, instead of by module_init()) tells the story:
>>>
>>> kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
>>> with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
>>> I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
>>> function needs to be 4-byte aligned.
>>>
>>> v2:
>>>  (Jani Nikula)
>>>   - Change BUG_ON to WARN_ON
>>
>> However in this case the code would then go on and call into a wrong
>> function offset which may be worse than a BUG_ON, no?
>
> So how about just
>
> 	if (WARN_ON(...))
> 		return;
>
> or whatever is needed to give both the user and the CI a better
> opportunity to see the error.

Sounds good to me.

Regards,

Tvrtko

>
> BR,
> Jani
>
>
>>
>>>
>>> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>  drivers/gpu/drm/i915/gt/intel_context.c | 1 +
>>>  drivers/gpu/drm/i915/i915_sw_fence.c    | 4 +++-
>>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
>>> index ff637147b1a9..f02c2202da9d 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>>> @@ -362,6 +362,7 @@ static int __intel_context_active(struct i915_active *active)
>>>  	return 0;
>>>  }
>>>
>>> +__aligned(4) /* Respect the I915_SW_FENCE_MASK */
>>
>> Hugh suggested __i915_sw_fence_call which I think would be the right
>> thing to do.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>  static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>>>  				  enum i915_sw_fence_notify state)
>>>  {
>>> diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
>>> index c589a681da77..1217b124c1d0 100644
>>> --- a/drivers/gpu/drm/i915/i915_sw_fence.c
>>> +++ b/drivers/gpu/drm/i915/i915_sw_fence.c
>>> @@ -14,8 +14,10 @@
>>>
>>>  #if IS_ENABLED(CONFIG_DRM_I915_DEBUG)
>>>  #define I915_SW_FENCE_BUG_ON(expr) BUG_ON(expr)
>>> +#define I915_SW_FENCE_WARN_ON(expr) WARN_ON(expr)
>>>  #else
>>>  #define I915_SW_FENCE_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr)
>>> +#define I915_SW_FENCE_WARN_ON(expr) BUILD_BUG_ON_INVALID(expr)
>>>  #endif
>>>
>>>  static DEFINE_SPINLOCK(i915_sw_fence_lock);
>>> @@ -242,7 +244,7 @@ void __i915_sw_fence_init(struct i915_sw_fence *fence,
>>>  			  const char *name,
>>>  			  struct lock_class_key *key)
>>>  {
>>> -	BUG_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
>>> +	I915_SW_FENCE_WARN_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
>>>
>>>  	__init_waitqueue_head(&fence->wait, name, key);
>>>  	fence->flags = (unsigned long)fn;
>>>
>
On Mon, Sep 20, 2021 at 08:28:13AM +0100, Tvrtko Ursulin wrote:
>
> On 18/09/2021 00:38, Matthew Brost wrote:
> > From: Hugh Dickins <hughd@google.com>
> >
> > 5.15-rc1 crashes with blank screen when booting up on two ThinkPads
> > using i915. Bisections converge convincingly, but arrive at different
> > and surprising "culprits", none of them the actual culprit.
>
> It is certainly surprising this patch crashed SNB and KBL.
>
> How feasible would it be to make this code just not run when GuC is not
> used? Given the field it adds is called ce->guc_blocked it sounds like a
> natural and preferable thing to do... if possible.
>

I can likely do this in a follow up patch.

> > netconsole (with init_netconsole() hacked to call i915_init() when
> > logging has started, instead of by module_init()) tells the story:
> >
> > kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
> > with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
> > I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
> > function needs to be 4-byte aligned.
> >
> > v2:
> >  (Jani Nikula)
> >   - Change BUG_ON to WARN_ON
>
> However in this case the code would then go on and call into a wrong
> function offset which may be worse than a BUG_ON, no?
>

Yea, I guess that would be bad too.

> >
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_context.c | 1 +
> >  drivers/gpu/drm/i915/i915_sw_fence.c    | 4 +++-
> >  2 files changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > index ff637147b1a9..f02c2202da9d 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > @@ -362,6 +362,7 @@ static int __intel_context_active(struct i915_active *active)
> >  	return 0;
> >  }
> > +__aligned(4) /* Respect the I915_SW_FENCE_MASK */
>
> Hugh suggested __i915_sw_fence_call which I think would be the right thing
> to do.
>

Yep. Will do.

Matt

> Regards,
>
> Tvrtko
>
> >  static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
> >  				  enum i915_sw_fence_notify state)
> >  {
> > diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
> > index c589a681da77..1217b124c1d0 100644
> > --- a/drivers/gpu/drm/i915/i915_sw_fence.c
> > +++ b/drivers/gpu/drm/i915/i915_sw_fence.c
> > @@ -14,8 +14,10 @@
> >  #if IS_ENABLED(CONFIG_DRM_I915_DEBUG)
> >  #define I915_SW_FENCE_BUG_ON(expr) BUG_ON(expr)
> > +#define I915_SW_FENCE_WARN_ON(expr) WARN_ON(expr)
> >  #else
> >  #define I915_SW_FENCE_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr)
> > +#define I915_SW_FENCE_WARN_ON(expr) BUILD_BUG_ON_INVALID(expr)
> >  #endif
> >  static DEFINE_SPINLOCK(i915_sw_fence_lock);
> > @@ -242,7 +244,7 @@ void __i915_sw_fence_init(struct i915_sw_fence *fence,
> >  			  const char *name,
> >  			  struct lock_class_key *key)
> >  {
> > -	BUG_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
> > +	I915_SW_FENCE_WARN_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
> >  	__init_waitqueue_head(&fence->wait, name, key);
> >  	fence->flags = (unsigned long)fn;
> >
On Mon, Sep 20, 2021 at 08:42:42AM +0100, Tvrtko Ursulin wrote:
>
> On 20/09/2021 08:38, Jani Nikula wrote:
> > On Mon, 20 Sep 2021, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> > > On 18/09/2021 00:38, Matthew Brost wrote:
> > > > From: Hugh Dickins <hughd@google.com>
> > > >
> > > > 5.15-rc1 crashes with blank screen when booting up on two ThinkPads
> > > > using i915. Bisections converge convincingly, but arrive at different
> > > > and surprising "culprits", none of them the actual culprit.
> > >
> > > It is certainly surprising this patch crashed SNB and KBL.
> > >
> > > How feasible would it be to make this code just not run when GuC is not
> > > used? Given the field it adds is called ce->guc_blocked it sounds like a
> > > natural and preferable thing to do... if possible.
> > >
> > > > netconsole (with init_netconsole() hacked to call i915_init() when
> > > > logging has started, instead of by module_init()) tells the story:
> > > >
> > > > kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
> > > > with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
> > > > I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
> > > > function needs to be 4-byte aligned.
> > > >
> > > > v2:
> > > >  (Jani Nikula)
> > > >   - Change BUG_ON to WARN_ON
> > >
> > > However in this case the code would then go on and call into a wrong
> > > function offset which may be worse than a BUG_ON, no?
> >
> > So how about just
> >
> > 	if (WARN_ON(...))
> > 		return;

I don't think it is quite that simple: if we short-circuit this function,
fence->flags will be NULL, which would be bad too. I'll have to make a few
more changes to make this safe.

Matt

> >
> > or whatever is needed to give both the user and the CI a better
> > opportunity to see the error.
>
> Sounds good to me.
>
> Regards,
>
> Tvrtko
>
> >
> > BR,
> > Jani
> >
> >
> > >
> > > >
> > > > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > Reviewed-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_context.c | 1 +
> > > >  drivers/gpu/drm/i915/i915_sw_fence.c    | 4 +++-
> > > >  2 files changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index ff637147b1a9..f02c2202da9d 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -362,6 +362,7 @@ static int __intel_context_active(struct i915_active *active)
> > > >  	return 0;
> > > >  }
> > > > +__aligned(4) /* Respect the I915_SW_FENCE_MASK */
> > >
> > > Hugh suggested __i915_sw_fence_call which I think would be the right
> > > thing to do.
> > >
> > > Regards,
> > >
> > > Tvrtko
> > >
> > > >  static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
> > > >  				  enum i915_sw_fence_notify state)
> > > >  {
> > > > diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
> > > > index c589a681da77..1217b124c1d0 100644
> > > > --- a/drivers/gpu/drm/i915/i915_sw_fence.c
> > > > +++ b/drivers/gpu/drm/i915/i915_sw_fence.c
> > > > @@ -14,8 +14,10 @@
> > > >  #if IS_ENABLED(CONFIG_DRM_I915_DEBUG)
> > > >  #define I915_SW_FENCE_BUG_ON(expr) BUG_ON(expr)
> > > > +#define I915_SW_FENCE_WARN_ON(expr) WARN_ON(expr)
> > > >  #else
> > > >  #define I915_SW_FENCE_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr)
> > > > +#define I915_SW_FENCE_WARN_ON(expr) BUILD_BUG_ON_INVALID(expr)
> > > >  #endif
> > > >  static DEFINE_SPINLOCK(i915_sw_fence_lock);
> > > > @@ -242,7 +244,7 @@ void __i915_sw_fence_init(struct i915_sw_fence *fence,
> > > >  			  const char *name,
> > > >  			  struct lock_class_key *key)
> > > >  {
> > > > -	BUG_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
> > > > +	I915_SW_FENCE_WARN_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
> > > >  	__init_waitqueue_head(&fence->wait, name, key);
> > > >  	fence->flags = (unsigned long)fn;
> > > >
> >
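Editorial sketch of the concern raised in the message above: warning and returning early avoids storing a misaligned pointer, but it also skips the assignment to fence->flags, so a later lookup of the notify function finds nothing. This is a userspace illustration only. The struct, the helper names, the mask value (two low bits reserved) and the assumption that the fence memory was zeroed beforehand are all hypothetical, and sketch_warn_on() merely imitates the way WARN_ON() yields its condition.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: two low bits are reserved for flags. */
#define SKETCH_FENCE_MASK (~(uintptr_t)3)

/* Hypothetical stand-in for the fence; only the field the thread mentions. */
struct sketch_fence {
	uintptr_t flags; /* notify function address plus low flag bits */
};

/* Userspace stand-in for WARN_ON(): report, then yield the condition. */
static int sketch_warn_on(int cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/* The "if (WARN_ON(...)) return;" variant suggested in the thread. */
static void sketch_fence_init(struct sketch_fence *fence, uintptr_t fn_addr)
{
	assert(fence != NULL);
	if (sketch_warn_on(!fn_addr || (fn_addr & ~SKETCH_FENCE_MASK),
			   "notify fn missing or misaligned"))
		return; /* fence->flags is never assigned on this path */
	fence->flags = fn_addr;
}

/* What a later caller would use as the notify function address. */
static uintptr_t sketch_fence_fn(const struct sketch_fence *fence)
{
	return fence->flags & SKETCH_FENCE_MASK;
}
```

With a zeroed fence and a misaligned address such as 0x1003, sketch_fence_init() warns and returns, and sketch_fence_fn() then yields 0, which a caller would dereference as a NULL function pointer. The thread does not say which further changes were planned to handle this case.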
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index ff637147b1a9..f02c2202da9d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -362,6 +362,7 @@ static int __intel_context_active(struct i915_active *active)
 	return 0;
 }
 
+__aligned(4) /* Respect the I915_SW_FENCE_MASK */
 static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
 				  enum i915_sw_fence_notify state)
 {
diff --git a/drivers/gpu/drm/i915/i915_sw_fence.c b/drivers/gpu/drm/i915/i915_sw_fence.c
index c589a681da77..1217b124c1d0 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence.c
+++ b/drivers/gpu/drm/i915/i915_sw_fence.c
@@ -14,8 +14,10 @@
 
 #if IS_ENABLED(CONFIG_DRM_I915_DEBUG)
 #define I915_SW_FENCE_BUG_ON(expr) BUG_ON(expr)
+#define I915_SW_FENCE_WARN_ON(expr) WARN_ON(expr)
 #else
 #define I915_SW_FENCE_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr)
+#define I915_SW_FENCE_WARN_ON(expr) BUILD_BUG_ON_INVALID(expr)
 #endif
 
 static DEFINE_SPINLOCK(i915_sw_fence_lock);
@@ -242,7 +244,7 @@ void __i915_sw_fence_init(struct i915_sw_fence *fence,
 			  const char *name,
 			  struct lock_class_key *key)
 {
-	BUG_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
+	I915_SW_FENCE_WARN_ON(!fn || (unsigned long)fn & ~I915_SW_FENCE_MASK);
 
 	__init_waitqueue_head(&fence->wait, name, key);
 	fence->flags = (unsigned long)fn;
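Editorial sketch of one consequence of the diff above: the old check was an unconditional BUG_ON(), while the new I915_SW_FENCE_WARN_ON() maps to BUILD_BUG_ON_INVALID() when CONFIG_DRM_I915_DEBUG is off. To my understanding, BUILD_BUG_ON_INVALID() only type-checks its argument inside sizeof and generates no code, so in such builds the alignment check would not run at all. The userspace approximation below uses hypothetical SKETCH_ names and a hypothetical SKETCH_DEBUG switch in place of the kernel config option.

```c
#include <assert.h>
#include <stdio.h>

/* Counts how many times the checked expression was actually evaluated. */
static int evaluations;

/* Stand-in for the alignment test; always reports a problem. */
static int sketch_bad_fn_check(void)
{
	evaluations++;
	return 1;
}

/* Approximation of BUILD_BUG_ON_INVALID(): the operand of sizeof is
 * type-checked by the compiler but never evaluated at run time. */
#define SKETCH_BUILD_BUG_ON_INVALID(e) ((void)sizeof((long)(e)))

#ifdef SKETCH_DEBUG
/* Debug build: evaluate the expression and warn when it is true. */
#define SKETCH_FENCE_WARN_ON(expr) \
	((expr) ? (void)fprintf(stderr, "WARNING: %s\n", #expr) : (void)0)
#else
/* Non-debug build: the check compiles away entirely. */
#define SKETCH_FENCE_WARN_ON(expr) SKETCH_BUILD_BUG_ON_INVALID(expr)
#endif

/* Returns how many times the check ran during this call. */
static int sketch_run_check(void)
{
	int before = evaluations;

	SKETCH_FENCE_WARN_ON(sketch_bad_fn_check());
	return evaluations - before;
}
```

Built without SKETCH_DEBUG, sketch_run_check() returns 0 because the expression is never evaluated. Built with -DSKETCH_DEBUG it returns 1 and prints the warning.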