Message ID | 1480726985-12762-1-git-send-email-mattst88@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Dec 2, 2016 at 5:03 PM, Matt Turner <mattst88@gmail.com> wrote: > From these instructions, users assume that /sys/class/drm/card0/error > contains all the information a developer needs to diagnose and fix a GPU > hang. > > In fact it doesn't, and we have no tools for solving them (other than > stabbing in the dark). Most of the time the error state itself isn't > even useful because it just shows a hang on a PIPE_CONTROL or similar. > > Until a time when the error state contains enough information to > actually solve a hang, stop telling users to file unsolvable bugs, and > instead rely on users who know where and how to file a good bug report > to find their own way there. > > Signed-off-by: Matt Turner <mattst88@gmail.com> > --- > Maybe now's a good time to discuss what *would* be useful to put in the > error state for debugging hangs. The currently executing shader program > would be a great place to start. Looks like Ben had a patch: https://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/?h=extra_error_objs&id=711da2570cd3302d0cb9f2489a101e4a7c966a6c
On Fri, Dec 02, 2016 at 05:03:05PM -0800, Matt Turner wrote: > From these instructions, users assume that /sys/class/drm/card0/error > contains all the information a developer needs to diagnose and fix a GPU > hang. > > In fact it doesn't, and we have no tools for solving them (other than > stabbing in the dark). Most of the time the error state itself isn't > even useful because it just shows a hang on a PIPE_CONTROL or similar. > > Until a time when the error state contains enough information to > actually solve a hang, stop telling users to file unsolvable bugs, and > instead rely on users who know where and how to file a good bug report > to find their own way there. > > Signed-off-by: Matt Turner <mattst88@gmail.com> Nak. Though having stale bug reports is a pain, we've recently adopted the policy of stopping the request after a certain period, those bug reports are still vital. They don't just represent bugs in mesa. > --- > Maybe now's a good time to discuss what *would* be useful to put in the > error state for debugging hangs. The currently executing shader program > would be a great place to start. Now? That is the conversation we've being trying to have for several years. The contents of the error state are currently about sufficient to spot kernel bugs, triage the culprit and the general class of bug. Capturing all state for a request is unfeasible (because we can't copy the gigabytes of memory required). Copying a selected set of aux bo is one option. And since those bo are under user control and do not have to be executed, you can even store aub data in them or whatnot. Even if you make attaching the debug information conditional, I would still keep the error message unconditional. -Chris
On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote: > From these instructions, users assume that /sys/class/drm/card0/error > contains all the information a developer needs to diagnose and fix a GPU > hang. > > In fact it doesn't, and we have no tools for solving them (other than > stabbing in the dark). Most of the time the error state itself isn't > even useful because it just shows a hang on a PIPE_CONTROL or similar. > > Until a time when the error state contains enough information to > actually solve a hang, stop telling users to file unsolvable bugs, and > instead rely on users who know where and how to file a good bug report > to find their own way there. > > Signed-off-by: Matt Turner <mattst88@gmail.com> > --- > Maybe now's a good time to discuss what *would* be useful to put in the > error state for debugging hangs. The currently executing shader program > would be a great place to start. I'm wondering why we're getting this patch now, and my guess is that it's because we have been reassigning the related bugs to Mesa more actively lately. Is that the case? IIUC the bug reports are useful for us when it's a kernel bug, but less useful for you when it's a Mesa bug. And you'd rather have fewer incoming bugs that you think are unsolvable with the information at hand. Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed out. BR, Jani. > > drivers/gpu/drm/i915/i915_gpu_error.c | 11 ----------- > 1 file changed, 11 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c > index 334f15d..8ddca7b 100644 > --- a/drivers/gpu/drm/i915/i915_gpu_error.c > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c > @@ -1431,7 +1431,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv, > u32 engine_mask, > const char *error_msg) > { > - static bool warned; > struct drm_i915_error_state *error; > unsigned long flags; > > @@ -1475,16 +1474,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv, > i915_error_state_free(&error->ref); > return; > } > - > - if (!warned) { > - DRM_INFO("GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.\n"); > - DRM_INFO("Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel\n"); > - DRM_INFO("drm/i915 developers can then reassign to the right component if it's not a kernel issue.\n"); > - DRM_INFO("The gpu crash dump is required to analyze gpu hangs, so please always attach it.\n"); > - DRM_INFO("GPU crash dump saved to /sys/class/drm/card%d/error\n", > - dev_priv->drm.primary->index); > - warned = true; > - } > } > > void i915_error_state_get(struct drm_device *dev,
On Sat, Dec 3, 2016 at 1:52 AM, Jani Nikula <jani.nikula@linux.intel.com> wrote: > On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote: >> From these instructions, users assume that /sys/class/drm/card0/error >> contains all the information a developer needs to diagnose and fix a GPU >> hang. >> >> In fact it doesn't, and we have no tools for solving them (other than >> stabbing in the dark). Most of the time the error state itself isn't >> even useful because it just shows a hang on a PIPE_CONTROL or similar. >> >> Until a time when the error state contains enough information to >> actually solve a hang, stop telling users to file unsolvable bugs, and >> instead rely on users who know where and how to file a good bug report >> to find their own way there. >> >> Signed-off-by: Matt Turner <mattst88@gmail.com> >> --- >> Maybe now's a good time to discuss what *would* be useful to put in the >> error state for debugging hangs. The currently executing shader program >> would be a great place to start. > > I'm wondering why we're getting this patch now, and my guess is that > it's because we have been reassigning the related bugs to Mesa more > actively lately. Is that the case? No, it's simply because I spent a week going through Bugzilla and realized how incomplete an unactionable the majority of GPU hang reports are. Asking users to report bugs, but not telling them what actually constitutes a bug report, is a recipe for a lot of wasted developer time. I suspect we could improve the usefulness of the reports by directing users to a webpage that gave a few suggestions (tell us what you were doing when the hang occurred would be an obvious one) about filing a bug and then provided a link to Bugzilla. Or even configured Bugzilla to have a default template that requested various bits of information. > IIUC the bug reports are useful for us when it's a kernel bug, but less > useful for you when it's a Mesa bug. And you'd rather have fewer > incoming bugs that you think are unsolvable with the information at > hand. > > Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed > out. Indeed. I'd rather have the information provided in a bug report to actually solve it. I hope having access to the shader program will make many more reports useful. I am also happy to see that there's now a sunset to the GPU hang message.
On Mon, Dec 05, 2016 at 04:55:47PM -0800, Matt Turner wrote: > On Sat, Dec 3, 2016 at 1:52 AM, Jani Nikula <jani.nikula@linux.intel.com> wrote: > > On Sat, 03 Dec 2016, Matt Turner <mattst88@gmail.com> wrote: > >> From these instructions, users assume that /sys/class/drm/card0/error > >> contains all the information a developer needs to diagnose and fix a GPU > >> hang. > >> > >> In fact it doesn't, and we have no tools for solving them (other than > >> stabbing in the dark). Most of the time the error state itself isn't > >> even useful because it just shows a hang on a PIPE_CONTROL or similar. > >> > >> Until a time when the error state contains enough information to > >> actually solve a hang, stop telling users to file unsolvable bugs, and > >> instead rely on users who know where and how to file a good bug report > >> to find their own way there. > >> > >> Signed-off-by: Matt Turner <mattst88@gmail.com> > >> --- > >> Maybe now's a good time to discuss what *would* be useful to put in the > >> error state for debugging hangs. The currently executing shader program > >> would be a great place to start. > > > > I'm wondering why we're getting this patch now, and my guess is that > > it's because we have been reassigning the related bugs to Mesa more > > actively lately. Is that the case? > > No, it's simply because I spent a week going through Bugzilla and > realized how incomplete an unactionable the majority of GPU hang > reports are. > > Asking users to report bugs, but not telling them what actually > constitutes a bug report, is a recipe for a lot of wasted developer > time. > > I suspect we could improve the usefulness of the reports by directing > users to a webpage that gave a few suggestions (tell us what you were > doing when the hang occurred would be an obvious one) about filing a > bug and then provided a link to Bugzilla. Or even configured Bugzilla > to have a default template that requested various bits of information. I think dumping at least some of the aux buffers should make this tons more useful for mesa, since it would indicate stuff like "we always die on resolves on skl gt4" or stuff like that. Thus far error states have been mostly used by kernel folks to debug kernel issues, which is why none of that additional stuff gets dumped. But a bare-bones parser to hunt for indirect state base addresses and fish out the aux stuff shouldn't be that hard, and might make this fully useful. Like Chris said the goal is to at least be able to triage and classify bugs, and I'm perfectly fine with merging additional code to the dumper to get there for mesa folks. We z-compress the state, so size isn't really an issue. And Ben has commit rights, so shouldn't be a problem to get this all merged. > > IIUC the bug reports are useful for us when it's a kernel bug, but less > > useful for you when it's a Mesa bug. And you'd rather have fewer > > incoming bugs that you think are unsolvable with the information at > > hand. > > > > Sounds like a bug workflow issue between drm/i915 and Mesa to be ironed > > out. > > Indeed. I'd rather have the information provided in a bug report to > actually solve it. I hope having access to the shader program will > make many more reports useful. > > I am also happy to see that there's now a sunset to the GPU hang message. The other option is that mesa folks don't want error states that we triage to mesa. We could definitely update the process in that area. -Daniel
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 334f15d..8ddca7b 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1431,7 +1431,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv, u32 engine_mask, const char *error_msg) { - static bool warned; struct drm_i915_error_state *error; unsigned long flags; @@ -1475,16 +1474,6 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv, i915_error_state_free(&error->ref); return; } - - if (!warned) { - DRM_INFO("GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.\n"); - DRM_INFO("Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel\n"); - DRM_INFO("drm/i915 developers can then reassign to the right component if it's not a kernel issue.\n"); - DRM_INFO("The gpu crash dump is required to analyze gpu hangs, so please always attach it.\n"); - DRM_INFO("GPU crash dump saved to /sys/class/drm/card%d/error\n", - dev_priv->drm.primary->index); - warned = true; - } } void i915_error_state_get(struct drm_device *dev,
From these instructions, users assume that /sys/class/drm/card0/error contains all the information a developer needs to diagnose and fix a GPU hang. In fact it doesn't, and we have no tools for solving them (other than stabbing in the dark). Most of the time the error state itself isn't even useful because it just shows a hang on a PIPE_CONTROL or similar. Until a time when the error state contains enough information to actually solve a hang, stop telling users to file unsolvable bugs, and instead rely on users who know where and how to file a good bug report to find their own way there. Signed-off-by: Matt Turner <mattst88@gmail.com> --- Maybe now's a good time to discuss what *would* be useful to put in the error state for debugging hangs. The currently executing shader program would be a great place to start. drivers/gpu/drm/i915/i915_gpu_error.c | 11 ----------- 1 file changed, 11 deletions(-)