Message ID | 20210519101523.688398-1-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Revert "drm/i915: Propagate errors on awaiting already signaled fences" | expand |
Once we no longer rely on error propagation, I think there's a lot we can rip out. --Jason On Wed, May 19, 2021 at 5:15 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > From: Jason Ekstrand <jason@jlekstrand.net> > > This reverts commit 9e31c1fe45d555a948ff66f1f0e3fe1f83ca63f7. Ever > since that commit, we've been having issues where a hang in one client > can propagate to another. In particular, a hang in an app can propagate > to the X server which causes the whole desktop to lock up. > > Error propagation along fences sound like a good idea, but as your bug > shows, surprising consequences, since propagating errors across security > boundaries is not a good thing. > > What we do have is track the hangs on the ctx, and report information to > userspace using RESET_STATS. That's how arb_robustness works. Also, if my > understanding is still correct, the EIO from execbuf is when your context > is banned (because not recoverable or too many hangs). And in all these > cases it's up to userspace to figure out what is all impacted and should > be reported to the application, that's not on the kernel to guess and > automatically propagate. > > What's more, we're also building more features on top of ctx error > reporting with RESET_STATS ioctl: Encrypted buffers use the same, and the > userspace fence wait also relies on that mechanism. So it is the path > going forward for reporting gpu hangs and resets to userspace. > > So all together that's why I think we should just bury this idea again as > not quite the direction we want to go to, hence why I think the revert is > the right option here.Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> > > v2: Augment commit message. Also restore Jason's sob that I > accidentally lost. > > Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> (v1) > Reported-by: Marcin Slusarz <marcin.slusarz@intel.com> > Cc: <stable@vger.kernel.org> # v5.6+ > Cc: Jason Ekstrand <jason.ekstrand@intel.com> > Cc: Marcin Slusarz <marcin.slusarz@intel.com> > Cc: Jon Bloomfield <jon.bloomfield@intel.com> > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3080 > Fixes: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences") > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> > --- > drivers/gpu/drm/i915/i915_request.c | 8 ++------ > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c > index 970d8f4986bb..b796197c0772 100644 > --- a/drivers/gpu/drm/i915/i915_request.c > +++ b/drivers/gpu/drm/i915/i915_request.c > @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq, > > do { > fence = *child++; > - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { > - i915_sw_fence_set_error_once(&rq->submit, fence->error); > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) > continue; > - } > > if (fence->context == rq->fence.context) > continue; > @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence) > > do { > fence = *child++; > - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { > - i915_sw_fence_set_error_once(&rq->submit, fence->error); > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) > continue; > - } > > /* > * Requests on the same timeline are explicitly ordered, along > -- > 2.31.0 >
On Wed, May 19, 2021 at 5:06 PM Jason Ekstrand <jason@jlekstrand.net> wrote: > > Once we no longer rely on error propagation, I think there's a lot we > can rip out. I honestly did not find that much ... what did you uncover? -Daniel > > --Jason > > On Wed, May 19, 2021 at 5:15 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > > > > From: Jason Ekstrand <jason@jlekstrand.net> > > > > This reverts commit 9e31c1fe45d555a948ff66f1f0e3fe1f83ca63f7. Ever > > since that commit, we've been having issues where a hang in one client > > can propagate to another. In particular, a hang in an app can propagate > > to the X server which causes the whole desktop to lock up. > > > > Error propagation along fences sound like a good idea, but as your bug > > shows, surprising consequences, since propagating errors across security > > boundaries is not a good thing. > > > > What we do have is track the hangs on the ctx, and report information to > > userspace using RESET_STATS. That's how arb_robustness works. Also, if my > > understanding is still correct, the EIO from execbuf is when your context > > is banned (because not recoverable or too many hangs). And in all these > > cases it's up to userspace to figure out what is all impacted and should > > be reported to the application, that's not on the kernel to guess and > > automatically propagate. > > > > What's more, we're also building more features on top of ctx error > > reporting with RESET_STATS ioctl: Encrypted buffers use the same, and the > > userspace fence wait also relies on that mechanism. So it is the path > > going forward for reporting gpu hangs and resets to userspace. > > > > So all together that's why I think we should just bury this idea again as > > not quite the direction we want to go to, hence why I think the revert is > > the right option here.Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> > > > > v2: Augment commit message. Also restore Jason's sob that I > > accidentally lost. > > > > Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> (v1) > > Reported-by: Marcin Slusarz <marcin.slusarz@intel.com> > > Cc: <stable@vger.kernel.org> # v5.6+ > > Cc: Jason Ekstrand <jason.ekstrand@intel.com> > > Cc: Marcin Slusarz <marcin.slusarz@intel.com> > > Cc: Jon Bloomfield <jon.bloomfield@intel.com> > > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3080 > > Fixes: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already signaled fences") > > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> > > --- > > drivers/gpu/drm/i915/i915_request.c | 8 ++------ > > 1 file changed, 2 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c > > index 970d8f4986bb..b796197c0772 100644 > > --- a/drivers/gpu/drm/i915/i915_request.c > > +++ b/drivers/gpu/drm/i915/i915_request.c > > @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq, > > > > do { > > fence = *child++; > > - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { > > - i915_sw_fence_set_error_once(&rq->submit, fence->error); > > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) > > continue; > > - } > > > > if (fence->context == rq->fence.context) > > continue; > > @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence) > > > > do { > > fence = *child++; > > - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { > > - i915_sw_fence_set_error_once(&rq->submit, fence->error); > > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) > > continue; > > - } > > > > /* > > * Requests on the same timeline are explicitly ordered, along > > -- > > 2.31.0 > >
On May 19, 2021 12:16:15 Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > On Wed, May 19, 2021 at 5:06 PM Jason Ekstrand <jason@jlekstrand.net> wrote: >> >> Once we no longer rely on error propagation, I think there's a lot we >> can rip out. > > I honestly did not find that much ... what did you uncover? When I was digging through this earlier today, I think I convinced myself that the cmdparser is the only user of the entire fence error architecture. I may have missed something, though. --Jason > > -Daniel > >> >> --Jason >> >> On Wed, May 19, 2021 at 5:15 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote: >>> >>> From: Jason Ekstrand <jason@jlekstrand.net> >>> >>> This reverts commit 9e31c1fe45d555a948ff66f1f0e3fe1f83ca63f7. Ever >>> since that commit, we've been having issues where a hang in one client >>> can propagate to another. In particular, a hang in an app can propagate >>> to the X server which causes the whole desktop to lock up. >>> >>> Error propagation along fences sound like a good idea, but as your bug >>> shows, surprising consequences, since propagating errors across security >>> boundaries is not a good thing. >>> >>> What we do have is track the hangs on the ctx, and report information to >>> userspace using RESET_STATS. That's how arb_robustness works. Also, if my >>> understanding is still correct, the EIO from execbuf is when your context >>> is banned (because not recoverable or too many hangs). And in all these >>> cases it's up to userspace to figure out what is all impacted and should >>> be reported to the application, that's not on the kernel to guess and >>> automatically propagate. >>> >>> What's more, we're also building more features on top of ctx error >>> reporting with RESET_STATS ioctl: Encrypted buffers use the same, and the >>> userspace fence wait also relies on that mechanism. So it is the path >>> going forward for reporting gpu hangs and resets to userspace. >>> >>> So all together that's why I think we should just bury this idea again as >>> not quite the direction we want to go to, hence why I think the revert is >>> the right option here.Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> >>> >>> v2: Augment commit message. Also restore Jason's sob that I >>> accidentally lost. >>> >>> Signed-off-by: Jason Ekstrand <jason.ekstrand@intel.com> (v1) >>> Reported-by: Marcin Slusarz <marcin.slusarz@intel.com> >>> Cc: <stable@vger.kernel.org> # v5.6+ >>> Cc: Jason Ekstrand <jason.ekstrand@intel.com> >>> Cc: Marcin Slusarz <marcin.slusarz@intel.com> >>> Cc: Jon Bloomfield <jon.bloomfield@intel.com> >>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3080 >>> Fixes: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already >>> signaled fences") >>> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> >>> --- >>> drivers/gpu/drm/i915/i915_request.c | 8 ++------ >>> 1 file changed, 2 insertions(+), 6 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/i915/i915_request.c >>> b/drivers/gpu/drm/i915/i915_request.c >>> index 970d8f4986bb..b796197c0772 100644 >>> --- a/drivers/gpu/drm/i915/i915_request.c >>> +++ b/drivers/gpu/drm/i915/i915_request.c >>> @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq, >>> >>> do { >>> fence = *child++; >>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { >>> - i915_sw_fence_set_error_once(&rq->submit, >>> fence->error); >>> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) >>> continue; >>> - } >>> >>> if (fence->context == rq->fence.context) >>> continue; >>> @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request >>> *rq, struct dma_fence *fence) >>> >>> do { >>> fence = *child++; >>> - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { >>> - i915_sw_fence_set_error_once(&rq->submit, >>> fence->error); >>> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) >>> continue; >>> - } >>> >>> /* >>> * Requests on the same timeline are explicitly ordered, along >>> -- >>> 2.31.0 > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c index 970d8f4986bb..b796197c0772 100644 --- a/drivers/gpu/drm/i915/i915_request.c +++ b/drivers/gpu/drm/i915/i915_request.c @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq, do { fence = *child++; - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { - i915_sw_fence_set_error_once(&rq->submit, fence->error); + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) continue; - } if (fence->context == rq->fence.context) continue; @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence) do { fence = *child++; - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { - i915_sw_fence_set_error_once(&rq->submit, fence->error); + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) continue; - } /* * Requests on the same timeline are explicitly ordered, along