Message ID | 1385998313-5783-1-git-send-email-mika.kuoppala@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote: > We used to lean on active_list to handle the references > to batch objects. But there are useful cases when same, > albeit simple, batch can be executing on multiple rings > concurrently. For this case the active_list reference count > handling is just not enough as batch could be freed by > ring A request retirement as it is still running on ring B. > > Fix this by doing proper batch_obj reference counting. > > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > > Notes: > This is a patch which ameliorates the > [PATCH] tests/gem_reset_stats: add close-pending-fork > > Chris wasn't happy about the refcounting as it might hide > the true problem. But I haven't been able to find the real culprit, > thus the RFC. I think I understand the bug now, and your patch looks like the correct fix. But we need to pimp the commit message. In i915_gem_reset_ring_lists we reset requests and move objects to the inactive list. Which means if the active list is the last one to hold a reference, the object will disappear. Now the problem is that we do this per-ring, and not in the order that the objects would have been retired if the gpu wouldn't have hung. E.g. if a batch is active on both ring 1&2 but was last active on ring 1, then we'd free it before we go ahead with cleaning up the requests for ring 2. So reference-counting the batch_obj pointers looks like the real fix. Can you please amend your patch with this explanation and also please add a backtrace that shows the crash? Might need to gather it with netconsole. Thanks, Daniel > --- > drivers/gpu/drm/i915/i915_gem.c | 12 +++++++----- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 40d9dcf..858538f 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -2145,13 +2145,12 @@ int __i915_add_request(struct intel_ring_buffer *ring, > request->head = request_start; > request->tail = request_ring_position; > > - /* Whilst this request exists, batch_obj will be on the > - * active_list, and so will hold the active reference. Only when this > - * request is retired will the the batch_obj be moved onto the > - * inactive_list and lose its active reference. Hence we do not need > - * to explicitly hold another reference here. > + /* Active list has one reference but that is not enough as same > + * batch_obj can be active on multiple rings > */ > request->batch_obj = obj; > + if (request->batch_obj) > + drm_gem_object_reference(&request->batch_obj->base); > > /* Hold a reference to the current context so that we can inspect > * it later in case a hangcheck error event fires. > @@ -2340,6 +2339,9 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request) > if (request->ctx) > i915_gem_context_unreference(request->ctx); > > + if (request->batch_obj) > + drm_gem_object_unreference(&request->batch_obj->base); > + > kfree(request); > } > > -- > 1.7.9.5 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote: > On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote: > > We used to lean on active_list to handle the references > > to batch objects. But there are useful cases when same, > > albeit simple, batch can be executing on multiple rings > > concurrently. For this case the active_list reference count > > handling is just not enough as batch could be freed by > > ring A request retirement as it is still running on ring B. > > > > Fix this by doing proper batch_obj reference counting. > > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > > > > Notes: > > This is a patch which ameliorates the > > [PATCH] tests/gem_reset_stats: add close-pending-fork > > > > Chris wasn't happy about the refcounting as it might hide > > the true problem. But I haven't been able to find the real culprit, > > thus the RFC. > > I think I understand the bug now, and your patch looks like the correct > fix. But we need to pimp the commit message. > > In i915_gem_reset_ring_lists we reset requests and move objects to the > inactive list. Which means if the active list is the last one to hold a > reference, the object will disappear. > > Now the problem is that we do this per-ring, and not in the order that the > objects would have been retired if the gpu wouldn't have hung. E.g. if a > batch is active on both ring 1&2 but was last active on ring 1, then we'd > free it before we go ahead with cleaning up the requests for ring 2. > > So reference-counting the batch_obj pointers looks like the real fix. No. The bug only exists in i915_gem_reset() and should not impact anywhere else. -Chris
On Wed, Dec 4, 2013 at 12:24 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote: >> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote: >> > We used to lean on active_list to handle the references >> > to batch objects. But there are useful cases when same, >> > albeit simple, batch can be executing on multiple rings >> > concurrently. For this case the active_list reference count >> > handling is just not enough as batch could be freed by >> > ring A request retirement as it is still running on ring B. >> > >> > Fix this by doing proper batch_obj reference counting. >> > >> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> >> > >> > Notes: >> > This is a patch which ameliorates the >> > [PATCH] tests/gem_reset_stats: add close-pending-fork >> > >> > Chris wasn't happy about the refcounting as it might hide >> > the true problem. But I haven't been able to find the real culprit, >> > thus the RFC. >> >> I think I understand the bug now, and your patch looks like the correct >> fix. But we need to pimp the commit message. >> >> In i915_gem_reset_ring_lists we reset requests and move objects to the >> inactive list. Which means if the active list is the last one to hold a >> reference, the object will disappear. >> >> Now the problem is that we do this per-ring, and not in the order that the >> objects would have been retired if the gpu wouldn't have hung. E.g. if a >> batch is active on both ring 1&2 but was last active on ring 1, then we'd >> free it before we go ahead with cleaning up the requests for ring 2. >> >> So reference-counting the batch_obj pointers looks like the real fix. > > No. The bug only exists in i915_gem_reset() and should not impact > anywhere else. Well fixing the bug in i915_gem_reset would be lots more work and rather fragile - we'd need to retire requests in the correct order. That level of fanciness is generally not something I like to see in less-tested code like the reset paths. Also simply holding a reference for each pointer is established best practice - atm we have a big comment explaining why it should work by holding an implicit reference through the active list. With the refcounting we'll simply trade that complexity (and the code comment explaining things) in with another tricky case. -Daniel
On Wed, Dec 04, 2013 at 01:08:56PM +0100, Daniel Vetter wrote: > On Wed, Dec 4, 2013 at 12:24 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > > On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote: > >> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote: > >> > We used to lean on active_list to handle the references > >> > to batch objects. But there are useful cases when same, > >> > albeit simple, batch can be executing on multiple rings > >> > concurrently. For this case the active_list reference count > >> > handling is just not enough as batch could be freed by > >> > ring A request retirement as it is still running on ring B. > >> > > >> > Fix this by doing proper batch_obj reference counting. > >> > > >> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > >> > > >> > Notes: > >> > This is a patch which ameliorates the > >> > [PATCH] tests/gem_reset_stats: add close-pending-fork > >> > > >> > Chris wasn't happy about the refcounting as it might hide > >> > the true problem. But I haven't been able to find the real culprit, > >> > thus the RFC. > >> > >> I think I understand the bug now, and your patch looks like the correct > >> fix. But we need to pimp the commit message. > >> > >> In i915_gem_reset_ring_lists we reset requests and move objects to the > >> inactive list. Which means if the active list is the last one to hold a > >> reference, the object will disappear. > >> > >> Now the problem is that we do this per-ring, and not in the order that the > >> objects would have been retired if the gpu wouldn't have hung. E.g. if a > >> batch is active on both ring 1&2 but was last active on ring 1, then we'd > >> free it before we go ahead with cleaning up the requests for ring 2. > >> > >> So reference-counting the batch_obj pointers looks like the real fix. > > > > No. The bug only exists in i915_gem_reset() and should not impact > > anywhere else. > > Well fixing the bug in i915_gem_reset would be lots more work and > rather fragile - we'd need to retire requests in the correct order. > That level of fanciness is generally not something I like to see in > less-tested code like the reset paths. Nope, just the operations in the right order. -Chris
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 40d9dcf..858538f 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2145,13 +2145,12 @@ int __i915_add_request(struct intel_ring_buffer *ring, request->head = request_start; request->tail = request_ring_position; - /* Whilst this request exists, batch_obj will be on the - * active_list, and so will hold the active reference. Only when this - * request is retired will the the batch_obj be moved onto the - * inactive_list and lose its active reference. Hence we do not need - * to explicitly hold another reference here. + /* Active list has one reference but that is not enough as same + * batch_obj can be active on multiple rings */ request->batch_obj = obj; + if (request->batch_obj) + drm_gem_object_reference(&request->batch_obj->base); /* Hold a reference to the current context so that we can inspect * it later in case a hangcheck error event fires. @@ -2340,6 +2339,9 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request) if (request->ctx) i915_gem_context_unreference(request->ctx); + if (request->batch_obj) + drm_gem_object_unreference(&request->batch_obj->base); + kfree(request); }
We used to lean on active_list to handle the references to batch objects. But there are useful cases when same, albeit simple, batch can be executing on multiple rings concurrently. For this case the active_list reference count handling is just not enough as batch could be freed by ring A request retirement as it is still running on ring B. Fix this by doing proper batch_obj reference counting. Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> Notes: This is a patch which ameliorates the [PATCH] tests/gem_reset_stats: add close-pending-fork Chris wasn't happy about the refcounting as it might hide the true problem. But I haven't been able to find the real culprit, thus the RFC. --- drivers/gpu/drm/i915/i915_gem.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)