
[RFC] drm/i915: reference count batch object on requests

Message ID 1385998313-5783-1-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State New, archived

Commit Message

Mika Kuoppala Dec. 2, 2013, 3:31 p.m. UTC
We used to lean on the active_list to handle references
to batch objects. But there are useful cases where the same,
albeit simple, batch can be executing on multiple rings
concurrently. In that case the active_list reference
handling alone is not enough, as the batch could be freed by
ring A's request retirement while it is still running on ring B.

Fix this by doing proper batch_obj reference counting.

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>

Notes:
    This patch ameliorates the failure exposed by
    [PATCH] tests/gem_reset_stats: add close-pending-fork
    
    Chris wasn't happy about the refcounting as it might hide
    the true problem. But I haven't been able to find the real culprit,
    thus the RFC.
---
 drivers/gpu/drm/i915/i915_gem.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Daniel Vetter Dec. 3, 2013, 5:10 p.m. UTC | #1
On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
> We used to lean on active_list to handle the references
> to batch objects. But there are useful cases when same,
> albeit simple, batch can be executing on multiple rings
> concurrently. For this case the active_list reference count
> handling is just not enough as batch could be freed by
> ring A request retirement as it is still running on ring B.
> 
> Fix this by doing proper batch_obj reference counting.
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> 
> Notes:
>     This is a patch which ameliorates the
>     [PATCH] tests/gem_reset_stats: add close-pending-fork
>     
>     Chris wasn't happy about the refcounting as it might hide
>     the true problem. But I haven't been able to find the real culprit,
>     thus the RFC.

I think I understand the bug now, and your patch looks like the correct
fix. But we need to pimp the commit message.

In i915_gem_reset_ring_lists we reset requests and move objects to the
inactive list. Which means if the active list is the last one to hold a
reference, the object will disappear.

Now the problem is that we do this per-ring, and not in the order in
which the objects would have been retired had the GPU not hung. E.g. if a
batch is active on both rings 1 and 2 but was last active on ring 1, we'd
free it before we go on to clean up the requests for ring 2.

So reference-counting the batch_obj pointers looks like the real fix.

Can you please amend your patch with this explanation and also please add
a backtrace that shows the crash? Might need to gather it with netconsole.

Thanks, Daniel

> ---
>  drivers/gpu/drm/i915/i915_gem.c |   12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 40d9dcf..858538f 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2145,13 +2145,12 @@ int __i915_add_request(struct intel_ring_buffer *ring,
>  	request->head = request_start;
>  	request->tail = request_ring_position;
>  
> -	/* Whilst this request exists, batch_obj will be on the
> -	 * active_list, and so will hold the active reference. Only when this
> -	 * request is retired will the the batch_obj be moved onto the
> -	 * inactive_list and lose its active reference. Hence we do not need
> -	 * to explicitly hold another reference here.
> +	/* The active list holds a reference, but that is not enough, as the
> +	 * same batch_obj can be active on multiple rings.
>  	 */
>  	request->batch_obj = obj;
> +	if (request->batch_obj)
> +		drm_gem_object_reference(&request->batch_obj->base);
>  
>  	/* Hold a reference to the current context so that we can inspect
>  	 * it later in case a hangcheck error event fires.
> @@ -2340,6 +2339,9 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
>  	if (request->ctx)
>  		i915_gem_context_unreference(request->ctx);
>  
> +	if (request->batch_obj)
> +		drm_gem_object_unreference(&request->batch_obj->base);
> +
>  	kfree(request);
>  }
>  
> -- 
> 1.7.9.5
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Dec. 4, 2013, 11:24 a.m. UTC | #2
On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote:
> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
> > We used to lean on active_list to handle the references
> > to batch objects. But there are useful cases when same,
> > albeit simple, batch can be executing on multiple rings
> > concurrently. For this case the active_list reference count
> > handling is just not enough as batch could be freed by
> > ring A request retirement as it is still running on ring B.
> > 
> > Fix this by doing proper batch_obj reference counting.
> > 
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> > 
> > Notes:
> >     This is a patch which ameliorates the
> >     [PATCH] tests/gem_reset_stats: add close-pending-fork
> >     
> >     Chris wasn't happy about the refcounting as it might hide
> >     the true problem. But I haven't been able to find the real culprit,
> >     thus the RFC.
> 
> I think I understand the bug now, and your patch looks like the correct
> fix. But we need to pimp the commit message.
> 
> In i915_gem_reset_ring_lists we reset requests and move objects to the
> inactive list. Which means if the active list is the last one to hold a
> reference, the object will disappear.
> 
> Now the problem is that we do this per-ring, and not in the order that the
> objects would have been retired if the gpu wouldn't have hung. E.g. if a
> batch is active on both ring 1&2 but was last active on ring 1, then we'd
> free it before we go ahead with cleaning up the requests for ring 2.
> 
> So reference-counting the batch_obj pointers looks like the real fix.

No. The bug only exists in i915_gem_reset() and should not impact
anywhere else.
-Chris
Daniel Vetter Dec. 4, 2013, 12:08 p.m. UTC | #3
On Wed, Dec 4, 2013 at 12:24 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote:
>> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
>> > We used to lean on active_list to handle the references
>> > to batch objects. But there are useful cases when same,
>> > albeit simple, batch can be executing on multiple rings
>> > concurrently. For this case the active_list reference count
>> > handling is just not enough as batch could be freed by
>> > ring A request retirement as it is still running on ring B.
>> >
>> > Fix this by doing proper batch_obj reference counting.
>> >
>> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
>> >
>> > Notes:
>> >     This is a patch which ameliorates the
>> >     [PATCH] tests/gem_reset_stats: add close-pending-fork
>> >
>> >     Chris wasn't happy about the refcounting as it might hide
>> >     the true problem. But I haven't been able to find the real culprit,
>> >     thus the RFC.
>>
>> I think I understand the bug now, and your patch looks like the correct
>> fix. But we need to pimp the commit message.
>>
>> In i915_gem_reset_ring_lists we reset requests and move objects to the
>> inactive list. Which means if the active list is the last one to hold a
>> reference, the object will disappear.
>>
>> Now the problem is that we do this per-ring, and not in the order that the
>> objects would have been retired if the gpu wouldn't have hung. E.g. if a
>> batch is active on both ring 1&2 but was last active on ring 1, then we'd
>> free it before we go ahead with cleaning up the requests for ring 2.
>>
>> So reference-counting the batch_obj pointers looks like the real fix.
>
> No. The bug only exists in i915_gem_reset() and should not impact
> anywhere else.

Well fixing the bug in i915_gem_reset would be lots more work and
rather fragile - we'd need to retire requests in the correct order.
That level of fanciness is generally not something I like to see in
less-tested code like the reset paths.

Also, simply holding a reference for each pointer is established best
practice - atm we have a big comment explaining why things should work
out by holding an implicit reference through the active list. With the
refcounting we simply trade that complexity (and the code comment
explaining it) for one explicit reference per pointer.
-Daniel
Chris Wilson Dec. 4, 2013, 12:11 p.m. UTC | #4
On Wed, Dec 04, 2013 at 01:08:56PM +0100, Daniel Vetter wrote:
> On Wed, Dec 4, 2013 at 12:24 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Tue, Dec 03, 2013 at 06:10:05PM +0100, Daniel Vetter wrote:
> >> On Mon, Dec 02, 2013 at 05:31:53PM +0200, Mika Kuoppala wrote:
> >> > We used to lean on active_list to handle the references
> >> > to batch objects. But there are useful cases when same,
> >> > albeit simple, batch can be executing on multiple rings
> >> > concurrently. For this case the active_list reference count
> >> > handling is just not enough as batch could be freed by
> >> > ring A request retirement as it is still running on ring B.
> >> >
> >> > Fix this by doing proper batch_obj reference counting.
> >> >
> >> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> >> >
> >> > Notes:
> >> >     This is a patch which ameliorates the
> >> >     [PATCH] tests/gem_reset_stats: add close-pending-fork
> >> >
> >> >     Chris wasn't happy about the refcounting as it might hide
> >> >     the true problem. But I haven't been able to find the real culprit,
> >> >     thus the RFC.
> >>
> >> I think I understand the bug now, and your patch looks like the correct
> >> fix. But we need to pimp the commit message.
> >>
> >> In i915_gem_reset_ring_lists we reset requests and move objects to the
> >> inactive list. Which means if the active list is the last one to hold a
> >> reference, the object will disappear.
> >>
> >> Now the problem is that we do this per-ring, and not in the order that the
> >> objects would have been retired if the gpu wouldn't have hung. E.g. if a
> >> batch is active on both ring 1&2 but was last active on ring 1, then we'd
> >> free it before we go ahead with cleaning up the requests for ring 2.
> >>
> >> So reference-counting the batch_obj pointers looks like the real fix.
> >
> > No. The bug only exists in i915_gem_reset() and should not impact
> > anywhere else.
> 
> Well fixing the bug in i915_gem_reset would be lots more work and
> rather fragile - we'd need to retire requests in the correct order.
> That level of fanciness is generally not something I like to see in
> less-tested code like the reset paths.

Nope, just the operations in the right order.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 40d9dcf..858538f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2145,13 +2145,12 @@  int __i915_add_request(struct intel_ring_buffer *ring,
 	request->head = request_start;
 	request->tail = request_ring_position;
 
-	/* Whilst this request exists, batch_obj will be on the
-	 * active_list, and so will hold the active reference. Only when this
-	 * request is retired will the the batch_obj be moved onto the
-	 * inactive_list and lose its active reference. Hence we do not need
-	 * to explicitly hold another reference here.
+	/* The active list holds a reference, but that is not enough, as the
+	 * same batch_obj can be active on multiple rings.
 	 */
 	request->batch_obj = obj;
+	if (request->batch_obj)
+		drm_gem_object_reference(&request->batch_obj->base);
 
 	/* Hold a reference to the current context so that we can inspect
 	 * it later in case a hangcheck error event fires.
@@ -2340,6 +2339,9 @@  static void i915_gem_free_request(struct drm_i915_gem_request *request)
 	if (request->ctx)
 		i915_gem_context_unreference(request->ctx);
 
+	if (request->batch_obj)
+		drm_gem_object_unreference(&request->batch_obj->base);
+
 	kfree(request);
 }