diff mbox

Issue #5876 : assertion failure in rbd_img_obj_callback()

Message ID 53321896.1080606@ieee.org (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Elder March 26, 2014, midnight UTC
On 03/25/2014 06:04 PM, Olivier Bonvalet wrote:
> Le mardi 25 mars 2014 à 17:46 -0500, Alex Elder a écrit :
>> On 03/25/2014 05:17 PM, Olivier Bonvalet wrote:
>>>
>>> I now have this one very often (here 5 minutes after the host boot) :
>>
>> I am fairly sure this indicates a use-after-free scenario,
>> likely caused by something getting deleted before every
>> user was done with it.
>>
>> I believe Ilya is done for the night; I'm going to spend some
>> time looking at this to try to determine the cause.  If you
>> are willing I'd love to have you try whatever fix I come up
>> with.  I'd rather find a fix than just collect more information,
>> but I may need to get more, we'll see.
>>
>> Thank you for all your reports, they help a lot.
>>
>> 					-Alex
> 
> Ok. I can apply some patch to help debug that yes.
> I will try to reproduce on a different host, without customer data.
> 
> But I think I will stop here for the night too.
> 
> Thanks for your time,
> Olivier

Here's something that will provide a few more pieces of
information.  If you're around and you're able to try it
out it might confirm something had likely been destroyed.

I'll keep sending stuff as I come up with it (even though
I realize you may not be around until morning).

					-Alex



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Olivier Bonvalet March 26, 2014, 1:33 a.m. UTC | #1
Le mardi 25 mars 2014 à 19:00 -0500, Alex Elder a écrit :
> On 03/25/2014 06:04 PM, Olivier Bonvalet wrote:
> > Le mardi 25 mars 2014 à 17:46 -0500, Alex Elder a écrit :
> >> On 03/25/2014 05:17 PM, Olivier Bonvalet wrote:
> >>>
> >>> I now have this one very often (here 5 minutes after the host boot) :
> >>
> >> I am fairly sure this indicates a use-after-free scenario,
> >> likely caused by something getting deleted before every
> >> user was done with it.
> >>
> >> I believe Ilya is done for the night; I'm going to spend some
> >> time looking at this to try to determine the cause.  If you
> >> are willing I'd love to have you try whatever fix I come up
> >> with.  I'd rather find a fix than just collect more information,
> >> but I may need to get more, we'll see.
> >>
> >> Thank you for all your reports, they help a lot.
> >>
> >> 					-Alex
> > 
> > Ok. I can apply some patch to help debug that yes.
> > I will try to reproduce on a different host, without customer data.
> > 
> > But I think I will stop here for the night too.
> > 
> > Thanks for your time,
> > Olivier
> 
> Here's something that will provide a few more pieces of
> information.  If you're around and you're able to try it
> out it might confirm something had likely been destroyed.
> 
> I'll keep sending stuff as I come up with it (even though
> I realize you may not be around until morning).
> 
> 					-Alex
> 
> 
> Index: b/drivers/block/rbd.c
> ===================================================================
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2132,6 +2132,35 @@ static void rbd_img_obj_callback(struct
>  	spin_lock_irq(&img_request->completion_lock);
>  	if (which > img_request->next_completion)
>  		goto out;
> +	if (which != img_request->next_completion) {
> +		printk("%s: bad image object request information:\n", __func__);
> +		printk("obj_request %p\n", obj_request);
> +		printk("    ->object_name <%s>\n", obj_request->object_name);
> +		printk("    ->offset %llu\n", obj_request->offset);
> +		printk("    ->length %llu\n", obj_request->length);
> +		printk("    ->type 0x%x\n", (u32)obj_request->type);
> +		printk("    ->flags 0x%lx\n", obj_request->flags);
> +		printk("    ->img_request %p\n", obj_request->img_request);
> +		printk("    ->which %u\n", obj_request->which);
> +		printk("    ->xferred %llu\n", obj_request->xferred);
> +		printk("    ->result %d\n", obj_request->result);
> +		printk("    ->kref %d\n", atomic_read(&obj_request->kref));
> +
> +		printk("img_request %p\n", img_request);
> +		printk("    ->snap 0x%016llx\n", img_request->snap_id);
> +		printk("    ->offset %llu\n", img_request->offset);
> +		printk("    ->length %llu\n", img_request->length);
> +		printk("    ->flags 0x%lx\n", img_request->flags);
> +		printk("    ->obj_request_count %u\n",
> +			img_request->obj_request_count);
> +		printk("    ->next_completion %u\n",
> +			img_request->next_completion);
> +		printk("    ->xferred %llu\n", img_request->xferred);
> +		printk("    ->result %d\n", img_request->result);
> +		printk("    ->obj_requests head %p\n",
> +			img_request->obj_requests.next);
> +		printk("    ->kref %d\n", atomic_read(&img_request->kref));
> +	}
>  	rbd_assert(which == img_request->next_completion);
> 
>  	for_each_obj_request_from(img_request, obj_request) {
> 
> 

Thanks for your patch.

This is an output of a crash case :

Mar 26 02:31:18 alg kernel: [  965.366895] rbd_img_obj_callback: bad image object request information:
Mar 26 02:31:18 alg kernel: [  965.366905] obj_request ffff880224bc9528
Mar 26 02:31:18 alg kernel: [  965.366909]     ->object_name <(null)>
Mar 26 02:31:18 alg kernel: [  965.366913]     ->offset 0
Mar 26 02:31:18 alg kernel: [  965.366917]     ->length 4096
Mar 26 02:31:18 alg kernel: [  965.366921]     ->type 0x1
Mar 26 02:31:18 alg kernel: [  965.366925]     ->flags 0x3
Mar 26 02:31:18 alg kernel: [  965.366929]     ->img_request           (null)
Mar 26 02:31:18 alg kernel: [  965.366933]     ->which 4294967295
Mar 26 02:31:18 alg kernel: [  965.366936]     ->xferred 4096
Mar 26 02:31:18 alg kernel: [  965.366940]     ->result 0
Mar 26 02:31:18 alg kernel: [  965.366943]     ->kref 0
Mar 26 02:31:18 alg kernel: [  965.366947] img_request ffff880222f4fb50
Mar 26 02:31:18 alg kernel: [  965.366950]     ->snap 0xfffffffffffffffe
Mar 26 02:31:18 alg kernel: [  965.366954]     ->offset 1417662464
Mar 26 02:31:18 alg kernel: [  965.366957]     ->length 16384
Mar 26 02:31:18 alg kernel: [  965.366960]     ->flags 0x0
Mar 26 02:31:18 alg kernel: [  965.366963]     ->obj_request_count 0
Mar 26 02:31:18 alg kernel: [  965.366966]     ->next_completion 2
Mar 26 02:31:18 alg kernel: [  965.366969]     ->xferred 16384
Mar 26 02:31:18 alg kernel: [  965.366973]     ->result 0
Mar 26 02:31:18 alg kernel: [  965.366976]     ->obj_requests head ffff880222f4fbb0
Mar 26 02:31:18 alg kernel: [  965.366980]     ->kref 0
Mar 26 02:31:18 alg kernel: [  965.366985] 
Mar 26 02:31:18 alg kernel: [  965.366985] Assertion failure in rbd_img_obj_callback() at line 2165:
Mar 26 02:31:18 alg kernel: [  965.366985] 
Mar 26 02:31:18 alg kernel: [  965.366985] 	rbd_assert(which == img_request->next_completion);
Mar 26 02:31:18 alg kernel: [  965.366985] 
Mar 26 02:31:18 alg kernel: [  965.367185] ------------[ cut here ]------------
Mar 26 02:31:18 alg kernel: [  965.367241] kernel BUG at drivers/block/rbd.c:2165!


I hope it can help.


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olivier Bonvalet March 26, 2014, 1:50 a.m. UTC | #2
Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit :
> Thanks for your patch.
> 
> This is an output of a crash case :
> 
> Mar 26 02:31:18 alg kernel: [  965.366895] rbd_img_obj_callback: bad image object request information:
> Mar 26 02:31:18 alg kernel: [  965.366905] obj_request ffff880224bc9528
> Mar 26 02:31:18 alg kernel: [  965.366909]     ->object_name <(null)>
> Mar 26 02:31:18 alg kernel: [  965.366913]     ->offset 0
> Mar 26 02:31:18 alg kernel: [  965.366917]     ->length 4096
> Mar 26 02:31:18 alg kernel: [  965.366921]     ->type 0x1
> Mar 26 02:31:18 alg kernel: [  965.366925]     ->flags 0x3
> Mar 26 02:31:18 alg kernel: [  965.366929]     ->img_request           (null)
> Mar 26 02:31:18 alg kernel: [  965.366933]     ->which 4294967295
> Mar 26 02:31:18 alg kernel: [  965.366936]     ->xferred 4096
> Mar 26 02:31:18 alg kernel: [  965.366940]     ->result 0
> Mar 26 02:31:18 alg kernel: [  965.366943]     ->kref 0
> Mar 26 02:31:18 alg kernel: [  965.366947] img_request ffff880222f4fb50
> Mar 26 02:31:18 alg kernel: [  965.366950]     ->snap 0xfffffffffffffffe
> Mar 26 02:31:18 alg kernel: [  965.366954]     ->offset 1417662464
> Mar 26 02:31:18 alg kernel: [  965.366957]     ->length 16384
> Mar 26 02:31:18 alg kernel: [  965.366960]     ->flags 0x0
> Mar 26 02:31:18 alg kernel: [  965.366963]     ->obj_request_count 0
> Mar 26 02:31:18 alg kernel: [  965.366966]     ->next_completion 2
> Mar 26 02:31:18 alg kernel: [  965.366969]     ->xferred 16384
> Mar 26 02:31:18 alg kernel: [  965.366973]     ->result 0
> Mar 26 02:31:18 alg kernel: [  965.366976]     ->obj_requests head ffff880222f4fbb0
> Mar 26 02:31:18 alg kernel: [  965.366980]     ->kref 0
> Mar 26 02:31:18 alg kernel: [  965.366985] 
> Mar 26 02:31:18 alg kernel: [  965.366985] Assertion failure in rbd_img_obj_callback() at line 2165:
> Mar 26 02:31:18 alg kernel: [  965.366985] 
> Mar 26 02:31:18 alg kernel: [  965.366985] 	rbd_assert(which == img_request->next_completion);
> Mar 26 02:31:18 alg kernel: [  965.366985] 
> Mar 26 02:31:18 alg kernel: [  965.367185] ------------[ cut here ]------------
> Mar 26 02:31:18 alg kernel: [  965.367241] kernel BUG at drivers/block/rbd.c:2165!
> 
> 
> I hope it can help.
> 
> 
> --

and a second one, very similar :

Mar 26 02:48:27 alg kernel: [  681.167833] rbd_img_obj_callback: bad image object request information:
Mar 26 02:48:27 alg kernel: [  681.167836] obj_request ffff88022e1e2828
Mar 26 02:48:27 alg kernel: [  681.167837]     ->object_name <(null)>
Mar 26 02:48:27 alg kernel: [  681.167838]     ->offset 0
Mar 26 02:48:27 alg kernel: [  681.167839]     ->length 4096
Mar 26 02:48:27 alg kernel: [  681.167840]     ->type 0x1
Mar 26 02:48:27 alg kernel: [  681.167840]     ->flags 0x3
Mar 26 02:48:27 alg kernel: [  681.167841]     ->img_request           (null)
Mar 26 02:48:27 alg kernel: [  681.167842]     ->which 4294967295
Mar 26 02:48:27 alg kernel: [  681.167843]     ->xferred 4096
Mar 26 02:48:27 alg kernel: [  681.167844]     ->result 0
Mar 26 02:48:27 alg kernel: [  681.167844]     ->kref 0
Mar 26 02:48:27 alg kernel: [  681.167845] img_request ffff88021f555f10
Mar 26 02:48:27 alg kernel: [  681.167846]     ->snap 0xfffffffffffffffe
Mar 26 02:48:27 alg kernel: [  681.167847]     ->offset 28072464384
Mar 26 02:48:27 alg kernel: [  681.167847]     ->length 16384
Mar 26 02:48:27 alg kernel: [  681.167848]     ->flags 0x0
Mar 26 02:48:27 alg kernel: [  681.167849]     ->obj_request_count 0
Mar 26 02:48:27 alg kernel: [  681.167850]     ->next_completion 2
Mar 26 02:48:27 alg kernel: [  681.167850]     ->xferred 16384
Mar 26 02:48:27 alg kernel: [  681.167851]     ->result 0
Mar 26 02:48:27 alg kernel: [  681.167852]     ->obj_requests head ffff88021f555f70
Mar 26 02:48:27 alg kernel: [  681.167853]     ->kref 0
Mar 26 02:48:27 alg kernel: [  681.167854] 
Mar 26 02:48:27 alg kernel: [  681.167854] Assertion failure in rbd_img_obj_callback() at line 2165:
Mar 26 02:48:27 alg kernel: [  681.167854] 
Mar 26 02:48:27 alg kernel: [  681.167854] 	rbd_assert(which == img_request->next_completion);
Mar 26 02:48:27 alg kernel: [  681.167854] 
Mar 26 02:48:27 alg kernel: [  681.168117] ------------[ cut here ]------------
Mar 26 02:48:27 alg kernel: [  681.168211] kernel BUG at drivers/block/rbd.c:2165!


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Elder March 26, 2014, 1:55 a.m. UTC | #3
On 03/25/2014 08:50 PM, Olivier Bonvalet wrote:
> Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit :
>> Thanks for your patch.
>>
>> This is an output of a crash case :
>>
>> Mar 26 02:31:18 alg kernel: [  965.366895] rbd_img_obj_callback: bad image object request information:
>> Mar 26 02:31:18 alg kernel: [  965.366905] obj_request ffff880224bc9528
>> Mar 26 02:31:18 alg kernel: [  965.366909]     ->object_name <(null)>
>> Mar 26 02:31:18 alg kernel: [  965.366913]     ->offset 0
>> Mar 26 02:31:18 alg kernel: [  965.366917]     ->length 4096
>> Mar 26 02:31:18 alg kernel: [  965.366921]     ->type 0x1
>> Mar 26 02:31:18 alg kernel: [  965.366925]     ->flags 0x3
>> Mar 26 02:31:18 alg kernel: [  965.366929]     ->img_request           (null)
>> Mar 26 02:31:18 alg kernel: [  965.366933]     ->which 4294967295
>> Mar 26 02:31:18 alg kernel: [  965.366936]     ->xferred 4096
>> Mar 26 02:31:18 alg kernel: [  965.366940]     ->result 0
>> Mar 26 02:31:18 alg kernel: [  965.366943]     ->kref 0
>> Mar 26 02:31:18 alg kernel: [  965.366947] img_request ffff880222f4fb50
>> Mar 26 02:31:18 alg kernel: [  965.366950]     ->snap 0xfffffffffffffffe
>> Mar 26 02:31:18 alg kernel: [  965.366954]     ->offset 1417662464
>> Mar 26 02:31:18 alg kernel: [  965.366957]     ->length 16384
>> Mar 26 02:31:18 alg kernel: [  965.366960]     ->flags 0x0
>> Mar 26 02:31:18 alg kernel: [  965.366963]     ->obj_request_count 0
>> Mar 26 02:31:18 alg kernel: [  965.366966]     ->next_completion 2
>> Mar 26 02:31:18 alg kernel: [  965.366969]     ->xferred 16384
>> Mar 26 02:31:18 alg kernel: [  965.366973]     ->result 0
>> Mar 26 02:31:18 alg kernel: [  965.366976]     ->obj_requests head ffff880222f4fbb0
>> Mar 26 02:31:18 alg kernel: [  965.366980]     ->kref 0
>> Mar 26 02:31:18 alg kernel: [  965.366985] 
>> Mar 26 02:31:18 alg kernel: [  965.366985] Assertion failure in rbd_img_obj_callback() at line 2165:
>> Mar 26 02:31:18 alg kernel: [  965.366985] 
>> Mar 26 02:31:18 alg kernel: [  965.366985] 	rbd_assert(which == img_request->next_completion);
>> Mar 26 02:31:18 alg kernel: [  965.366985] 
>> Mar 26 02:31:18 alg kernel: [  965.367185] ------------[ cut here ]------------
>> Mar 26 02:31:18 alg kernel: [  965.367241] kernel BUG at drivers/block/rbd.c:2165!
>>
>>
>> I hope it can help.
>>
>>


Thanks for sending these.

> 
> and a second one, very similar :
> 
> Mar 26 02:48:27 alg kernel: [  681.167833] rbd_img_obj_callback: bad image object request information:
> Mar 26 02:48:27 alg kernel: [  681.167836] obj_request ffff88022e1e2828
> Mar 26 02:48:27 alg kernel: [  681.167837]     ->object_name <(null)>
> Mar 26 02:48:27 alg kernel: [  681.167838]     ->offset 0
> Mar 26 02:48:27 alg kernel: [  681.167839]     ->length 4096
> Mar 26 02:48:27 alg kernel: [  681.167840]     ->type 0x1
> Mar 26 02:48:27 alg kernel: [  681.167840]     ->flags 0x3
> Mar 26 02:48:27 alg kernel: [  681.167841]     ->img_request           (null)
> Mar 26 02:48:27 alg kernel: [  681.167842]     ->which 4294967295
> Mar 26 02:48:27 alg kernel: [  681.167843]     ->xferred 4096
> Mar 26 02:48:27 alg kernel: [  681.167844]     ->result 0
> Mar 26 02:48:27 alg kernel: [  681.167844]     ->kref 0

This confirms the reference count of the object request has gone
to zero.  This object request has already been destroyed (yet
we're handling a callback for it).

> Mar 26 02:48:27 alg kernel: [  681.167845] img_request ffff88021f555f10
> Mar 26 02:48:27 alg kernel: [  681.167846]     ->snap 0xfffffffffffffffe
> Mar 26 02:48:27 alg kernel: [  681.167847]     ->offset 28072464384
> Mar 26 02:48:27 alg kernel: [  681.167847]     ->length 16384
> Mar 26 02:48:27 alg kernel: [  681.167848]     ->flags 0x0
> Mar 26 02:48:27 alg kernel: [  681.167849]     ->obj_request_count 0
> Mar 26 02:48:27 alg kernel: [  681.167850]     ->next_completion 2
> Mar 26 02:48:27 alg kernel: [  681.167850]     ->xferred 16384
> Mar 26 02:48:27 alg kernel: [  681.167851]     ->result 0
> Mar 26 02:48:27 alg kernel: [  681.167852]     ->obj_requests head ffff88021f555f70

The object request list is empty.

> Mar 26 02:48:27 alg kernel: [  681.167853]     ->kref 0

This confirms the reference count of the image request has gone
to zero.  So not only has the object request already completed,
the image request has as well.

I'm almost done composing a very large e-mail with some detailed
analysis.  No answer quite yet, but I am certain that we're
getting duplicate callbacks on the second object request of
an image request that spans two objects.  That should help
narrow the search for the root cause.

					-Alex

> Mar 26 02:48:27 alg kernel: [  681.167854] 
> Mar 26 02:48:27 alg kernel: [  681.167854] Assertion failure in rbd_img_obj_callback() at line 2165:
> Mar 26 02:48:27 alg kernel: [  681.167854] 
> Mar 26 02:48:27 alg kernel: [  681.167854] 	rbd_assert(which == img_request->next_completion);
> Mar 26 02:48:27 alg kernel: [  681.167854] 
> Mar 26 02:48:27 alg kernel: [  681.168117] ------------[ cut here ]------------
> Mar 26 02:48:27 alg kernel: [  681.168211] kernel BUG at drivers/block/rbd.c:2165!
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olivier Bonvalet March 26, 2014, 2:35 a.m. UTC | #4
Le mercredi 26 mars 2014 à 02:50 +0100, Olivier Bonvalet a écrit :
> Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit :
> > Thanks for your patch.
> > 
> > This is an output of a crash case :
> > 
> > Mar 26 02:31:18 alg kernel: [  965.366895] rbd_img_obj_callback: bad image object request information:
> > Mar 26 02:31:18 alg kernel: [  965.366905] obj_request ffff880224bc9528
> > Mar 26 02:31:18 alg kernel: [  965.366909]     ->object_name <(null)>
> > Mar 26 02:31:18 alg kernel: [  965.366913]     ->offset 0
> > Mar 26 02:31:18 alg kernel: [  965.366917]     ->length 4096
> > Mar 26 02:31:18 alg kernel: [  965.366921]     ->type 0x1
> > Mar 26 02:31:18 alg kernel: [  965.366925]     ->flags 0x3
> > Mar 26 02:31:18 alg kernel: [  965.366929]     ->img_request           (null)
> > Mar 26 02:31:18 alg kernel: [  965.366933]     ->which 4294967295
> > Mar 26 02:31:18 alg kernel: [  965.366936]     ->xferred 4096
> > Mar 26 02:31:18 alg kernel: [  965.366940]     ->result 0
> > Mar 26 02:31:18 alg kernel: [  965.366943]     ->kref 0
> > Mar 26 02:31:18 alg kernel: [  965.366947] img_request ffff880222f4fb50
> > Mar 26 02:31:18 alg kernel: [  965.366950]     ->snap 0xfffffffffffffffe
> > Mar 26 02:31:18 alg kernel: [  965.366954]     ->offset 1417662464
> > Mar 26 02:31:18 alg kernel: [  965.366957]     ->length 16384
> > Mar 26 02:31:18 alg kernel: [  965.366960]     ->flags 0x0
> > Mar 26 02:31:18 alg kernel: [  965.366963]     ->obj_request_count 0
> > Mar 26 02:31:18 alg kernel: [  965.366966]     ->next_completion 2
> > Mar 26 02:31:18 alg kernel: [  965.366969]     ->xferred 16384
> > Mar 26 02:31:18 alg kernel: [  965.366973]     ->result 0
> > Mar 26 02:31:18 alg kernel: [  965.366976]     ->obj_requests head ffff880222f4fbb0
> > Mar 26 02:31:18 alg kernel: [  965.366980]     ->kref 0
> > Mar 26 02:31:18 alg kernel: [  965.366985] 
> > Mar 26 02:31:18 alg kernel: [  965.366985] Assertion failure in rbd_img_obj_callback() at line 2165:
> > Mar 26 02:31:18 alg kernel: [  965.366985] 
> > Mar 26 02:31:18 alg kernel: [  965.366985] 	rbd_assert(which == img_request->next_completion);
> > Mar 26 02:31:18 alg kernel: [  965.366985] 
> > Mar 26 02:31:18 alg kernel: [  965.367185] ------------[ cut here ]------------
> > Mar 26 02:31:18 alg kernel: [  965.367241] kernel BUG at drivers/block/rbd.c:2165!
> > 
> > 
> > I hope it can help.
> > 
> > 
> > --
> 
> and a second one, very similar :
> 
> Mar 26 02:48:27 alg kernel: [  681.167833] rbd_img_obj_callback: bad image object request information:
> Mar 26 02:48:27 alg kernel: [  681.167836] obj_request ffff88022e1e2828
> Mar 26 02:48:27 alg kernel: [  681.167837]     ->object_name <(null)>
> Mar 26 02:48:27 alg kernel: [  681.167838]     ->offset 0
> Mar 26 02:48:27 alg kernel: [  681.167839]     ->length 4096
> Mar 26 02:48:27 alg kernel: [  681.167840]     ->type 0x1
> Mar 26 02:48:27 alg kernel: [  681.167840]     ->flags 0x3
> Mar 26 02:48:27 alg kernel: [  681.167841]     ->img_request           (null)
> Mar 26 02:48:27 alg kernel: [  681.167842]     ->which 4294967295
> Mar 26 02:48:27 alg kernel: [  681.167843]     ->xferred 4096
> Mar 26 02:48:27 alg kernel: [  681.167844]     ->result 0
> Mar 26 02:48:27 alg kernel: [  681.167844]     ->kref 0
> Mar 26 02:48:27 alg kernel: [  681.167845] img_request ffff88021f555f10
> Mar 26 02:48:27 alg kernel: [  681.167846]     ->snap 0xfffffffffffffffe
> Mar 26 02:48:27 alg kernel: [  681.167847]     ->offset 28072464384
> Mar 26 02:48:27 alg kernel: [  681.167847]     ->length 16384
> Mar 26 02:48:27 alg kernel: [  681.167848]     ->flags 0x0
> Mar 26 02:48:27 alg kernel: [  681.167849]     ->obj_request_count 0
> Mar 26 02:48:27 alg kernel: [  681.167850]     ->next_completion 2
> Mar 26 02:48:27 alg kernel: [  681.167850]     ->xferred 16384
> Mar 26 02:48:27 alg kernel: [  681.167851]     ->result 0
> Mar 26 02:48:27 alg kernel: [  681.167852]     ->obj_requests head ffff88021f555f70
> Mar 26 02:48:27 alg kernel: [  681.167853]     ->kref 0
> Mar 26 02:48:27 alg kernel: [  681.167854] 
> Mar 26 02:48:27 alg kernel: [  681.167854] Assertion failure in rbd_img_obj_callback() at line 2165:
> Mar 26 02:48:27 alg kernel: [  681.167854] 
> Mar 26 02:48:27 alg kernel: [  681.167854] 	rbd_assert(which == img_request->next_completion);
> Mar 26 02:48:27 alg kernel: [  681.167854] 
> Mar 26 02:48:27 alg kernel: [  681.168117] ------------[ cut here ]------------
> Mar 26 02:48:27 alg kernel: [  681.168211] kernel BUG at drivers/block/rbd.c:2165!
> 
> 
> --

And a last one. As you said, it looks like to be same problem, each
time :

Mar 26 03:32:53 murmillia kernel: [  533.506391] rbd_img_obj_callback: bad image object request information:
Mar 26 03:32:53 murmillia kernel: [  533.506395] obj_request ffff88024fe73da8
Mar 26 03:32:53 murmillia kernel: [  533.506396]     ->object_name <(null)>
Mar 26 03:32:53 murmillia kernel: [  533.506397]     ->offset 0
Mar 26 03:32:53 murmillia kernel: [  533.506398]     ->length 24576
Mar 26 03:32:53 murmillia kernel: [  533.506399]     ->type 0x1
Mar 26 03:32:53 murmillia kernel: [  533.506400]     ->flags 0x3
Mar 26 03:32:53 murmillia kernel: [  533.506400]     ->img_request           (null)
Mar 26 03:32:53 murmillia kernel: [  533.506401]     ->which 4294967295
Mar 26 03:32:53 murmillia kernel: [  533.506402]     ->xferred 24576
Mar 26 03:32:53 murmillia kernel: [  533.506403]     ->result 0
Mar 26 03:32:53 murmillia kernel: [  533.506404]     ->kref 0
Mar 26 03:32:53 murmillia kernel: [  533.506405] img_request ffff880256f0cad8
Mar 26 03:32:53 murmillia kernel: [  533.506405]     ->snap 0xfffffffffffffffe
Mar 26 03:32:53 murmillia kernel: [  533.506406]     ->offset 67637329920
Mar 26 03:32:53 murmillia kernel: [  533.506407]     ->length 40960
Mar 26 03:32:53 murmillia kernel: [  533.506408]     ->flags 0x0
Mar 26 03:32:53 murmillia kernel: [  533.506409]     ->obj_request_count 0
Mar 26 03:32:53 murmillia kernel: [  533.506410]     ->next_completion 2
Mar 26 03:32:53 murmillia kernel: [  533.506410]     ->xferred 40960
Mar 26 03:32:53 murmillia kernel: [  533.506411]     ->result 0
Mar 26 03:32:53 murmillia kernel: [  533.506412]     ->obj_requests head ffff880256f0cb38
Mar 26 03:32:53 murmillia kernel: [  533.506413]     ->kref 0
Mar 26 03:32:53 murmillia kernel: [  533.506414] 
Mar 26 03:32:53 murmillia kernel: [  533.506414] Assertion failure in rbd_img_obj_callback() at line 2165:
Mar 26 03:32:53 murmillia kernel: [  533.506414] 
Mar 26 03:32:53 murmillia kernel: [  533.506414] 	rbd_assert(which == img_request->next_completion);
Mar 26 03:32:53 murmillia kernel: [  533.506414] 
Mar 26 03:32:53 murmillia kernel: [  533.506620] ------------[ cut here ]------------


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olivier Bonvalet March 26, 2014, 2:40 a.m. UTC | #5
Le mardi 25 mars 2014 à 20:55 -0500, Alex Elder a écrit :
> On 03/25/2014 08:50 PM, Olivier Bonvalet wrote:
> > Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit :
> >> Thanks for your patch.
> >>
> >> This is an output of a crash case :
> >>
> >> Mar 26 02:31:18 alg kernel: [  965.366895] rbd_img_obj_callback: bad image object request information:
> >> Mar 26 02:31:18 alg kernel: [  965.366905] obj_request ffff880224bc9528
> >> Mar 26 02:31:18 alg kernel: [  965.366909]     ->object_name <(null)>
> >> Mar 26 02:31:18 alg kernel: [  965.366913]     ->offset 0
> >> Mar 26 02:31:18 alg kernel: [  965.366917]     ->length 4096
> >> Mar 26 02:31:18 alg kernel: [  965.366921]     ->type 0x1
> >> Mar 26 02:31:18 alg kernel: [  965.366925]     ->flags 0x3
> >> Mar 26 02:31:18 alg kernel: [  965.366929]     ->img_request           (null)
> >> Mar 26 02:31:18 alg kernel: [  965.366933]     ->which 4294967295
> >> Mar 26 02:31:18 alg kernel: [  965.366936]     ->xferred 4096
> >> Mar 26 02:31:18 alg kernel: [  965.366940]     ->result 0
> >> Mar 26 02:31:18 alg kernel: [  965.366943]     ->kref 0
> >> Mar 26 02:31:18 alg kernel: [  965.366947] img_request ffff880222f4fb50
> >> Mar 26 02:31:18 alg kernel: [  965.366950]     ->snap 0xfffffffffffffffe
> >> Mar 26 02:31:18 alg kernel: [  965.366954]     ->offset 1417662464
> >> Mar 26 02:31:18 alg kernel: [  965.366957]     ->length 16384
> >> Mar 26 02:31:18 alg kernel: [  965.366960]     ->flags 0x0
> >> Mar 26 02:31:18 alg kernel: [  965.366963]     ->obj_request_count 0
> >> Mar 26 02:31:18 alg kernel: [  965.366966]     ->next_completion 2
> >> Mar 26 02:31:18 alg kernel: [  965.366969]     ->xferred 16384
> >> Mar 26 02:31:18 alg kernel: [  965.366973]     ->result 0
> >> Mar 26 02:31:18 alg kernel: [  965.366976]     ->obj_requests head ffff880222f4fbb0
> >> Mar 26 02:31:18 alg kernel: [  965.366980]     ->kref 0
> >> Mar 26 02:31:18 alg kernel: [  965.366985] 
> >> Mar 26 02:31:18 alg kernel: [  965.366985] Assertion failure in rbd_img_obj_callback() at line 2165:
> >> Mar 26 02:31:18 alg kernel: [  965.366985] 
> >> Mar 26 02:31:18 alg kernel: [  965.366985] 	rbd_assert(which == img_request->next_completion);
> >> Mar 26 02:31:18 alg kernel: [  965.366985] 
> >> Mar 26 02:31:18 alg kernel: [  965.367185] ------------[ cut here ]------------
> >> Mar 26 02:31:18 alg kernel: [  965.367241] kernel BUG at drivers/block/rbd.c:2165!
> >>
> >>
> >> I hope it can help.
> >>
> >>
> 
> 
> Thanks for sending these.
> 
> > 
> > and a second one, very similar :
> > 
> > Mar 26 02:48:27 alg kernel: [  681.167833] rbd_img_obj_callback: bad image object request information:
> > Mar 26 02:48:27 alg kernel: [  681.167836] obj_request ffff88022e1e2828
> > Mar 26 02:48:27 alg kernel: [  681.167837]     ->object_name <(null)>
> > Mar 26 02:48:27 alg kernel: [  681.167838]     ->offset 0
> > Mar 26 02:48:27 alg kernel: [  681.167839]     ->length 4096
> > Mar 26 02:48:27 alg kernel: [  681.167840]     ->type 0x1
> > Mar 26 02:48:27 alg kernel: [  681.167840]     ->flags 0x3
> > Mar 26 02:48:27 alg kernel: [  681.167841]     ->img_request           (null)
> > Mar 26 02:48:27 alg kernel: [  681.167842]     ->which 4294967295
> > Mar 26 02:48:27 alg kernel: [  681.167843]     ->xferred 4096
> > Mar 26 02:48:27 alg kernel: [  681.167844]     ->result 0
> > Mar 26 02:48:27 alg kernel: [  681.167844]     ->kref 0
> 
> This confirms the reference count of the object request has gone
> to zero.  This object request has already been destroyed (yet
> we're handling a callback for it).
> 
> > Mar 26 02:48:27 alg kernel: [  681.167845] img_request ffff88021f555f10
> > Mar 26 02:48:27 alg kernel: [  681.167846]     ->snap 0xfffffffffffffffe
> > Mar 26 02:48:27 alg kernel: [  681.167847]     ->offset 28072464384
> > Mar 26 02:48:27 alg kernel: [  681.167847]     ->length 16384
> > Mar 26 02:48:27 alg kernel: [  681.167848]     ->flags 0x0
> > Mar 26 02:48:27 alg kernel: [  681.167849]     ->obj_request_count 0
> > Mar 26 02:48:27 alg kernel: [  681.167850]     ->next_completion 2
> > Mar 26 02:48:27 alg kernel: [  681.167850]     ->xferred 16384
> > Mar 26 02:48:27 alg kernel: [  681.167851]     ->result 0
> > Mar 26 02:48:27 alg kernel: [  681.167852]     ->obj_requests head ffff88021f555f70
> 
> The object request list is empty.
> 
> > Mar 26 02:48:27 alg kernel: [  681.167853]     ->kref 0
> 
> This confirms the reference count of the image request has gone
> to zero.  So not only has the object request already completed,
> the image request has as well.
> 
> I'm almost done composing a very large e-mail with some detailed
> analysis.  No answer quite yet, but I am certain that we're
> getting duplicate callbacks on the second object request of
> an image request that spans two objects.  That should help
> narrow the search for the root cause.
> 
> 					-Alex

Thanks again to took time to analyze that problem.

All my RBD images have daily snapshots, can this bug be related to
snapshots ?

Maybe it's a stupid question, but is there a workaround that I could use
to reduce that problem in production, until a proper fix is found ?


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Elder March 26, 2014, 2:42 a.m. UTC | #6
On 03/25/2014 09:40 PM, Olivier Bonvalet wrote:
> Thanks again to took time to analyze that problem.
> 
> All my RBD images have daily snapshots, can this bug be related to
> snapshots ?
> 
> Maybe it's a stupid question, but is there a workaround that I could use
> to reduce that problem in production, until a proper fix is found ?

I don't believe this has anything to do with snapshots.

And I wish I knew a workaround, but unfortunately I
don't know of one.  Once I understand the problem (I'm
getting close) I might have one, but at that point I'll
probably have a fix.

I'll try to find a fix as soon as I can.

					-Alex

PS  I thought you said you were going to stop for the night!

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olivier Bonvalet March 26, 2014, 2:45 a.m. UTC | #7
Le mardi 25 mars 2014 à 21:42 -0500, Alex Elder a écrit :
> PS  I thought you said you were going to stop for the night!

Yes, I would love ! But my phone doesn't stop ring about ceph crash :D

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: b/drivers/block/rbd.c
===================================================================
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2132,6 +2132,35 @@  static void rbd_img_obj_callback(struct
 	spin_lock_irq(&img_request->completion_lock);
 	if (which > img_request->next_completion)
 		goto out;
+	if (which != img_request->next_completion) {
+		printk("%s: bad image object request information:\n", __func__);
+		printk("obj_request %p\n", obj_request);
+		printk("    ->object_name <%s>\n", obj_request->object_name);
+		printk("    ->offset %llu\n", obj_request->offset);
+		printk("    ->length %llu\n", obj_request->length);
+		printk("    ->type 0x%x\n", (u32)obj_request->type);
+		printk("    ->flags 0x%lx\n", obj_request->flags);
+		printk("    ->img_request %p\n", obj_request->img_request);
+		printk("    ->which %u\n", obj_request->which);
+		printk("    ->xferred %llu\n", obj_request->xferred);
+		printk("    ->result %d\n", obj_request->result);
+		printk("    ->kref %d\n", atomic_read(&obj_request->kref));
+
+		printk("img_request %p\n", img_request);
+		printk("    ->snap 0x%016llx\n", img_request->snap_id);
+		printk("    ->offset %llu\n", img_request->offset);
+		printk("    ->length %llu\n", img_request->length);
+		printk("    ->flags 0x%lx\n", img_request->flags);
+		printk("    ->obj_request_count %u\n",
+			img_request->obj_request_count);
+		printk("    ->next_completion %u\n",
+			img_request->next_completion);
+		printk("    ->xferred %llu\n", img_request->xferred);
+		printk("    ->result %d\n", img_request->result);
+		printk("    ->obj_requests head %p\n",
+			img_request->obj_requests.next);
+		printk("    ->kref %d\n", atomic_read(&img_request->kref));
+	}
 	rbd_assert(which == img_request->next_completion);

 	for_each_obj_request_from(img_request, obj_request) {