
Issue #5876: assertion failure in rbd_img_obj_callback()

Message ID 5331853D.40408@ieee.org (mailing list archive)
State New, archived

Commit Message

Alex Elder March 25, 2014, 1:31 p.m. UTC
...
>> So, could this patch be a (partial) fix?
>>
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -2123,6 +2123,7 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
>>         rbd_assert(obj_request_img_data_test(obj_request));
>>         img_request = obj_request->img_request;
>>  
>> +       spin_lock_irq(&img_request->completion_lock);
>>         dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
>>         rbd_assert(img_request != NULL);
>>         rbd_assert(img_request->obj_request_count > 0);
>> @@ -2130,7 +2131,6 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
>>         rbd_assert(which < img_request->obj_request_count);
>>         rbd_assert(which >= img_request->next_completion);
>>  
>> -       spin_lock_irq(&img_request->completion_lock);
>>         if (which != img_request->next_completion)
>>                 goto out;
> 
> 
> Yes, roughly.  I'd do the following instead.  It would be great
> to learn whether it eliminates the one form of assertion failure
> you were seeing.
> 
> 					-Alex
>


Strike that, my last patch was dead wrong.  Sorry.  Try this:



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Olivier Bonvalet March 25, 2014, 2:01 p.m. UTC | #1
On Tuesday, March 25, 2014 at 08:31 -0500, Alex Elder wrote:
> ...
> >> So, could this patch be a (partial) fix?
> >>
> > 
> > 
> > Yes, roughly.  I'd do the following instead.  It would be great
> > to learn whether it eliminates the one form of assertion failure
> > you were seeing.
> > 
> > 					-Alex
> >
> 
> 
> Strike that, my last patch was dead wrong.  Sorry.  Try this:
> 
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2128,11 +2128,11 @@ static void rbd_img_obj_callback(struct
>  	rbd_assert(img_request->obj_request_count > 0);
>  	rbd_assert(which != BAD_WHICH);
>  	rbd_assert(which < img_request->obj_request_count);
> -	rbd_assert(which >= img_request->next_completion);
> 
>  	spin_lock_irq(&img_request->completion_lock);
> -	if (which != img_request->next_completion)
> +	if (which > img_request->next_completion)
>  		goto out;
> +	rbd_assert(which == img_request->next_completion);
> 
>  	for_each_obj_request_from(img_request, obj_request) {
>  		rbd_assert(more);
> 

Thanks !

Olivier Bonvalet March 25, 2014, 5:15 p.m. UTC | #2
On Tuesday, March 25, 2014 at 08:31 -0500, Alex Elder wrote:
> ...
> >> So, could this patch be a (partial) fix?
> >>
> > 
> > 
> > Yes, roughly.  I'd do the following instead.  It would be great
> > to learn whether it eliminates the one form of assertion failure
> > you were seeing.
> > 
> > 					-Alex
> >
> 
> 
> Strike that, my last patch was dead wrong.  Sorry.  Try this:
> 
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2128,11 +2128,11 @@ static void rbd_img_obj_callback(struct
>  	rbd_assert(img_request->obj_request_count > 0);
>  	rbd_assert(which != BAD_WHICH);
>  	rbd_assert(which < img_request->obj_request_count);
> -	rbd_assert(which >= img_request->next_completion);
> 
>  	spin_lock_irq(&img_request->completion_lock);
> -	if (which != img_request->next_completion)
> +	if (which > img_request->next_completion)
>  		goto out;
> +	rbd_assert(which == img_request->next_completion);
> 
>  	for_each_obj_request_from(img_request, obj_request) {
>  		rbd_assert(more);
> 
> 
> 

Well, it just hung:

Mar 25 17:58:36 rurkh kernel: [ 4135.913079] Assertion failure in rbd_img_obj_callback() at line 2135:
Mar 25 17:58:36 rurkh kernel: [ 4135.913079] 
Mar 25 17:58:36 rurkh kernel: [ 4135.913079] 	rbd_assert(which == img_request->next_completion);
Mar 25 17:58:36 rurkh kernel: [ 4135.913079] 
Mar 25 17:58:36 rurkh kernel: [ 4135.913252] ------------[ cut here ]------------
Mar 25 17:58:36 rurkh kernel: [ 4135.913288] kernel BUG at drivers/block/rbd.c:2135!
Mar 25 17:58:36 rurkh kernel: [ 4135.913331] invalid opcode: 0000 [#1] SMP 
Mar 25 17:58:36 rurkh kernel: [ 4135.913373] Modules linked in: cbc rbd libceph xen_gntdev xt_physdev iptable_filter ip_tables x_tables xfs libcrc32c bridge loop iTCO_wdt iTCO_vendor_support gpio_ich serio_raw sb_edac edac_core i2c_i801 lpc_ich mfd_core evdev ioatdma shpchp ipmi_si ipmi_msghandler wmi ac button dm_mod hid_generic usbhid hid sg sd_mod crc_t10dif crct10dif_common isci ahci libsas libahci megaraid_sas libata scsi_transport_sas ehci_pci igb scsi_mod ehci_hcd ixgbe i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core mdio
Mar 25 17:58:36 rurkh kernel: [ 4135.913821] CPU: 0 PID: 30629 Comm: kworker/0:1 Not tainted 3.13-dae-dom0 #20
Mar 25 17:58:36 rurkh kernel: [ 4135.913863] Hardware name: Supermicro X9DRW-7TPF+/X9DRW-7TPF+, BIOS 3.0 07/24/2013
Mar 25 17:58:36 rurkh kernel: [ 4135.913931] Workqueue: ceph-msgr con_work [libceph]
Mar 25 17:58:36 rurkh kernel: [ 4135.913970] task: ffff88027374b760 ti: ffff88024933c000 task.ti: ffff88024933c000
Mar 25 17:58:36 rurkh kernel: [ 4135.914033] RIP: e030:[<ffffffffa0304b86>]  [<ffffffffa0304b86>] rbd_img_obj_callback+0x12f/0x3d0 [rbd]
Mar 25 17:58:36 rurkh kernel: [ 4135.914104] RSP: e02b:ffff88024933dce8  EFLAGS: 00010082
Mar 25 17:58:36 rurkh kernel: [ 4135.914141] RAX: 0000000000000070 RBX: ffff88024d2dcc48 RCX: 0000000000000000
Mar 25 17:58:36 rurkh kernel: [ 4135.914182] RDX: ffff88027fe0eb50 RSI: ffff88027fe0e1a8 RDI: ffff8802493300a8
Mar 25 17:58:36 rurkh kernel: [ 4135.914223] RBP: ffff88024ccc3e20 R08: 0000000000000000 R09: 0000000000000000
Mar 25 17:58:36 rurkh kernel: [ 4135.914265] R10: 0000000000000000 R11: 0000000000000098 R12: 0000000000000001
Mar 25 17:58:36 rurkh kernel: [ 4135.914306] R13: 0000000000000000 R14: ffff88027144b1d0 R15: 0000000000000000
Mar 25 17:58:36 rurkh kernel: [ 4135.914351] FS:  00007f6ec996f700(0000) GS:ffff88027fe00000(0000) knlGS:0000000000000000
Mar 25 17:58:36 rurkh kernel: [ 4135.914415] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 17:58:36 rurkh kernel: [ 4135.914453] CR2: 0000000001ff1b10 CR3: 00000002492b3000 CR4: 0000000000042660
Mar 25 17:58:36 rurkh kernel: [ 4135.914495] Stack:
Mar 25 17:58:36 rurkh kernel: [ 4135.914524]  ffff88024ccc3e5c ffff88024a48eb5d ffffffffffffffff ffff88024a48eb28
Mar 25 17:58:36 rurkh kernel: [ 4135.914610]  ffff88027144b1c8 ffff8802656cc718 0000000000000000 ffff88027144b1d0
Mar 25 17:58:36 rurkh kernel: [ 4135.914689]  0000000000000000 ffffffffa02e3595 0000000000000015 ffff8802656cc770
Mar 25 17:58:36 rurkh kernel: [ 4135.914768] Call Trace:
Mar 25 17:58:36 rurkh kernel: [ 4135.914809]  [<ffffffffa02e3595>] ? dispatch+0x3e4/0x55e [libceph]
Mar 25 17:58:36 rurkh kernel: [ 4135.914854]  [<ffffffffa02de0fc>] ? con_work+0xf6e/0x1a65 [libceph]
Mar 25 17:58:36 rurkh kernel: [ 4135.914901]  [<ffffffff81005f00>] ? xen_timer_resume+0x4f/0x4f
Mar 25 17:58:36 rurkh kernel: [ 4135.914944]  [<ffffffff81051f83>] ? mmdrop+0xd/0x1c
Mar 25 17:58:36 rurkh kernel: [ 4135.914984]  [<ffffffff8105265e>] ? finish_task_switch+0x4d/0x83
Mar 25 17:58:36 rurkh kernel: [ 4135.915029]  [<ffffffff810484d7>] ? process_one_work+0x15a/0x214
Mar 25 17:58:36 rurkh kernel: [ 4135.915072]  [<ffffffff8104895b>] ? worker_thread+0x139/0x1de
Mar 25 17:58:36 rurkh kernel: [ 4135.915113]  [<ffffffff81048822>] ? rescuer_thread+0x26e/0x26e
Mar 25 17:58:36 rurkh kernel: [ 4135.915155]  [<ffffffff8104cff6>] ? kthread+0x9e/0xa6
Mar 25 17:58:36 rurkh kernel: [ 4135.915195]  [<ffffffff8104cf58>] ? __kthread_parkme+0x55/0x55
Mar 25 17:58:36 rurkh kernel: [ 4135.915238]  [<ffffffff8137260c>] ? ret_from_fork+0x7c/0xb0
Mar 25 17:58:36 rurkh kernel: [ 4135.915279]  [<ffffffff8104cf58>] ? __kthread_parkme+0x55/0x55
Mar 25 17:58:36 rurkh kernel: [ 4135.915319] Code: 41 b5 01 48 89 44 24 08 eb 3b 48 c7 c1 2e 7c 30 a0 ba 57 08 00 00 31 c0 48 c7 c6 80 89 30 a0 48 c7 c7 1f 71 30 a0 e8 bd 35 06 e1 <0f> 0b 41 8b 45 5c ff c8 39 43 40 41 0f 92 c5 48 8b 5b 30 41 ff 
Mar 25 17:58:36 rurkh kernel: [ 4135.915701] RIP  [<ffffffffa0304b86>] rbd_img_obj_callback+0x12f/0x3d0 [rbd]
Mar 25 17:58:36 rurkh kernel: [ 4135.915749]  RSP <ffff88024933dce8>
Mar 25 17:58:36 rurkh kernel: [ 4135.916087] ---[ end trace ff823e5e2d6cd4e9 ]---




Alex Elder March 25, 2014, 5:21 p.m. UTC | #3
On 03/25/2014 12:15 PM, Olivier Bonvalet wrote:
> On Tuesday, March 25, 2014 at 08:31 -0500, Alex Elder wrote:
>> ...
>>>> So, could this patch be a (partial) fix?
>>>>
>>>
>>>
>>> Yes, roughly.  I'd do the following instead.  It would be great
>>> to learn whether it eliminates the one form of assertion failure
>>> you were seeing.
>>>
>>> 					-Alex
>>>
>>
>>
>> Strike that, my last patch was dead wrong.  Sorry.  Try this:
>>
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -2128,11 +2128,11 @@ static void rbd_img_obj_callback(struct
>>  	rbd_assert(img_request->obj_request_count > 0);
>>  	rbd_assert(which != BAD_WHICH);
>>  	rbd_assert(which < img_request->obj_request_count);
>> -	rbd_assert(which >= img_request->next_completion);
>>
>>  	spin_lock_irq(&img_request->completion_lock);
>> -	if (which != img_request->next_completion)
>> +	if (which > img_request->next_completion)
>>  		goto out;
>> +	rbd_assert(which == img_request->next_completion);
>>
>>  	for_each_obj_request_from(img_request, obj_request) {
>>  		rbd_assert(more);
>>
>>
>>
> 
> Well, it just hung:

It's great to know you can reproduce this.

Let me put together another quick patch that might supply a bit
more information when it happens.  I'll send something shortly.

					-Alex

> Mar 25 17:58:36 rurkh kernel: [ 4135.913079] Assertion failure in rbd_img_obj_callback() at line 2135:
> Mar 25 17:58:36 rurkh kernel: [ 4135.913079] 
> Mar 25 17:58:36 rurkh kernel: [ 4135.913079] 	rbd_assert(which == img_request->next_completion);
> Mar 25 17:58:36 rurkh kernel: [ 4135.913079] 
> Mar 25 17:58:36 rurkh kernel: [ 4135.913252] ------------[ cut here ]------------
> Mar 25 17:58:36 rurkh kernel: [ 4135.913288] kernel BUG at drivers/block/rbd.c:2135!
> Mar 25 17:58:36 rurkh kernel: [ 4135.913331] invalid opcode: 0000 [#1] SMP 
> Mar 25 17:58:36 rurkh kernel: [ 4135.913373] Modules linked in: cbc rbd libceph xen_gntdev xt_physdev iptable_filter ip_tables x_tables xfs libcrc32c bridge loop iTCO_wdt iTCO_vendor_support gpio_ich serio_raw sb_edac edac_core i2c_i801 lpc_ich mfd_core evdev ioatdma shpchp ipmi_si ipmi_msghandler wmi ac button dm_mod hid_generic usbhid hid sg sd_mod crc_t10dif crct10dif_common isci ahci libsas libahci megaraid_sas libata scsi_transport_sas ehci_pci igb scsi_mod ehci_hcd ixgbe i2c_algo_bit i2c_core usbcore dca ptp usb_common pps_core mdio
> Mar 25 17:58:36 rurkh kernel: [ 4135.913821] CPU: 0 PID: 30629 Comm: kworker/0:1 Not tainted 3.13-dae-dom0 #20
> Mar 25 17:58:36 rurkh kernel: [ 4135.913863] Hardware name: Supermicro X9DRW-7TPF+/X9DRW-7TPF+, BIOS 3.0 07/24/2013
> Mar 25 17:58:36 rurkh kernel: [ 4135.913931] Workqueue: ceph-msgr con_work [libceph]
> Mar 25 17:58:36 rurkh kernel: [ 4135.913970] task: ffff88027374b760 ti: ffff88024933c000 task.ti: ffff88024933c000
> Mar 25 17:58:36 rurkh kernel: [ 4135.914033] RIP: e030:[<ffffffffa0304b86>]  [<ffffffffa0304b86>] rbd_img_obj_callback+0x12f/0x3d0 [rbd]
> Mar 25 17:58:36 rurkh kernel: [ 4135.914104] RSP: e02b:ffff88024933dce8  EFLAGS: 00010082
> Mar 25 17:58:36 rurkh kernel: [ 4135.914141] RAX: 0000000000000070 RBX: ffff88024d2dcc48 RCX: 0000000000000000
> Mar 25 17:58:36 rurkh kernel: [ 4135.914182] RDX: ffff88027fe0eb50 RSI: ffff88027fe0e1a8 RDI: ffff8802493300a8
> Mar 25 17:58:36 rurkh kernel: [ 4135.914223] RBP: ffff88024ccc3e20 R08: 0000000000000000 R09: 0000000000000000
> Mar 25 17:58:36 rurkh kernel: [ 4135.914265] R10: 0000000000000000 R11: 0000000000000098 R12: 0000000000000001
> Mar 25 17:58:36 rurkh kernel: [ 4135.914306] R13: 0000000000000000 R14: ffff88027144b1d0 R15: 0000000000000000
> Mar 25 17:58:36 rurkh kernel: [ 4135.914351] FS:  00007f6ec996f700(0000) GS:ffff88027fe00000(0000) knlGS:0000000000000000
> Mar 25 17:58:36 rurkh kernel: [ 4135.914415] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> Mar 25 17:58:36 rurkh kernel: [ 4135.914453] CR2: 0000000001ff1b10 CR3: 00000002492b3000 CR4: 0000000000042660
> Mar 25 17:58:36 rurkh kernel: [ 4135.914495] Stack:
> Mar 25 17:58:36 rurkh kernel: [ 4135.914524]  ffff88024ccc3e5c ffff88024a48eb5d ffffffffffffffff ffff88024a48eb28
> Mar 25 17:58:36 rurkh kernel: [ 4135.914610]  ffff88027144b1c8 ffff8802656cc718 0000000000000000 ffff88027144b1d0
> Mar 25 17:58:36 rurkh kernel: [ 4135.914689]  0000000000000000 ffffffffa02e3595 0000000000000015 ffff8802656cc770
> Mar 25 17:58:36 rurkh kernel: [ 4135.914768] Call Trace:
> Mar 25 17:58:36 rurkh kernel: [ 4135.914809]  [<ffffffffa02e3595>] ? dispatch+0x3e4/0x55e [libceph]
> Mar 25 17:58:36 rurkh kernel: [ 4135.914854]  [<ffffffffa02de0fc>] ? con_work+0xf6e/0x1a65 [libceph]
> Mar 25 17:58:36 rurkh kernel: [ 4135.914901]  [<ffffffff81005f00>] ? xen_timer_resume+0x4f/0x4f
> Mar 25 17:58:36 rurkh kernel: [ 4135.914944]  [<ffffffff81051f83>] ? mmdrop+0xd/0x1c
> Mar 25 17:58:36 rurkh kernel: [ 4135.914984]  [<ffffffff8105265e>] ? finish_task_switch+0x4d/0x83
> Mar 25 17:58:36 rurkh kernel: [ 4135.915029]  [<ffffffff810484d7>] ? process_one_work+0x15a/0x214
> Mar 25 17:58:36 rurkh kernel: [ 4135.915072]  [<ffffffff8104895b>] ? worker_thread+0x139/0x1de
> Mar 25 17:58:36 rurkh kernel: [ 4135.915113]  [<ffffffff81048822>] ? rescuer_thread+0x26e/0x26e
> Mar 25 17:58:36 rurkh kernel: [ 4135.915155]  [<ffffffff8104cff6>] ? kthread+0x9e/0xa6
> Mar 25 17:58:36 rurkh kernel: [ 4135.915195]  [<ffffffff8104cf58>] ? __kthread_parkme+0x55/0x55
> Mar 25 17:58:36 rurkh kernel: [ 4135.915238]  [<ffffffff8137260c>] ? ret_from_fork+0x7c/0xb0
> Mar 25 17:58:36 rurkh kernel: [ 4135.915279]  [<ffffffff8104cf58>] ? __kthread_parkme+0x55/0x55
> Mar 25 17:58:36 rurkh kernel: [ 4135.915319] Code: 41 b5 01 48 89 44 24 08 eb 3b 48 c7 c1 2e 7c 30 a0 ba 57 08 00 00 31 c0 48 c7 c6 80 89 30 a0 48 c7 c7 1f 71 30 a0 e8 bd 35 06 e1 <0f> 0b 41 8b 45 5c ff c8 39 43 40 41 0f 92 c5 48 8b 5b 30 41 ff 
> Mar 25 17:58:36 rurkh kernel: [ 4135.915701] RIP  [<ffffffffa0304b86>] rbd_img_obj_callback+0x12f/0x3d0 [rbd]
> Mar 25 17:58:36 rurkh kernel: [ 4135.915749]  RSP <ffff88024933dce8>
> Mar 25 17:58:36 rurkh kernel: [ 4135.916087] ---[ end trace ff823e5e2d6cd4e9 ]---
> 
> 
> 
> 

Olivier Bonvalet March 25, 2014, 6:53 p.m. UTC | #4
On Tuesday, March 25, 2014 at 12:21 -0500, Alex Elder wrote:
> It's great to know you can reproduce this.

Yes... I understand your point of view but... it's a production
cluster ;)


Patch

--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2128,11 +2128,11 @@  static void rbd_img_obj_callback(struct
 	rbd_assert(img_request->obj_request_count > 0);
 	rbd_assert(which != BAD_WHICH);
 	rbd_assert(which < img_request->obj_request_count);
-	rbd_assert(which >= img_request->next_completion);

 	spin_lock_irq(&img_request->completion_lock);
-	if (which != img_request->next_completion)
+	if (which > img_request->next_completion)
 		goto out;
+	rbd_assert(which == img_request->next_completion);

 	for_each_obj_request_from(img_request, obj_request) {
 		rbd_assert(more);