Message ID | 53321896.1080606@ieee.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Le mardi 25 mars 2014 à 19:00 -0500, Alex Elder a écrit : > On 03/25/2014 06:04 PM, Olivier Bonvalet wrote: > > Le mardi 25 mars 2014 à 17:46 -0500, Alex Elder a écrit : > >> On 03/25/2014 05:17 PM, Olivier Bonvalet wrote: > >>> > >>> I now have this one very often (here 5 minutes after the host boot) : > >> > >> I am fairly sure this indicates a use-after-free scenario, > >> likely caused by something getting deleted before every > >> user was done with it. > >> > >> I believe Ilya is done for the night; I'm going to spend some > >> time looking at this to try to determine the cause. If you > >> are willing I'd love to have you try whatever fix I come up > >> with. I'd rather find a fix than just collect more information, > >> but I may need to get more, we'll see. > >> > >> Thank you for all your reports, they help a lot. > >> > >> -Alex > > > > Ok. I can apply some patch to help debug that yes. > > I will try to reproduce on a different host, without customer data. > > > > But I think I will stop here for the night too. > > > > Thanks for your time, > > Olivier > > Here's something that will provide a few more pieces of > information. If you're around and you're able to try it > out it might confirm something had likely been destroyed. > > I'll keep sending stuff as I come up with it (even though > I realize you may not be around until morning). > > -Alex > > > Index: b/drivers/block/rbd.c > =================================================================== > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -2132,6 +2132,35 @@ static void rbd_img_obj_callback(struct > spin_lock_irq(&img_request->completion_lock); > if (which > img_request->next_completion) > goto out; > + if (which != img_request->next_completion) { > + printk("%s: bad image object request information:\n", __func__); > + printk("obj_request %p\n", obj_request); > + printk(" ->object_name <%s>\n", obj_request->object_name); > + printk(" ->offset %llu\n", obj_request->offset); > + printk(" ->length %llu\n", obj_request->length); > + printk(" ->type 0x%x\n", (u32)obj_request->type); > + printk(" ->flags 0x%lx\n", obj_request->flags); > + printk(" ->img_request %p\n", obj_request->img_request); > + printk(" ->which %u\n", obj_request->which); > + printk(" ->xferred %llu\n", obj_request->xferred); > + printk(" ->result %d\n", obj_request->result); > + printk(" ->kref %d\n", atomic_read(&obj_request->kref)); > + > + printk("img_request %p\n", img_request); > + printk(" ->snap 0x%016llx\n", img_request->snap_id); > + printk(" ->offset %llu\n", img_request->offset); > + printk(" ->length %llu\n", img_request->length); > + printk(" ->flags 0x%lx\n", img_request->flags); > + printk(" ->obj_request_count %u\n", > + img_request->obj_request_count); > + printk(" ->next_completion %u\n", > + img_request->next_completion); > + printk(" ->xferred %llu\n", img_request->xferred); > + printk(" ->result %d\n", img_request->result); > + printk(" ->obj_requests head %p\n", > + img_request->obj_requests.next); > + printk(" ->kref %d\n", atomic_read(&img_request->kref)); > + } > rbd_assert(which == img_request->next_completion); > > for_each_obj_request_from(img_request, obj_request) { > > Thanks for your patch. This is an output of a crash case : Mar 26 02:31:18 alg kernel: [ 965.366895] rbd_img_obj_callback: bad image object request information: Mar 26 02:31:18 alg kernel: [ 965.366905] obj_request ffff880224bc9528 Mar 26 02:31:18 alg kernel: [ 965.366909] ->object_name <(null)> Mar 26 02:31:18 alg kernel: [ 965.366913] ->offset 0 Mar 26 02:31:18 alg kernel: [ 965.366917] ->length 4096 Mar 26 02:31:18 alg kernel: [ 965.366921] ->type 0x1 Mar 26 02:31:18 alg kernel: [ 965.366925] ->flags 0x3 Mar 26 02:31:18 alg kernel: [ 965.366929] ->img_request (null) Mar 26 02:31:18 alg kernel: [ 965.366933] ->which 4294967295 Mar 26 02:31:18 alg kernel: [ 965.366936] ->xferred 4096 Mar 26 02:31:18 alg kernel: [ 965.366940] ->result 0 Mar 26 02:31:18 alg kernel: [ 965.366943] ->kref 0 Mar 26 02:31:18 alg kernel: [ 965.366947] img_request ffff880222f4fb50 Mar 26 02:31:18 alg kernel: [ 965.366950] ->snap 0xfffffffffffffffe Mar 26 02:31:18 alg kernel: [ 965.366954] ->offset 1417662464 Mar 26 02:31:18 alg kernel: [ 965.366957] ->length 16384 Mar 26 02:31:18 alg kernel: [ 965.366960] ->flags 0x0 Mar 26 02:31:18 alg kernel: [ 965.366963] ->obj_request_count 0 Mar 26 02:31:18 alg kernel: [ 965.366966] ->next_completion 2 Mar 26 02:31:18 alg kernel: [ 965.366969] ->xferred 16384 Mar 26 02:31:18 alg kernel: [ 965.366973] ->result 0 Mar 26 02:31:18 alg kernel: [ 965.366976] ->obj_requests head ffff880222f4fbb0 Mar 26 02:31:18 alg kernel: [ 965.366980] ->kref 0 Mar 26 02:31:18 alg kernel: [ 965.366985] Mar 26 02:31:18 alg kernel: [ 965.366985] Assertion failure in rbd_img_obj_callback() at line 2165: Mar 26 02:31:18 alg kernel: [ 965.366985] Mar 26 02:31:18 alg kernel: [ 965.366985] rbd_assert(which == img_request->next_completion); Mar 26 02:31:18 alg kernel: [ 965.366985] Mar 26 02:31:18 alg kernel: [ 965.367185] ------------[ cut here ]------------ Mar 26 02:31:18 alg kernel: [ 965.367241] kernel BUG at drivers/block/rbd.c:2165! I hope it can help. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit : > Thanks for your patch. > > This is an output of a crash case : > > Mar 26 02:31:18 alg kernel: [ 965.366895] rbd_img_obj_callback: bad image object request information: > Mar 26 02:31:18 alg kernel: [ 965.366905] obj_request ffff880224bc9528 > Mar 26 02:31:18 alg kernel: [ 965.366909] ->object_name <(null)> > Mar 26 02:31:18 alg kernel: [ 965.366913] ->offset 0 > Mar 26 02:31:18 alg kernel: [ 965.366917] ->length 4096 > Mar 26 02:31:18 alg kernel: [ 965.366921] ->type 0x1 > Mar 26 02:31:18 alg kernel: [ 965.366925] ->flags 0x3 > Mar 26 02:31:18 alg kernel: [ 965.366929] ->img_request (null) > Mar 26 02:31:18 alg kernel: [ 965.366933] ->which 4294967295 > Mar 26 02:31:18 alg kernel: [ 965.366936] ->xferred 4096 > Mar 26 02:31:18 alg kernel: [ 965.366940] ->result 0 > Mar 26 02:31:18 alg kernel: [ 965.366943] ->kref 0 > Mar 26 02:31:18 alg kernel: [ 965.366947] img_request ffff880222f4fb50 > Mar 26 02:31:18 alg kernel: [ 965.366950] ->snap 0xfffffffffffffffe > Mar 26 02:31:18 alg kernel: [ 965.366954] ->offset 1417662464 > Mar 26 02:31:18 alg kernel: [ 965.366957] ->length 16384 > Mar 26 02:31:18 alg kernel: [ 965.366960] ->flags 0x0 > Mar 26 02:31:18 alg kernel: [ 965.366963] ->obj_request_count 0 > Mar 26 02:31:18 alg kernel: [ 965.366966] ->next_completion 2 > Mar 26 02:31:18 alg kernel: [ 965.366969] ->xferred 16384 > Mar 26 02:31:18 alg kernel: [ 965.366973] ->result 0 > Mar 26 02:31:18 alg kernel: [ 965.366976] ->obj_requests head ffff880222f4fbb0 > Mar 26 02:31:18 alg kernel: [ 965.366980] ->kref 0 > Mar 26 02:31:18 alg kernel: [ 965.366985] > Mar 26 02:31:18 alg kernel: [ 965.366985] Assertion failure in rbd_img_obj_callback() at line 2165: > Mar 26 02:31:18 alg kernel: [ 965.366985] > Mar 26 02:31:18 alg kernel: [ 965.366985] rbd_assert(which == img_request->next_completion); > Mar 26 02:31:18 alg kernel: [ 965.366985] > Mar 26 02:31:18 alg kernel: [ 965.367185] ------------[ cut here ]------------ > Mar 26 02:31:18 alg kernel: [ 965.367241] kernel BUG at drivers/block/rbd.c:2165! > > > I hope it can help. > > > -- and a second one, very similar : Mar 26 02:48:27 alg kernel: [ 681.167833] rbd_img_obj_callback: bad image object request information: Mar 26 02:48:27 alg kernel: [ 681.167836] obj_request ffff88022e1e2828 Mar 26 02:48:27 alg kernel: [ 681.167837] ->object_name <(null)> Mar 26 02:48:27 alg kernel: [ 681.167838] ->offset 0 Mar 26 02:48:27 alg kernel: [ 681.167839] ->length 4096 Mar 26 02:48:27 alg kernel: [ 681.167840] ->type 0x1 Mar 26 02:48:27 alg kernel: [ 681.167840] ->flags 0x3 Mar 26 02:48:27 alg kernel: [ 681.167841] ->img_request (null) Mar 26 02:48:27 alg kernel: [ 681.167842] ->which 4294967295 Mar 26 02:48:27 alg kernel: [ 681.167843] ->xferred 4096 Mar 26 02:48:27 alg kernel: [ 681.167844] ->result 0 Mar 26 02:48:27 alg kernel: [ 681.167844] ->kref 0 Mar 26 02:48:27 alg kernel: [ 681.167845] img_request ffff88021f555f10 Mar 26 02:48:27 alg kernel: [ 681.167846] ->snap 0xfffffffffffffffe Mar 26 02:48:27 alg kernel: [ 681.167847] ->offset 28072464384 Mar 26 02:48:27 alg kernel: [ 681.167847] ->length 16384 Mar 26 02:48:27 alg kernel: [ 681.167848] ->flags 0x0 Mar 26 02:48:27 alg kernel: [ 681.167849] ->obj_request_count 0 Mar 26 02:48:27 alg kernel: [ 681.167850] ->next_completion 2 Mar 26 02:48:27 alg kernel: [ 681.167850] ->xferred 16384 Mar 26 02:48:27 alg kernel: [ 681.167851] ->result 0 Mar 26 02:48:27 alg kernel: [ 681.167852] ->obj_requests head ffff88021f555f70 Mar 26 02:48:27 alg kernel: [ 681.167853] ->kref 0 Mar 26 02:48:27 alg kernel: [ 681.167854] Mar 26 02:48:27 alg kernel: [ 681.167854] Assertion failure in rbd_img_obj_callback() at line 2165: Mar 26 02:48:27 alg kernel: [ 681.167854] Mar 26 02:48:27 alg kernel: [ 681.167854] rbd_assert(which == img_request->next_completion); Mar 26 02:48:27 alg kernel: [ 681.167854] Mar 26 02:48:27 alg kernel: [ 681.168117] ------------[ cut here ]------------ Mar 26 02:48:27 alg kernel: [ 681.168211] kernel BUG at drivers/block/rbd.c:2165! -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/25/2014 08:50 PM, Olivier Bonvalet wrote: > Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit : >> Thanks for your patch. >> >> This is an output of a crash case : >> >> Mar 26 02:31:18 alg kernel: [ 965.366895] rbd_img_obj_callback: bad image object request information: >> Mar 26 02:31:18 alg kernel: [ 965.366905] obj_request ffff880224bc9528 >> Mar 26 02:31:18 alg kernel: [ 965.366909] ->object_name <(null)> >> Mar 26 02:31:18 alg kernel: [ 965.366913] ->offset 0 >> Mar 26 02:31:18 alg kernel: [ 965.366917] ->length 4096 >> Mar 26 02:31:18 alg kernel: [ 965.366921] ->type 0x1 >> Mar 26 02:31:18 alg kernel: [ 965.366925] ->flags 0x3 >> Mar 26 02:31:18 alg kernel: [ 965.366929] ->img_request (null) >> Mar 26 02:31:18 alg kernel: [ 965.366933] ->which 4294967295 >> Mar 26 02:31:18 alg kernel: [ 965.366936] ->xferred 4096 >> Mar 26 02:31:18 alg kernel: [ 965.366940] ->result 0 >> Mar 26 02:31:18 alg kernel: [ 965.366943] ->kref 0 >> Mar 26 02:31:18 alg kernel: [ 965.366947] img_request ffff880222f4fb50 >> Mar 26 02:31:18 alg kernel: [ 965.366950] ->snap 0xfffffffffffffffe >> Mar 26 02:31:18 alg kernel: [ 965.366954] ->offset 1417662464 >> Mar 26 02:31:18 alg kernel: [ 965.366957] ->length 16384 >> Mar 26 02:31:18 alg kernel: [ 965.366960] ->flags 0x0 >> Mar 26 02:31:18 alg kernel: [ 965.366963] ->obj_request_count 0 >> Mar 26 02:31:18 alg kernel: [ 965.366966] ->next_completion 2 >> Mar 26 02:31:18 alg kernel: [ 965.366969] ->xferred 16384 >> Mar 26 02:31:18 alg kernel: [ 965.366973] ->result 0 >> Mar 26 02:31:18 alg kernel: [ 965.366976] ->obj_requests head ffff880222f4fbb0 >> Mar 26 02:31:18 alg kernel: [ 965.366980] ->kref 0 >> Mar 26 02:31:18 alg kernel: [ 965.366985] >> Mar 26 02:31:18 alg kernel: [ 965.366985] Assertion failure in rbd_img_obj_callback() at line 2165: >> Mar 26 02:31:18 alg kernel: [ 965.366985] >> Mar 26 02:31:18 alg kernel: [ 965.366985] rbd_assert(which == img_request->next_completion); >> Mar 26 02:31:18 alg kernel: [ 965.366985] >> Mar 26 02:31:18 alg kernel: [ 965.367185] ------------[ cut here ]------------ >> Mar 26 02:31:18 alg kernel: [ 965.367241] kernel BUG at drivers/block/rbd.c:2165! >> >> >> I hope it can help. >> >> Thanks for sending these. > > and a second one, very similar : > > Mar 26 02:48:27 alg kernel: [ 681.167833] rbd_img_obj_callback: bad image object request information: > Mar 26 02:48:27 alg kernel: [ 681.167836] obj_request ffff88022e1e2828 > Mar 26 02:48:27 alg kernel: [ 681.167837] ->object_name <(null)> > Mar 26 02:48:27 alg kernel: [ 681.167838] ->offset 0 > Mar 26 02:48:27 alg kernel: [ 681.167839] ->length 4096 > Mar 26 02:48:27 alg kernel: [ 681.167840] ->type 0x1 > Mar 26 02:48:27 alg kernel: [ 681.167840] ->flags 0x3 > Mar 26 02:48:27 alg kernel: [ 681.167841] ->img_request (null) > Mar 26 02:48:27 alg kernel: [ 681.167842] ->which 4294967295 > Mar 26 02:48:27 alg kernel: [ 681.167843] ->xferred 4096 > Mar 26 02:48:27 alg kernel: [ 681.167844] ->result 0 > Mar 26 02:48:27 alg kernel: [ 681.167844] ->kref 0 This confirms the reference count of the object request has gone to zero. This object request has already been destroyed (yet we're handling a callback for it). > Mar 26 02:48:27 alg kernel: [ 681.167845] img_request ffff88021f555f10 > Mar 26 02:48:27 alg kernel: [ 681.167846] ->snap 0xfffffffffffffffe > Mar 26 02:48:27 alg kernel: [ 681.167847] ->offset 28072464384 > Mar 26 02:48:27 alg kernel: [ 681.167847] ->length 16384 > Mar 26 02:48:27 alg kernel: [ 681.167848] ->flags 0x0 > Mar 26 02:48:27 alg kernel: [ 681.167849] ->obj_request_count 0 > Mar 26 02:48:27 alg kernel: [ 681.167850] ->next_completion 2 > Mar 26 02:48:27 alg kernel: [ 681.167850] ->xferred 16384 > Mar 26 02:48:27 alg kernel: [ 681.167851] ->result 0 > Mar 26 02:48:27 alg kernel: [ 681.167852] ->obj_requests head ffff88021f555f70 The object request list is empty. > Mar 26 02:48:27 alg kernel: [ 681.167853] ->kref 0 This confirms the reference count of the image request has gone to zero. So not only has the object request already completed, the image request has as well. I'm almost done composing a very large e-mail with some detailed analysis. No answer quite yet, but I am certain that we're getting duplicate callbacks on the second object request of an image request that spans two objects. That should help narrow the search for the root cause. -Alex > Mar 26 02:48:27 alg kernel: [ 681.167854] > Mar 26 02:48:27 alg kernel: [ 681.167854] Assertion failure in rbd_img_obj_callback() at line 2165: > Mar 26 02:48:27 alg kernel: [ 681.167854] > Mar 26 02:48:27 alg kernel: [ 681.167854] rbd_assert(which == img_request->next_completion); > Mar 26 02:48:27 alg kernel: [ 681.167854] > Mar 26 02:48:27 alg kernel: [ 681.168117] ------------[ cut here ]------------ > Mar 26 02:48:27 alg kernel: [ 681.168211] kernel BUG at drivers/block/rbd.c:2165! > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mercredi 26 mars 2014 à 02:50 +0100, Olivier Bonvalet a écrit : > Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit : > > Thanks for your patch. > > > > This is an output of a crash case : > > > > Mar 26 02:31:18 alg kernel: [ 965.366895] rbd_img_obj_callback: bad image object request information: > > Mar 26 02:31:18 alg kernel: [ 965.366905] obj_request ffff880224bc9528 > > Mar 26 02:31:18 alg kernel: [ 965.366909] ->object_name <(null)> > > Mar 26 02:31:18 alg kernel: [ 965.366913] ->offset 0 > > Mar 26 02:31:18 alg kernel: [ 965.366917] ->length 4096 > > Mar 26 02:31:18 alg kernel: [ 965.366921] ->type 0x1 > > Mar 26 02:31:18 alg kernel: [ 965.366925] ->flags 0x3 > > Mar 26 02:31:18 alg kernel: [ 965.366929] ->img_request (null) > > Mar 26 02:31:18 alg kernel: [ 965.366933] ->which 4294967295 > > Mar 26 02:31:18 alg kernel: [ 965.366936] ->xferred 4096 > > Mar 26 02:31:18 alg kernel: [ 965.366940] ->result 0 > > Mar 26 02:31:18 alg kernel: [ 965.366943] ->kref 0 > > Mar 26 02:31:18 alg kernel: [ 965.366947] img_request ffff880222f4fb50 > > Mar 26 02:31:18 alg kernel: [ 965.366950] ->snap 0xfffffffffffffffe > > Mar 26 02:31:18 alg kernel: [ 965.366954] ->offset 1417662464 > > Mar 26 02:31:18 alg kernel: [ 965.366957] ->length 16384 > > Mar 26 02:31:18 alg kernel: [ 965.366960] ->flags 0x0 > > Mar 26 02:31:18 alg kernel: [ 965.366963] ->obj_request_count 0 > > Mar 26 02:31:18 alg kernel: [ 965.366966] ->next_completion 2 > > Mar 26 02:31:18 alg kernel: [ 965.366969] ->xferred 16384 > > Mar 26 02:31:18 alg kernel: [ 965.366973] ->result 0 > > Mar 26 02:31:18 alg kernel: [ 965.366976] ->obj_requests head ffff880222f4fbb0 > > Mar 26 02:31:18 alg kernel: [ 965.366980] ->kref 0 > > Mar 26 02:31:18 alg kernel: [ 965.366985] > > Mar 26 02:31:18 alg kernel: [ 965.366985] Assertion failure in rbd_img_obj_callback() at line 2165: > > Mar 26 02:31:18 alg kernel: [ 965.366985] > > Mar 26 02:31:18 alg kernel: [ 965.366985] rbd_assert(which == img_request->next_completion); > > Mar 26 02:31:18 alg kernel: [ 965.366985] > > Mar 26 02:31:18 alg kernel: [ 965.367185] ------------[ cut here ]------------ > > Mar 26 02:31:18 alg kernel: [ 965.367241] kernel BUG at drivers/block/rbd.c:2165! > > > > > > I hope it can help. > > > > > > -- > > and a second one, very similar : > > Mar 26 02:48:27 alg kernel: [ 681.167833] rbd_img_obj_callback: bad image object request information: > Mar 26 02:48:27 alg kernel: [ 681.167836] obj_request ffff88022e1e2828 > Mar 26 02:48:27 alg kernel: [ 681.167837] ->object_name <(null)> > Mar 26 02:48:27 alg kernel: [ 681.167838] ->offset 0 > Mar 26 02:48:27 alg kernel: [ 681.167839] ->length 4096 > Mar 26 02:48:27 alg kernel: [ 681.167840] ->type 0x1 > Mar 26 02:48:27 alg kernel: [ 681.167840] ->flags 0x3 > Mar 26 02:48:27 alg kernel: [ 681.167841] ->img_request (null) > Mar 26 02:48:27 alg kernel: [ 681.167842] ->which 4294967295 > Mar 26 02:48:27 alg kernel: [ 681.167843] ->xferred 4096 > Mar 26 02:48:27 alg kernel: [ 681.167844] ->result 0 > Mar 26 02:48:27 alg kernel: [ 681.167844] ->kref 0 > Mar 26 02:48:27 alg kernel: [ 681.167845] img_request ffff88021f555f10 > Mar 26 02:48:27 alg kernel: [ 681.167846] ->snap 0xfffffffffffffffe > Mar 26 02:48:27 alg kernel: [ 681.167847] ->offset 28072464384 > Mar 26 02:48:27 alg kernel: [ 681.167847] ->length 16384 > Mar 26 02:48:27 alg kernel: [ 681.167848] ->flags 0x0 > Mar 26 02:48:27 alg kernel: [ 681.167849] ->obj_request_count 0 > Mar 26 02:48:27 alg kernel: [ 681.167850] ->next_completion 2 > Mar 26 02:48:27 alg kernel: [ 681.167850] ->xferred 16384 > Mar 26 02:48:27 alg kernel: [ 681.167851] ->result 0 > Mar 26 02:48:27 alg kernel: [ 681.167852] ->obj_requests head ffff88021f555f70 > Mar 26 02:48:27 alg kernel: [ 681.167853] ->kref 0 > Mar 26 02:48:27 alg kernel: [ 681.167854] > Mar 26 02:48:27 alg kernel: [ 681.167854] Assertion failure in rbd_img_obj_callback() at line 2165: > Mar 26 02:48:27 alg kernel: [ 681.167854] > Mar 26 02:48:27 alg kernel: [ 681.167854] rbd_assert(which == img_request->next_completion); > Mar 26 02:48:27 alg kernel: [ 681.167854] > Mar 26 02:48:27 alg kernel: [ 681.168117] ------------[ cut here ]------------ > Mar 26 02:48:27 alg kernel: [ 681.168211] kernel BUG at drivers/block/rbd.c:2165! > > > -- And a last one. As you said, it looks like to be same problem, each time : Mar 26 03:32:53 murmillia kernel: [ 533.506391] rbd_img_obj_callback: bad image object request information: Mar 26 03:32:53 murmillia kernel: [ 533.506395] obj_request ffff88024fe73da8 Mar 26 03:32:53 murmillia kernel: [ 533.506396] ->object_name <(null)> Mar 26 03:32:53 murmillia kernel: [ 533.506397] ->offset 0 Mar 26 03:32:53 murmillia kernel: [ 533.506398] ->length 24576 Mar 26 03:32:53 murmillia kernel: [ 533.506399] ->type 0x1 Mar 26 03:32:53 murmillia kernel: [ 533.506400] ->flags 0x3 Mar 26 03:32:53 murmillia kernel: [ 533.506400] ->img_request (null) Mar 26 03:32:53 murmillia kernel: [ 533.506401] ->which 4294967295 Mar 26 03:32:53 murmillia kernel: [ 533.506402] ->xferred 24576 Mar 26 03:32:53 murmillia kernel: [ 533.506403] ->result 0 Mar 26 03:32:53 murmillia kernel: [ 533.506404] ->kref 0 Mar 26 03:32:53 murmillia kernel: [ 533.506405] img_request ffff880256f0cad8 Mar 26 03:32:53 murmillia kernel: [ 533.506405] ->snap 0xfffffffffffffffe Mar 26 03:32:53 murmillia kernel: [ 533.506406] ->offset 67637329920 Mar 26 03:32:53 murmillia kernel: [ 533.506407] ->length 40960 Mar 26 03:32:53 murmillia kernel: [ 533.506408] ->flags 0x0 Mar 26 03:32:53 murmillia kernel: [ 533.506409] ->obj_request_count 0 Mar 26 03:32:53 murmillia kernel: [ 533.506410] ->next_completion 2 Mar 26 03:32:53 murmillia kernel: [ 533.506410] ->xferred 40960 Mar 26 03:32:53 murmillia kernel: [ 533.506411] ->result 0 Mar 26 03:32:53 murmillia kernel: [ 533.506412] ->obj_requests head ffff880256f0cb38 Mar 26 03:32:53 murmillia kernel: [ 533.506413] ->kref 0 Mar 26 03:32:53 murmillia kernel: [ 533.506414] Mar 26 03:32:53 murmillia kernel: [ 533.506414] Assertion failure in rbd_img_obj_callback() at line 2165: Mar 26 03:32:53 murmillia kernel: [ 533.506414] Mar 26 03:32:53 murmillia kernel: [ 533.506414] rbd_assert(which == img_request->next_completion); Mar 26 03:32:53 murmillia kernel: [ 533.506414] Mar 26 03:32:53 murmillia kernel: [ 533.506620] ------------[ cut here ]------------ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mardi 25 mars 2014 à 20:55 -0500, Alex Elder a écrit : > On 03/25/2014 08:50 PM, Olivier Bonvalet wrote: > > Le mercredi 26 mars 2014 à 02:33 +0100, Olivier Bonvalet a écrit : > >> Thanks for your patch. > >> > >> This is an output of a crash case : > >> > >> Mar 26 02:31:18 alg kernel: [ 965.366895] rbd_img_obj_callback: bad image object request information: > >> Mar 26 02:31:18 alg kernel: [ 965.366905] obj_request ffff880224bc9528 > >> Mar 26 02:31:18 alg kernel: [ 965.366909] ->object_name <(null)> > >> Mar 26 02:31:18 alg kernel: [ 965.366913] ->offset 0 > >> Mar 26 02:31:18 alg kernel: [ 965.366917] ->length 4096 > >> Mar 26 02:31:18 alg kernel: [ 965.366921] ->type 0x1 > >> Mar 26 02:31:18 alg kernel: [ 965.366925] ->flags 0x3 > >> Mar 26 02:31:18 alg kernel: [ 965.366929] ->img_request (null) > >> Mar 26 02:31:18 alg kernel: [ 965.366933] ->which 4294967295 > >> Mar 26 02:31:18 alg kernel: [ 965.366936] ->xferred 4096 > >> Mar 26 02:31:18 alg kernel: [ 965.366940] ->result 0 > >> Mar 26 02:31:18 alg kernel: [ 965.366943] ->kref 0 > >> Mar 26 02:31:18 alg kernel: [ 965.366947] img_request ffff880222f4fb50 > >> Mar 26 02:31:18 alg kernel: [ 965.366950] ->snap 0xfffffffffffffffe > >> Mar 26 02:31:18 alg kernel: [ 965.366954] ->offset 1417662464 > >> Mar 26 02:31:18 alg kernel: [ 965.366957] ->length 16384 > >> Mar 26 02:31:18 alg kernel: [ 965.366960] ->flags 0x0 > >> Mar 26 02:31:18 alg kernel: [ 965.366963] ->obj_request_count 0 > >> Mar 26 02:31:18 alg kernel: [ 965.366966] ->next_completion 2 > >> Mar 26 02:31:18 alg kernel: [ 965.366969] ->xferred 16384 > >> Mar 26 02:31:18 alg kernel: [ 965.366973] ->result 0 > >> Mar 26 02:31:18 alg kernel: [ 965.366976] ->obj_requests head ffff880222f4fbb0 > >> Mar 26 02:31:18 alg kernel: [ 965.366980] ->kref 0 > >> Mar 26 02:31:18 alg kernel: [ 965.366985] > >> Mar 26 02:31:18 alg kernel: [ 965.366985] Assertion failure in rbd_img_obj_callback() at line 2165: > >> Mar 26 02:31:18 alg kernel: [ 965.366985] > >> Mar 26 02:31:18 alg kernel: [ 965.366985] rbd_assert(which == img_request->next_completion); > >> Mar 26 02:31:18 alg kernel: [ 965.366985] > >> Mar 26 02:31:18 alg kernel: [ 965.367185] ------------[ cut here ]------------ > >> Mar 26 02:31:18 alg kernel: [ 965.367241] kernel BUG at drivers/block/rbd.c:2165! > >> > >> > >> I hope it can help. > >> > >> > > > Thanks for sending these. > > > > > and a second one, very similar : > > > > Mar 26 02:48:27 alg kernel: [ 681.167833] rbd_img_obj_callback: bad image object request information: > > Mar 26 02:48:27 alg kernel: [ 681.167836] obj_request ffff88022e1e2828 > > Mar 26 02:48:27 alg kernel: [ 681.167837] ->object_name <(null)> > > Mar 26 02:48:27 alg kernel: [ 681.167838] ->offset 0 > > Mar 26 02:48:27 alg kernel: [ 681.167839] ->length 4096 > > Mar 26 02:48:27 alg kernel: [ 681.167840] ->type 0x1 > > Mar 26 02:48:27 alg kernel: [ 681.167840] ->flags 0x3 > > Mar 26 02:48:27 alg kernel: [ 681.167841] ->img_request (null) > > Mar 26 02:48:27 alg kernel: [ 681.167842] ->which 4294967295 > > Mar 26 02:48:27 alg kernel: [ 681.167843] ->xferred 4096 > > Mar 26 02:48:27 alg kernel: [ 681.167844] ->result 0 > > Mar 26 02:48:27 alg kernel: [ 681.167844] ->kref 0 > > This confirms the reference count of the object request has gone > to zero. This object request has already been destroyed (yet > we're handling a callback for it). > > > Mar 26 02:48:27 alg kernel: [ 681.167845] img_request ffff88021f555f10 > > Mar 26 02:48:27 alg kernel: [ 681.167846] ->snap 0xfffffffffffffffe > > Mar 26 02:48:27 alg kernel: [ 681.167847] ->offset 28072464384 > > Mar 26 02:48:27 alg kernel: [ 681.167847] ->length 16384 > > Mar 26 02:48:27 alg kernel: [ 681.167848] ->flags 0x0 > > Mar 26 02:48:27 alg kernel: [ 681.167849] ->obj_request_count 0 > > Mar 26 02:48:27 alg kernel: [ 681.167850] ->next_completion 2 > > Mar 26 02:48:27 alg kernel: [ 681.167850] ->xferred 16384 > > Mar 26 02:48:27 alg kernel: [ 681.167851] ->result 0 > > Mar 26 02:48:27 alg kernel: [ 681.167852] ->obj_requests head ffff88021f555f70 > > The object request list is empty. > > > Mar 26 02:48:27 alg kernel: [ 681.167853] ->kref 0 > > This confirms the reference count of the image request has gone > to zero. So not only has the object request already completed, > the image request has as well. > > I'm almost done composing a very large e-mail with some detailed > analysis. No answer quite yet, but I am certain that we're > getting duplicate callbacks on the second object request of > an image request that spans two objects. That should help > narrow the search for the root cause. > > -Alex Thanks again to took time to analyze that problem. All my RBD images have daily snapshots, can this bug be related to snapshots ? Maybe it's a stupid question, but is there a workaround that I could use to reduce that problem in production, until a proper fix is found ? -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/25/2014 09:40 PM, Olivier Bonvalet wrote: > Thanks again to took time to analyze that problem. > > All my RBD images have daily snapshots, can this bug be related to > snapshots ? > > Maybe it's a stupid question, but is there a workaround that I could use > to reduce that problem in production, until a proper fix is found ? I don't believe this has anything to do with snapshots. And I wish I knew a workaround, but unfortunately I don't know of one. Once I understand the problem (I'm getting close) I might have one, but at that point I'll probably have a fix. I'll try to find a fix as soon as I can. -Alex PS I thought you said you were going to stop for the night! -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le mardi 25 mars 2014 à 21:42 -0500, Alex Elder a écrit :
> PS I thought you said you were going to stop for the night!
Yes, I would love ! But my phone doesn't stop ring about ceph crash :D
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: b/drivers/block/rbd.c =================================================================== --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2132,6 +2132,35 @@ static void rbd_img_obj_callback(struct spin_lock_irq(&img_request->completion_lock); if (which > img_request->next_completion) goto out; + if (which != img_request->next_completion) { + printk("%s: bad image object request information:\n", __func__); + printk("obj_request %p\n", obj_request); + printk(" ->object_name <%s>\n", obj_request->object_name); + printk(" ->offset %llu\n", obj_request->offset); + printk(" ->length %llu\n", obj_request->length); + printk(" ->type 0x%x\n", (u32)obj_request->type); + printk(" ->flags 0x%lx\n", obj_request->flags); + printk(" ->img_request %p\n", obj_request->img_request); + printk(" ->which %u\n", obj_request->which); + printk(" ->xferred %llu\n", obj_request->xferred); + printk(" ->result %d\n", obj_request->result); + printk(" ->kref %d\n", atomic_read(&obj_request->kref)); + + printk("img_request %p\n", img_request); + printk(" ->snap 0x%016llx\n", img_request->snap_id); + printk(" ->offset %llu\n", img_request->offset); + printk(" ->length %llu\n", img_request->length); + printk(" ->flags 0x%lx\n", img_request->flags); + printk(" ->obj_request_count %u\n", + img_request->obj_request_count); + printk(" ->next_completion %u\n", + img_request->next_completion); + printk(" ->xferred %llu\n", img_request->xferred); + printk(" ->result %d\n", img_request->result); + printk(" ->obj_requests head %p\n", + img_request->obj_requests.next); + printk(" ->kref %d\n", atomic_read(&img_request->kref)); + } rbd_assert(which == img_request->next_completion); for_each_obj_request_from(img_request, obj_request) {