diff mbox

Issue #5876 : assertion failure in rbd_img_obj_callback()

Message ID 533184AF.9050101@ieee.org (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Elder March 25, 2014, 1:29 p.m. UTC
On 03/25/2014 08:18 AM, Olivier Bonvalet wrote:
> 
> 
> Le mardi 25 mars 2014 à 14:57 +0200, Ilya Dryomov a écrit :
>> On Tue, Mar 25, 2014 at 2:51 PM, Alex Elder <elder@ieee.org> wrote:
>>> On 03/25/2014 07:34 AM, Ilya Dryomov wrote:
>>>>> On 03/25/2014 04:04 AM, Ilya Dryomov wrote:
>>>>>> On Tue, Mar 25, 2014 at 10:39 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> what can/should I do to help fix that problem ?
>>>>>>>
>>>>>>> for now, RBD kernel client hang on :
>>>>>>>         Assertion failure in rbd_img_obj_callback() at line 2131:
>>>>>>>            rbd_assert(which >= img_request->next_completion);
>>>>>
>>>>> If you can build your own kernel as Ilya says I'd like to
>>>>> see the values of which and img_request->next_completion
>>>>> here.
>>>>
>>>> Looks like which was 1, which means that next_completion had to be 2 or
>>>> greater.  I miss solaris crash dumps ...
>>>>
>>>> On a different note, why are we asserting next_completion outside of
>>>> a spinlock which is supposed to protect next_completion?
>>>
>>> That's a very good point (which could be easily remedied by moving
>>> the assertion down a couple lines).  The image object request (#1)
>>> in this case will have been marked done at this point; it's possible
>>> that request #2 (or later) was concurrently getting handled by the
>>> for_each_obj_request_from() loop below in that same function, but
>>> may not have updated next_completion yet.
>>>
>>> So that *could* explain the tripped assertion.  The assertion
>>> should be moved in any case, it's a bug.
>>>
>>> That being said, it doesn't explain the other assertion:
>>>         rbd_assert(img_request != NULL);
>>> So there's at least one other thing going on.
>>
>> Yeah, exactly my thoughts.
>>
>> Thanks,
>>
>>                 Ilya
> 
> So, a (partial) fix can be this patch ?
> 
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2123,6 +2123,7 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
>         rbd_assert(obj_request_img_data_test(obj_request));
>         img_request = obj_request->img_request;
>  
> +       spin_lock_irq(&img_request->completion_lock);
>         dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
>         rbd_assert(img_request != NULL);
>         rbd_assert(img_request->obj_request_count > 0);
> @@ -2130,7 +2131,6 @@ static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
>         rbd_assert(which < img_request->obj_request_count);
>         rbd_assert(which >= img_request->next_completion);
>  
> -       spin_lock_irq(&img_request->completion_lock);
>         if (which != img_request->next_completion)
>                 goto out;


Yes, roughly.  I'd do the following instead.  It would be great
to learn whether it eliminates the one form of assertion failure
you were seeing.

					-Alex



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2128,11 +2128,11 @@  static void rbd_img_obj_callback(struct
 	rbd_assert(img_request->obj_request_count > 0);
 	rbd_assert(which != BAD_WHICH);
 	rbd_assert(which < img_request->obj_request_count);
-	rbd_assert(which >= img_request->next_completion);

 	spin_lock_irq(&img_request->completion_lock);
 	if (which != img_request->next_completion)
 		goto out;
+	rbd_assert(which > img_request->next_completion);

 	for_each_obj_request_from(img_request, obj_request) {
 		rbd_assert(more);