Message ID | 53325F09.7000306@ieee.org (mailing list archive) |
---|---|
State | New, archived |
On 03/26/2014 12:00 AM, Alex Elder wrote:
> I think I've got it. I'll explain, and we'll have to look at
> the explanation more closely in the morning.
>
> Bottom line is that I think the assertion is bogus. Like
> the other assertion that was not done under protection of
> a lock, this one is being done in a way that's not safe.
>
> First, here's a patch that I think might avoid the problem:
>
> ----------------------------------
> Index: b/drivers/block/rbd.c
> ===================================================================
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct
>          rbd_assert(which < img_request->obj_request_count);
>
>          spin_lock_irq(&img_request->completion_lock);
> -        if (which > img_request->next_completion)
> +        if (which != img_request->next_completion)
>                  goto out;
> -        rbd_assert(which == img_request->next_completion);
>
>          for_each_obj_request_from(img_request, obj_request) {
>                  rbd_assert(more);
> ----------------------------------

After some sleep (though not a lot), I remain convinced the above will fix the problem, for now. But it's not quite enough.

The problem arises from the fact that the thread of control that marks an object request done is not necessarily the thread of control that processes the completion of that object request (or of its associated image request). There are no duplicate requests and no duplicate responses.

Object request completions are handled sequentially, based on the range of address space involved in the requests. This is the model required by the Linux blk_end_request() interface. Let's assume we have a single image request that has two contiguous object requests, with the "lower" object request involving bytes ending at the point the "higher" one begins.

These object requests can be completed by their OSDs in any order, and we handle their completion in the order their acknowledgements are received. If we process the higher request's acknowledgement first, it will be the first to enter rbd_img_obj_callback(). Since we will not have completed the lower one yet, nothing more is done.

When the lower object request completes, it will enter rbd_img_obj_callback() and find that it *is* the next one we want completed. So we enter a loop that calls rbd_img_obj_end_request() on this request, followed by its successor, doing completion processing on any request that has by now been marked done. In the case I've described, the higher request will already be marked done, so it is here that its completion processing is performed.

Once the last object request has been completed, we call rbd_img_request_complete(), which will ultimately lead to dropping a reference on the image request, which leads to rbd_img_request_destroy().

What's happening in Olivier's case is that the requests are actually completing *in order*, but nearly simultaneously. The lower and higher request acknowledgements from their OSDs arrive, and their consequent calls to rbd_img_obj_end_request() occur at the same time. The lower request wins the race to get the spinlock before checking whether it is the next one to complete. So the higher request waits; but note that both of them have been marked "done" at this point.

The lower request enters the completion loop, calls rbd_img_obj_end_request() on both requests, and updates the next_completion value before releasing the spinlock. Now the higher request acquires the lock and gets its chance to check its position against next_completion. Since completion of this higher request has already been done while processing the lower request, next_completion has now gone past the higher request. Therefore its "which" value will not be higher than "next_completion", but neither will it be equal to it, so the old assertion would fail.

So my proposed fix just recognizes that, and allows only the object request matching next_completion to enter the completion loop, without assuming any other request must have a greater "which" value.

Hopefully that's a more complete and clear explanation of the problem.

Now, on to why that's not sufficient.

Basically, at the instant an image object request is marked done, it is eligible to be completed by a call to rbd_img_obj_end_request() from a concurrent thread processing a lower-numbered object request in the same image request. And just after that instant (essentially, because of the protection of the completion lock), the image request itself will be completed by a call to rbd_img_request_complete(). It's likely this will drop the last reference to the image request, so rbd_img_request_destroy() will be called. And that calls rbd_img_obj_request_del() on each of its object requests.

In other words, conceivably the higher object request that was stuck waiting for the spinlock in rbd_img_obj_end_request() may now be getting deleted from its image's request list. (It could be worse: it may have *just* finished getting marked done and may not even have entered rbd_img_obj_end_request() yet.) And in the process of getting deleted from its image request's object request list, an object request gets partially reinitialized, freed, and may get reused.

So... in order to have a complete fix we need to do some careful work with reference counting of image requests and object requests, and avoid having any of their content disturbed until the last reference is dropped.

Right now it's not quite right. In particular, each object request needs to have a *reference* to its image request (not just a pointer to it), and that reference should not be dropped until the object request gets destroyed. And because one thread can destroy an object request being actively used by another, each object request should have a reference that gets held until it is no longer needed (probably until the end of rbd_osd_req_callback()).

I think for the time being, the simple fix I described last night will do, and it's somewhat unlikely--though not impossible--that the problems due to concurrent destruction will arise. But we do need a fix. I'm prepared to create it, but for now I'd like to have another opinion on my treatise above.

					-Alex
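To make the reference-counting plan above concrete, here is a minimal sketch of the kind of change being described, written as a fragment against drivers/block/rbd.c. It is illustrative only: rbd_img_request_get() is a hypothetical helper, and the bodies shown are assumptions, not the driver's actual code; rbd_img_request_put() and the add/del helpers are the existing functions named in this thread.

----------------------------------
/* Sketch only -- illustrative bodies, not actual rbd.c code. */

static void rbd_img_request_get(struct rbd_img_request *img_request)
{
        /* Hypothetical helper: take an extra reference on the
         * image request's kref. */
        kref_get(&img_request->kref);
}

static void rbd_img_obj_request_add(struct rbd_img_request *img_request,
                                    struct rbd_obj_request *obj_request)
{
        /* Pin the image request: it must not be destroyed while any
         * of its object requests can still reach it. */
        rbd_img_request_get(img_request);
        obj_request->img_request = img_request;
        /* ... link obj_request into img_request's object list ... */
}

static void rbd_img_obj_request_del(struct rbd_img_request *img_request,
                                    struct rbd_obj_request *obj_request)
{
        /* ... unlink obj_request from img_request's object list ... */
        obj_request->img_request = NULL;
        /* Drop the reference taken in rbd_img_obj_request_add(). */
        rbd_img_request_put(img_request);
}
----------------------------------

The same pattern would cover the per-object-request reference Alex mentions holding until the end of rbd_osd_req_callback(), so that one thread cannot free an object request another thread is still completing.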
On Wed, Mar 26, 2014 at 1:13 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/26/2014 12:00 AM, Alex Elder wrote:
> [...]
>
> I think for the time being, the simple fix I described last night
> will do, and it's somewhat unlikely--though not impossible--that the
> problems due to concurrent destruction will arise. But we do need
> a fix. I'm prepared to create it, but for now I'd like to have
> another opinion on my treatise above.

This all makes sense; I think reference counting is the right thing to do. We definitely do need a real fix: with the simple fix we still have a potential use-after-free on img_request itself, not to mention obj_requests.

It looks like the img_request kref currently exists for posterity only. Unless I'm missing something, its counter is set to 1 in rbd_img_request_create() and is not incremented anywhere else, which means that the instant rbd_img_request_put() is called, img_request is freed. I naively assumed it was incremented and decremented in rbd_img_obj_request_add() and rbd_img_obj_request_del() respectively. Maybe that's something we should look at first?

Thanks,

					Ilya
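As background for the kref observation above, here is a minimal sketch of the standard kref pattern involved (the function names are the ones mentioned in the thread, but the bodies and the simplified rbd_img_request_create() signature are illustrative assumptions, not the actual rbd.c code). With the count initialized to 1 and never incremented, the very first put drops it to zero and the release function runs immediately.

----------------------------------
/* Illustrative sketch only -- not the actual rbd.c bodies. */
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct rbd_img_request {
        struct kref kref;
        /* ... remaining fields elided ... */
};

static void rbd_img_request_destroy(struct kref *kref)
{
        struct rbd_img_request *img_request =
                container_of(kref, struct rbd_img_request, kref);

        /* ... delete each object request, then free img_request
         * (the real driver may use a dedicated slab cache) ... */
        kfree(img_request);
}

static struct rbd_img_request *rbd_img_request_create(void)
{
        struct rbd_img_request *img_request;

        img_request = kzalloc(sizeof(*img_request), GFP_NOIO);
        if (!img_request)
                return NULL;
        kref_init(&img_request->kref);  /* counter starts at 1 */
        return img_request;
}

static void rbd_img_request_put(struct rbd_img_request *img_request)
{
        /* With no kref_get() anywhere else, this first put reaches
         * zero and rbd_img_request_destroy() runs right away. */
        kref_put(&img_request->kref, rbd_img_request_destroy);
}
----------------------------------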
On 03/26/2014 06:43 AM, Ilya Dryomov wrote:
> It looks like the img_request kref currently exists for posterity
> only. Unless I'm missing something, its counter is set to 1 in
> rbd_img_request_create() and is not incremented anywhere else, which
> means that the instant rbd_img_request_put() is called, img_request
> is freed. I naively assumed it was incremented and decremented in
> rbd_img_obj_request_add() and rbd_img_obj_request_del() respectively.
> Maybe that's something we should look at first?

I believe it was something that I never finished implementing.

So yes, the kref is there to be used; it just isn't, really.

If you want to give it a try, be my guest. I'll review it carefully. (I can't really test my changes very well anyway.)

If you want me to do it, let me know.

					-Alex
On Wed, Mar 26, 2014 at 1:47 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/26/2014 06:43 AM, Ilya Dryomov wrote:
>> [...]
>
> I believe it was something that I never finished implementing.
>
> So yes, the kref is there to be used; it just isn't, really.
>
> If you want to give it a try, be my guest. I'll review it
> carefully. (I can't really test my changes very well anyway.)
>
> If you want me to do it, let me know.

I think you should do it ;) If time is not a big issue, that is. You have a much more complete picture in your head.

Thanks,

					Ilya
Olivier reports that with the simple patch I provided (which changed a ">" to a "!=" and removed an assertion) he is running successfully.

To me this is fantastic news, and you can see I posted a patch with the fix.

There remains a race condition, though, one which I described in a separate message earlier today. I don't think it will prove to be a problem in practice, but I agreed to work on a fix to ensure the race condition is eliminated. It will require some work with reference counting image and object requests.

The fix won't be coming today, but I aim to provide it within a matter of days.

					-Alex
On Wednesday, March 26, 2014 at 15:58 -0500, Alex Elder wrote:
> Olivier reports that with the simple patch I provided (which changed
> a ">" to a "!=" and removed an assertion) he is running successfully.
> [...]
> The fix won't be coming today, but I aim to provide it within a
> matter of days.

One question from one of my customers: why am I the only one to complain about this problem? I know that Ceph users often use qemu/librbd instead of the kernel client, but what triggers these "race conditions"? Having "multiple requests" per RBD image? That should be normal use, no?

If someone can help me give an explanation, thanks :)

Olivier
On Thu, Mar 27, 2014 at 9:48 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> On Wednesday, March 26, 2014 at 15:58 -0500, Alex Elder wrote:
>> [...]
>
> One question from one of my customers: why am I the only one to
> complain about this problem? I know that Ceph users often use
> qemu/librbd instead of the kernel client, but what triggers these
> "race conditions"? Having "multiple requests" per RBD image? That
> should be normal use, no?
>
> If someone can help me give an explanation, thanks :)

We've had a couple more similar reports in the last few months. However, you are the first reporter who was able to trigger this race often enough to track it down. This race condition (read: bug) is kernel client specific; qemu/librbd is unaffected. Having an rbd request that spans multiple RADOS objects, and therefore results in multiple object requests, is normal use; it's just that this particular piece of code turned out to be prone to a subtle race. You have to keep in mind that races are all about timing and the relative order of events, so simply issuing a multi-object rbd request is not enough to trigger it; the stars have to align too ;)

Thanks,

					Ilya
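As a toy illustration of why multi-object rbd requests are routine (assuming the default 4 MiB rbd object size; a given image may be created with a different object order), a 2 MiB request starting 3 MiB into the image crosses an object boundary and therefore becomes two object requests:

----------------------------------
#include <stdio.h>

/* Toy arithmetic only: count how many RADOS objects a single
 * block-layer request touches.  4 MiB is rbd's default object size;
 * that value is an assumption for the example, not read from any
 * actual image. */
int main(void)
{
        unsigned long long obj_size = 4ULL << 20;  /* 4 MiB objects */
        unsigned long long offset   = 3ULL << 20;  /* request starts at 3 MiB */
        unsigned long long length   = 2ULL << 20;  /* request is 2 MiB long */

        unsigned long long first = offset / obj_size;
        unsigned long long last  = (offset + length - 1) / obj_size;

        printf("object requests needed: %llu\n", last - first + 1);  /* prints 2 */
        return 0;
}
----------------------------------

Each of those object requests completes independently on its OSD, which is exactly the window in which the ordering and lifetime issues discussed earlier in the thread can arise.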
On Thursday, March 27, 2014 at 10:45 +0200, Ilya Dryomov wrote:
> [...]
> You have to keep in mind that races are all about timing and the
> relative order of events, so simply issuing a multi-object rbd
> request is not enough to trigger it; the stars have to align too ;)

Great, thanks!

Olivier
Index: b/drivers/block/rbd.c
===================================================================
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct
         rbd_assert(which < img_request->obj_request_count);
 
         spin_lock_irq(&img_request->completion_lock);
-        if (which > img_request->next_completion)
+        if (which != img_request->next_completion)
                 goto out;
-        rbd_assert(which == img_request->next_completion);
 
         for_each_obj_request_from(img_request, obj_request) {
                 rbd_assert(more);