Message ID | 87eflmpqkb.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Feb 15 2018 at 4:09am -0500, NeilBrown <neilb@suse.com> wrote: > > If two bios are chained under the one parent (with bio_chain()) > it is possible that one will succeed and the other will fail. > __bio_chain_endio must ensure that the failure error status > is reported for the whole, rather than the success. > > It currently tries to be careful, but this test is racy. > If both children finish at the same time, they might both see that > parent->bi_status as zero, and so will assign their own status. > If the assignment to parent->bi_status by the successful bio happens > last, the error status will be lost which can lead to silent data > corruption. > > Instead, __bio_chain_endio should only assign a non-zero status > to parent->bi_status. There is then no need to test the current > value of parent->bi_status - a test that would be racy anyway. > > Note that this bug hasn't been seen in practice. It was only discovered > by examination after a similar bug was found in dm.c > > Signed-off-by: NeilBrown <neilb@suse.com> > --- > block/bio.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/bio.c b/block/bio.c > index e1708db48258..ad77140edc6f 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio) > { > struct bio *parent = bio->bi_private; > > - if (!parent->bi_status) > + if (bio->bi_status) > parent->bi_status = bio->bi_status; > bio_put(bio); > return parent; > -- > 2.14.0.rc0.dirty > Reviewed-by: Mike Snitzer <snitzer@redhat.com> Jens, this one slipped through the crack just over a year ago. It is available in patchwork here: https://patchwork.kernel.org/patch/10220727/
On 2/22/19 2:10 PM, Mike Snitzer wrote: > On Thu, Feb 15 2018 at 4:09am -0500, > NeilBrown <neilb@suse.com> wrote: > >> >> If two bios are chained under the one parent (with bio_chain()) >> it is possible that one will succeed and the other will fail. >> __bio_chain_endio must ensure that the failure error status >> is reported for the whole, rather than the success. >> >> It currently tries to be careful, but this test is racy. >> If both children finish at the same time, they might both see that >> parent->bi_status as zero, and so will assign their own status. >> If the assignment to parent->bi_status by the successful bio happens >> last, the error status will be lost which can lead to silent data >> corruption. >> >> Instead, __bio_chain_endio should only assign a non-zero status >> to parent->bi_status. There is then no need to test the current >> value of parent->bi_status - a test that would be racy anyway. >> >> Note that this bug hasn't been seen in practice. It was only discovered >> by examination after a similar bug was found in dm.c >> >> Signed-off-by: NeilBrown <neilb@suse.com> >> --- >> block/bio.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/block/bio.c b/block/bio.c >> index e1708db48258..ad77140edc6f 100644 >> --- a/block/bio.c >> +++ b/block/bio.c >> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio) >> { >> struct bio *parent = bio->bi_private; >> >> - if (!parent->bi_status) >> + if (bio->bi_status) >> parent->bi_status = bio->bi_status; >> bio_put(bio); >> return parent; >> -- >> 2.14.0.rc0.dirty >> > > Reviewed-by: Mike Snitzer <snitzer@redhat.com> > > Jens, this one slipped through the crack just over a year ago. > It is available in patchwork here: > https://patchwork.kernel.org/patch/10220727/ Should this be: if (!parent->bi_status && bio->bi_status) parent->bi_status = bio->bi_status; perhaps?
On Fri, Feb 22 2019 at 5:46pm -0500, Jens Axboe <axboe@kernel.dk> wrote: > On 2/22/19 2:10 PM, Mike Snitzer wrote: > > On Thu, Feb 15 2018 at 4:09am -0500, > > NeilBrown <neilb@suse.com> wrote: > > > >> > >> If two bios are chained under the one parent (with bio_chain()) > >> it is possible that one will succeed and the other will fail. > >> __bio_chain_endio must ensure that the failure error status > >> is reported for the whole, rather than the success. > >> > >> It currently tries to be careful, but this test is racy. > >> If both children finish at the same time, they might both see that > >> parent->bi_status as zero, and so will assign their own status. > >> If the assignment to parent->bi_status by the successful bio happens > >> last, the error status will be lost which can lead to silent data > >> corruption. > >> > >> Instead, __bio_chain_endio should only assign a non-zero status > >> to parent->bi_status. There is then no need to test the current > >> value of parent->bi_status - a test that would be racy anyway. > >> > >> Note that this bug hasn't been seen in practice. It was only discovered > >> by examination after a similar bug was found in dm.c > >> > >> Signed-off-by: NeilBrown <neilb@suse.com> > >> --- > >> block/bio.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/block/bio.c b/block/bio.c > >> index e1708db48258..ad77140edc6f 100644 > >> --- a/block/bio.c > >> +++ b/block/bio.c > >> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio) > >> { > >> struct bio *parent = bio->bi_private; > >> > >> - if (!parent->bi_status) > >> + if (bio->bi_status) > >> parent->bi_status = bio->bi_status; > >> bio_put(bio); > >> return parent; > >> -- > >> 2.14.0.rc0.dirty > >> > > > > Reviewed-by: Mike Snitzer <snitzer@redhat.com> > > > > Jens, this one slipped through the crack just over a year ago. > > It is available in patchwork here: > > https://patchwork.kernel.org/patch/10220727/ > > Should this be: > > if (!parent->bi_status && bio->bi_status) > parent->bi_status = bio->bi_status; > > perhaps? Yeap, even better. Not seeing any reason to have the last error win, the first in the chain is likely the most important.
I am perhaps not understanding the intricacies here, or not seeing a barrier protecting it, so forgive me if I'm off base. I think reading parent->bi_status here is unsafe. Consider the following sequence of events on two threads. Thread 0 Thread 1 In __bio_chain_endio: In __bio_chain_endio: [A] Child 0 reads parent->bi_status, no error. Child bio 1 reads parent, no error seen It sets parent->bi_status to an error It calls bio_put. Child bio 0 calls bio_put [end __bio_chain_endio] [end __bio_chain_endio] In bio_chain_endio(), bio_endio(parent) is called, calling bio_remaining_done() which decrements __bi_remaining to 1 and returns false, so no further endio stuff is done. In bio_chain_endio(), bio_endio(parent) is called, calling bio_remaining_done(), decrementing parent->__bi_remaining to 0, and continuing to finish parent. Either for block tracing or for parent's bi_end_io(), this thread tries to read parent->bi_status again. The compiler or the CPU may cache the read from [A], and since there are no intervening barriers, parent->bi_status is still believed on thread 0 to be success. Thus the bio may still be falsely believed to have completed successfully, even though child 1 set an error in it. Am I missing a subtlety here?
On Fri, Feb 22 2019 at 9:02pm -0500, John Dorminy <jdorminy@redhat.com> wrote: > I am perhaps not understanding the intricacies here, or not seeing a > barrier protecting it, so forgive me if I'm off base. I think reading > parent->bi_status here is unsafe. > Consider the following sequence of events on two threads. > > Thread 0 Thread 1 > In __bio_chain_endio: In __bio_chain_endio: > [A] Child 0 reads parent->bi_status, > no error. > Child bio 1 reads parent, no error seen > It sets parent->bi_status to an error > It calls bio_put. > Child bio 0 calls bio_put > [end __bio_chain_endio] [end __bio_chain_endio] > In bio_chain_endio(), bio_endio(parent) > is called, calling bio_remaining_done() > which decrements __bi_remaining to 1 > and returns false, so no further endio > stuff is done. > In bio_chain_endio(), bio_endio(parent) > is called, calling bio_remaining_done(), > decrementing parent->__bi_remaining to > 0, and continuing to finish parent. > Either for block tracing or for parent's > bi_end_io(), this thread tries to read > parent->bi_status again. > > The compiler or the CPU may cache the read from [A], and since there > are no intervening barriers, parent->bi_status is still believed on > thread 0 to be success. Thus the bio may still be falsely believed to > have completed successfully, even though child 1 set an error in it. > > Am I missing a subtlety here? Either neilb's original or even Jens' suggestion would be fine though. > if (!parent->bi_status && bio->bi_status) > parent->bi_status = bio->bi_status; Even if your scenario did play out (which I agree it looks possible) it'd just degenerate to neilb's version: > if (bio->bi_status) > parent->bi_status = bio->bi_status; Which also accomplishes fixing what Neil originally detailed in his patch header.
I'm also worried about the other two versions, though: memory-barriers.txt#1724: 1724 (*) The compiler is within its rights to invent stores to a variable, i.e. the compiler is free to decide __bio_chain_endio looks like this: static struct bio *__bio_chain_endio(struct bio *bio) { struct bio *parent = bio->bi_private; blk_status_t tmp = parent->bi_status; parent->bi_status = bio->bi_status; if (!bio->bi_status) parent->bi_status = tmp; bio_put(bio); return parent; } In which case, the read and later store on the two different threads may overlap in such a way that bio_endio sometimes sees success, even if one child had an error. As a result, I believe the setting of parent->bi_status needs to be a WRITE_ONCE() and the later reading needs to be a READ_ONCE() [although, since the later reading happens in many different functions, perhaps some other barrier to make sure all readers get the correct value is in order.]
Having studied the code in question more thoroughly, my first email's scenario can't occur, I believe. bio_put() contains a atomic_dec_and_test(), which (according to Documentation/atomic_t.txt), having a return value, are fully ordered and therefore impose a general memory barrier. A general memory barrier means that no value set before bio_put() may be observed afterwards, if I accurately understand Documentation/memory_barriers.txt. Thus, the scenario I imagined, with a load from inside __bio_chain_endio() being visible in bi_end_io(), cannot occur. However, the second email's scenario, of a compiler making two stores out of a conditional store, is still accurate according to my understanding. I continue to believe setting parent->bi_status needs to be a WRITE_ONCE() and any reading of parent->bi_status before bio_put() needs to be at least a READ_ONCE(). The last thing in that email is wrong for the same reason that the first email couldn't happen: the bio_put() general barrier means later accesses to the field from a single thread will freshly read the field and thereby not get an incorrectly cached value. As a concrete proposal, I believe either of the following work and fix the race NeilB described, as well as the compiler (or CPU) race I described: - if (!parent->bi_status) - parent->bi_status = bio->bi_status; + if (bio->bi_status) + WRITE_ONCE(parent->bi_status, bio->bi_status); --or-- - if (!parent->bi_status) - parent->bi_status = bio->bi_status; + if (!READ_ONCE(parent->bi_status) && bio->bi_status) + WRITE_ONCE(parent->bi_status, bio->bi_status); I believe the second of these might, but is not guaranteed to, preserve the first error observed in a child; I believe if you want to definitely save the first error you need an atomic.
On Tue, Jun 11, 2019 at 10:56:42PM -0400, John Dorminy wrote: > I believe the second of these might, but is not guaranteed to, > preserve the first error observed in a child; I believe if you want to > definitely save the first error you need an atomic. Is there any reason not to simply use a cmpxchg? Yes, it is a relatively expensive operation, but once we are chaining bios we are out of the super hot path anyway. We do something similar in xfs and iomap already.
On 6/12/19 9:01 AM, Christoph Hellwig wrote: > On Tue, Jun 11, 2019 at 10:56:42PM -0400, John Dorminy wrote: >> I believe the second of these might, but is not guaranteed to, >> preserve the first error observed in a child; I believe if you want to >> definitely save the first error you need an atomic. > > Is there any reason not to simply use a cmpxchg? Yes, it is a > relatively expensive operation, but once we are chaining bios we are out > of the super hot path anyway. We do something similar in xfs and iomap > already. > Agree. Thing is, we need to check if the parent status is NULL, _and_ the parent status might be modified asynchronously. So even a READ_ONCE() wouldn't cut it, as it would tell us that the parent status _was_ NULL, not that the parent status _is_ NULL by the time we're setting it. So cmpxchg() is it. Cheers, Hannes
diff --git a/block/bio.c b/block/bio.c index e1708db48258..ad77140edc6f 100644 --- a/block/bio.c +++ b/block/bio.c @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio) { struct bio *parent = bio->bi_private; - if (!parent->bi_status) + if (bio->bi_status) parent->bi_status = bio->bi_status; bio_put(bio); return parent;
If two bios are chained under the one parent (with bio_chain()) it is possible that one will succeed and the other will fail. __bio_chain_endio must ensure that the failure error status is reported for the whole, rather than the success. It currently tries to be careful, but this test is racy. If both children finish at the same time, they might both see that parent->bi_status as zero, and so will assign their own status. If the assignment to parent->bi_status by the successful bio happens last, the error status will be lost which can lead to silent data corruption. Instead, __bio_chain_endio should only assign a non-zero status to parent->bi_status. There is then no need to test the current value of parent->bi_status - a test that would be racy anyway. Note that this bug hasn't been seen in practice. It was only discovered by examination after a similar bug was found in dm.c Signed-off-by: NeilBrown <neilb@suse.com> --- block/bio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)