diff mbox

error propagation problem on xfs over dm stripe

Message ID b44b5dc2-eac0-cd40-6de8-b615cb4dece8@sandeen.net (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

Eric Sandeen Feb. 1, 2017, 3:12 a.m. UTC
xfstest generic/108 creates a stripe then fails one leg to
see if io errors on only a subset of an io get reported
to userspace:

# Test partial block device failure. Calls like fsync() should report failure
# on partial I/O failure, e.g. a single failed disk in a raid 0 stripe.
#
# Test motivated by an XFS bug, and this commit fixed the issue
# xfs: return errors from partial I/O failures to files


This started failing with a recent xfs commit, but the change wasn't
expected to change anything related to this behavior at all.

My best guess was that allocation and IO patterns shifted over
the stripe, and I think that turned out to be right.

I tracked this down to being unique to dm; an md stripe of
the same geometry doesn't have this issue.  Root cause seems
to be that dm's dec_pending() overwrites a bio's bi_error regardless
of current state, and in some cases will overwrite an -EIO with
a zero.  This seems to fix it for me:


but Mike was a little uneasy, not knowing for sure how we got here to
overwrite this bio's error (hopefully I'm representing his concerns
fairly and correctly).

One other clue is that I think this is a chained bio, something xfs
does use.

I'll submit the above as a proper dm patch if it seems sane, but figured
I'd throw this out on the lists as a bit of a heads up / question / RFC
before I do that.

-Eric

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Christoph Hellwig Feb. 1, 2017, 7:42 a.m. UTC | #1
On Tue, Jan 31, 2017 at 09:12:07PM -0600, Eric Sandeen wrote:
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5..3555ba8 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -808,7 +808,9 @@ static void dec_pending(struct dm_io *io, int error)
>  		} else {
>  			/* done with normal IO or empty flush */
>  			trace_block_bio_complete(md->queue, bio, io_error);
> -			bio->bi_error = io_error;
> +			/* don't overwrite or clear existing errors */
> +			if (!bio->bi_error)
> +				bio->bi_error = io_error;
>  			bio_endio(bio);
>  		}
>  	}
> 
> but Mike was a little uneasy, not knowing for sure how we got here to
> overwrite this bio's error (hopefully I'm representing his concerns
> fairly and correctly).

FYI, what we do both in the XFS buffer cache and the new direct I/O
code is to use a

	        cmpxchg(&dio->error, 0, ret);

in a similar situation, which should work here, too.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Junichi Nomura Feb. 1, 2017, 10:28 a.m. UTC | #2
On 02/01/17 12:12, Eric Sandeen wrote:
> xfstest generic/108 creates a stripe then fails one leg to
> see if io errors on only a subset of an io get reported
> to userspace:
> 
> # Test partial block device failure. Calls like fsync() should report failure
> # on partial I/O failure, e.g. a single failed disk in a raid 0 stripe.
> #
> # Test motivated by an XFS bug, and this commit fixed the issue
> # xfs: return errors from partial I/O failures to files
> 
> 
> This started failing with a recent xfs commit, but the change wasn't
> expected to change anything related to this behavior at all.
> 
> My best guess was that allocation and IO patterns shifted over
> the stripe, and I think that turned out to be right.
> 
> I tracked this down to being unique to dm; an md stripe of
> the same geometry doesn't have this issue.  Root cause seems
> to be that dm's dec_pending() overwrites a bio's bi_error regardless
> of current state, and in some cases will overwrite an -EIO with
> a zero.  This seems to fix it for me:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5..3555ba8 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -808,7 +808,9 @@ static void dec_pending(struct dm_io *io, int error)
>  		} else {
>  			/* done with normal IO or empty flush */
>  			trace_block_bio_complete(md->queue, bio, io_error);
> -			bio->bi_error = io_error;
> +			/* don't overwrite or clear existing errors */
> +			if (!bio->bi_error)
> +				bio->bi_error = io_error;
>  			bio_endio(bio);
>  		}
>  	}
> 
> but Mike was a little uneasy, not knowing for sure how we got here to
> overwrite this bio's error (hopefully I'm representing his concerns
> fairly and correctly).
> 
> One other clue is that I think this is a chained bio, something xfs
> does use.
> 
> I'll submit the above as a proper dm patch if it seems sane, but figured
> I'd throw this out on the lists as a bit of a heads up / question / RFC
> before I do that.

I couldn't see why the patch is needed.

dec_pending() already takes care of mixed error/non-error returns
from underlying devices by recording the last observed error:

        if (unlikely(error)) {
                spin_lock_irqsave(&io->endio_lock, flags);
                if (!(io->error > 0 && __noflush_suspending(md)))
                        io->error = error;
                spin_unlock_irqrestore(&io->endio_lock, flags);
        }

So successful return from healthy stripe doesn't overwrite
error return from faulty stripe.

Was bi_error already set by xfs when submitted and DM had to keep it?
If so, similar fix should be necessary for other drivers, not only for DM.
Mike Snitzer Feb. 1, 2017, 2:33 p.m. UTC | #3
On Wed, Feb 01 2017 at  2:42am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 31, 2017 at 09:12:07PM -0600, Eric Sandeen wrote:
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3086da5..3555ba8 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -808,7 +808,9 @@ static void dec_pending(struct dm_io *io, int error)
> >  		} else {
> >  			/* done with normal IO or empty flush */
> >  			trace_block_bio_complete(md->queue, bio, io_error);
> > -			bio->bi_error = io_error;
> > +			/* don't overwrite or clear existing errors */
> > +			if (!bio->bi_error)
> > +				bio->bi_error = io_error;
> >  			bio_endio(bio);
> >  		}
> >  	}
> > 
> > but Mike was a little uneasy, not knowing for sure how we got here to
> > overwrite this bio's error (hopefully I'm representing his concerns
> > fairly and correctly).

Well that is just it, I'm not seeing how io_error (io->error) can ever
transition from non-zero to zero.  And bio->bi_error shouldn't be set
without having first set io->error.  But just cause I cannot see it
doesn't change the fact that it is clearly happening to you.

It does concern me that this kind of fundamental error propagation
change is needed.  Speaks to a regression.  Would be nice to bisect
this.. Eric? ;)
 
> FYI, what we do both in the XFS buffer cache and the new direct I/O
> code is to use a
> 
> 	        cmpxchg(&dio->error, 0, ret);
> 
> in a similar situation, which should work here, too.

What is the benefit?  Faster than the conditional?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Christoph Hellwig Feb. 1, 2017, 2:49 p.m. UTC | #4
On Wed, Feb 01, 2017 at 09:33:51AM -0500, Mike Snitzer wrote:
> > FYI, what we do both in the XFS buffer cache and the new direct I/O
> > code is to use a
> > 
> > 	        cmpxchg(&dio->error, 0, ret);
> > 
> > in a similar situation, which should work here, too.
> 
> What is the benefit?  Faster than the conditional?

In the cases where I added the code the concern is that multiple
bio completions could race updating the flag, cmpxchg is a good
protection against that.  I'll need to take a look at the dm
case Eric reported in more detail to comment on it, though.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5..3555ba8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -808,7 +808,9 @@  static void dec_pending(struct dm_io *io, int error)
 		} else {
 			/* done with normal IO or empty flush */
 			trace_block_bio_complete(md->queue, bio, io_error);
-			bio->bi_error = io_error;
+			/* don't overwrite or clear existing errors */
+			if (!bio->bi_error)
+				bio->bi_error = io_error;
 			bio_endio(bio);
 		}
 	}