[RFC,v1,0/7] Block/XFS: Support alternative mirror device retry

Message ID: 1543376991-5764-1-git-send-email-allison.henderson@oracle.com (mailing list archive)
Return-Path: <linux-fsdevel-owner@kernel.org>
From: Allison Henderson <allison.henderson@oracle.com>
To: linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: martin.petersen@oracle.com, shirley.ma@oracle.com, bob.liu@oracle.com, allison.henderson@oracle.com
Subject: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry
Date: Tue, 27 Nov 2018 20:49:44 -0700
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Series: Block/XFS: Support alternative mirror device retry
  [RFC,v1,0/7] Block/XFS: Support alternative mirror device retry
  [v1,1/7] block: add nr_mirrors to request_queue
  [v1,2/7] block: expand write_hint of bio/request to rw_hint
  [v1,3/7] md: raid1: handle bi_rw_hint accordingly
  [v1,4/7] xfs: Add b_rw_hint to xfs_buf
  [v1,5/7] xfs: Add device retry
  [v1,6/7] xfs: Rewrite retried read
  [v1,7/7] xfs: Add tracepoints and logging to alternate device retry

Allison Henderson Nov. 28, 2018, 3:49 a.m. UTC

Motivation:
When a filesystem data/metadata checksum does not match, lower block devices may
have other correct copies. For example, if XFS successfully reads a metadata buffer off a raid1 but
decides that the metadata is garbage, today it will shut down the entire
filesystem without trying any of the other mirrors.  This is a severe
loss of service, and we propose these patches to have XFS try harder to
avoid failure.

This patch series prototypes the mirror retry idea by:
* Adding @nr_mirrors to struct request_queue. Similar to
  blk_queue_nonrot(), a filesystem can grab the device request queue and check
  the maximum number of mirrors this block device has.
  Helper functions were also added to get/set nr_mirrors.

* Expanding bi_write_hint to bi_rw_hint. @bi_rw_hint now has three meanings:
 1. The original write_hint.
 2. end_io() updates @bi_rw_hint to reflect which mirror the I/O was actually served from.
 3. The fs sets @bi_rw_hint to force a driver, e.g. raid1, to read from a specific mirror.

* Modify md/raid1 to support this retry feature.

* Add b_rw_hint to xfs_buf
  This patch adds a new field b_rw_hint to xfs_buf.  We will use this to set the
  new bio->bi_rw_hint when submitting the read request, and also to store the
  returned mirror when the read completes.

* Add device retry
  This patch adds some logic to xfs_buf_read_map.  If the read verify
  fails, we loop over the available mirrors and retry the read.

* Rewrite retried read
  When the read verification fails but the retry succeeds,
  write the buffer back to correct the bad mirror.

* Add tracepoints and logging to alternate device retry.
  This patch adds new log entries and trace points to the alternate device retry
  error path.
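
As a rough user-space sketch of the retry-and-rewrite flow listed above: the
toy device, the verifier, HINT_ANY == 0 and every name below are illustrative
assumptions, not code from the series. The first read lets the device pick a
mirror; on a verify failure each remaining mirror is tried by hint, and a good
copy is written back over the bad one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_MIRRORS 3   /* hypothetical: what nr_mirrors would report */
#define HINT_ANY   0   /* assumed: 0 means "device picks the mirror" */
#define BLKSZ      8

/* Toy mirrored device: one block, NR_MIRRORS copies. */
struct toy_raid1 {
	char copy[NR_MIRRORS][BLKSZ];
	int  next;         /* round-robin stand-in for load balancing */
};

/* Read by hint (1..NR_MIRRORS selects a mirror); returns the mirror used, 1-based. */
static int toy_read(struct toy_raid1 *r, int hint, char *buf)
{
	int m = (hint == HINT_ANY) ? r->next++ % NR_MIRRORS : hint - 1;

	memcpy(buf, r->copy[m], BLKSZ);
	return m + 1;
}

/* Stand-in for a filesystem read verifier. */
static bool toy_verify(const char *buf)
{
	return memcmp(buf, "GOODDATA", BLKSZ) == 0;
}

/* Returns the mirror that supplied good data, or -1 if every copy is bad. */
static int read_with_retry(struct toy_raid1 *r, char *buf)
{
	int first = toy_read(r, HINT_ANY, buf);

	if (toy_verify(buf))
		return first;

	for (int hint = 1; hint <= NR_MIRRORS; hint++) {
		if (hint == first)
			continue;                 /* already known bad */
		toy_read(r, hint, buf);
		if (toy_verify(buf)) {
			/* rewrite step: repair the copy that failed verification */
			memcpy(r->copy[first - 1], buf, BLKSZ);
			return hint;
		}
	}
	return -1;
}
```

The sketch only repairs the mirror that served the first read; any other bad
copies it skipped over or never reached are left alone.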

We're not planning to take over all 16 bits of the read hint field; just looking for
feedback about the sanity of the overall approach.

Allison Henderson (4):
  xfs: Add b_rw_hint to xfs_buf
  xfs: Add device retry
  xfs: Rewrite retried read
  xfs: Add tracepoints and logging to alternate device retry

Bob Liu (3):
  block: add nr_mirrors to request_queue
  block: expand write_hint of bio/request to rw_hint
  md: raid1: handle bi_rw_hint accordingly

 Documentation/block/biodoc.txt |  7 ++++++
 block/bio.c                    |  2 +-
 block/blk-core.c               | 13 ++++++++++-
 block/blk-merge.c              |  8 +++----
 block/blk-settings.c           | 18 ++++++++++++++
 block/bounce.c                 |  2 +-
 drivers/md/raid1.c             | 33 ++++++++++++++++++++++----
 drivers/md/raid5.c             | 10 ++++----
 drivers/md/raid5.h             |  2 +-
 drivers/nvme/host/core.c       |  2 +-
 fs/block_dev.c                 |  6 +++--
 fs/btrfs/extent_io.c           |  3 ++-
 fs/buffer.c                    |  3 ++-
 fs/direct-io.c                 |  3 ++-
 fs/ext4/page-io.c              |  7 ++++--
 fs/f2fs/data.c                 |  2 +-
 fs/iomap.c                     |  3 ++-
 fs/mpage.c                     |  2 +-
 fs/xfs/xfs_aops.c              |  4 ++--
 fs/xfs/xfs_buf.c               | 53 ++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_buf.h               |  8 +++++++
 fs/xfs/xfs_trace.h             |  6 ++++-
 include/linux/blk_types.h      |  2 +-
 include/linux/blkdev.h         |  5 +++-
 24 files changed, 169 insertions(+), 35 deletions(-)

Dave Chinner Nov. 28, 2018, 5:33 a.m. UTC | #1

On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> Motivation:
> When fs data/metadata checksum mismatch, lower block devices may have other
> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> decides that the metadata is garbage, today it will shut down the entire
> filesystem without trying any of the other mirrors.  This is a severe
> loss of service, and we propose these patches to have XFS try harder to
> avoid failure.
> 
> This patch prototype this mirror retry idea by:
> * Adding @nr_mirrors to struct request_queue which is similar as
>   blk_queue_nonrot(), filesystem can grab device request queue and check max
>   mirrors this block device has.
>   Helper functions were also added to get/set the nr_mirrors.
> 
> * Expanding bi_write_hint to bi_rw_hint, now @bi_rw_hint has three meanings.
>  1.Original write_hint.
>  2.end_io() will update @bi_rw_hint to reflect which mirror this i/o really happened.
>  3.Fs set @bi_rw_hint to force driver e.g raid1 read from a specific mirror.
> 
> * Modify md/raid1 to support this retry feature.
> 
> * Add b_rw_hint to xfs_buf
>   This patch adds a new field b_rw_hint to xfs_buf.  We will use this to set the
>   new bio->bi_rw_hint when submitting the read request, and also to store the
>   returned mirror when the read compleates

One thing that is going to make this more complex at the XFS layer
is discontiguous buffers. They require multiple IOs (and therefore
bios) and so we are going to need to ensure that all the bios use
the same bi_rw_hint.

This is another reason I suggest that bi_rw_hint has a magic value
for "block layer selects mirror" and separate the initial read from
the retry iterations. That allows us to let the block layer pick
whatever leg it wants for the initial read, but if we get a failure
we directly control the mirror we retry from and all bios in the
buffer go to that same mirror.
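
A minimal sketch of that idea, assuming a magic value of 0 for "block layer
selects mirror" and one bio per buffer map (the struct and function names are
hypothetical, not from xfs_buf.c): the initial read stamps the magic value on
every bio, and retry n stamps mirror n on all of them.

```c
#include <assert.h>

#define HINT_ANY 0   /* assumed magic value: block layer selects the mirror */
#define MAX_MAPS 4

struct hint_bio { int rw_hint; };

struct hint_buf {
	int nmaps;                        /* discontiguous buffer: nmaps > 1 */
	struct hint_bio bios[MAX_MAPS];   /* one bio per map */
};

/* Stamp one hint on every bio of the buffer before (re)submission. */
static void hint_buf_set(struct hint_buf *bp, int hint)
{
	for (int i = 0; i < bp->nmaps; i++)
		bp->bios[i].rw_hint = hint;
}

/* Attempt 0 is the initial read; attempt n >= 1 retries from mirror n. */
static int hint_for_attempt(int attempt)
{
	return attempt == 0 ? HINT_ANY : attempt;
}
```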

> We're not planning to take over all 16 bits of the read hint field; just looking for
> feedback about the sanity of the overall approach.

It seems conceptually simple enough - the biggest questions I have
are:

	- how does propagation through stacked layers work?
	- is it generic/abstract enough to be able to work with
	  RAID5/6 to trigger verification/recovery from the parity
	  information in the stripe?

Cheers,

Dave.

Darrick J. Wong Nov. 28, 2018, 5:49 a.m. UTC | #2

On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> > Motivation:
> > When fs data/metadata checksum mismatch, lower block devices may have other
> > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > decides that the metadata is garbage, today it will shut down the entire
> > filesystem without trying any of the other mirrors.  This is a severe
> > loss of service, and we propose these patches to have XFS try harder to
> > avoid failure.
> > 
> > This patch prototype this mirror retry idea by:
> > * Adding @nr_mirrors to struct request_queue which is similar as
> >   blk_queue_nonrot(), filesystem can grab device request queue and check max
> >   mirrors this block device has.
> >   Helper functions were also added to get/set the nr_mirrors.
> > 
> > * Expanding bi_write_hint to bi_rw_hint, now @bi_rw_hint has three meanings.
> >  1.Original write_hint.
> >  2.end_io() will update @bi_rw_hint to reflect which mirror this i/o really happened.
> >  3.Fs set @bi_rw_hint to force driver e.g raid1 read from a specific mirror.
> > 
> > * Modify md/raid1 to support this retry feature.
> > 
> > * Add b_rw_hint to xfs_buf
> >   This patch adds a new field b_rw_hint to xfs_buf.  We will use this to set the
> >   new bio->bi_rw_hint when submitting the read request, and also to store the
> >   returned mirror when the read compleates
> 
> One thing that is going to make this more complex at the XFS layer
> is discontiguous buffers. They require multiple IOs (and therefore
> bios) and so we are going to need to ensure that all the bios use
> the same bi_rw_hint.

Hmm, we hadn't thought about that.  What happens if we have a
discontiguous buffer mapped to multiple blocks, and there's only one
good copy of each block on separate disks in the whole array?

e.g. we have 8k directory blocks on a 4k block filesystem, only disk 0
has a good copy of block 0 and only disk 1 has a good copy of block 1?

I think we're just stuck with failing the whole thing because we can't
check the halves of the 8k block independently and there's too much of a
combinatoric explosion potential to try to mix and match.
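
To put rough numbers on that, here is a small sketch (illustrative names only)
of the 8k-on-4k example: a same-mirror retry never sees a fully good buffer,
and trying every mix of mirrors costs mirrors^segments whole-buffer reads.

```c
#include <assert.h>
#include <stdbool.h>

#define NSEG 2   /* 8k directory block on a 4k-block filesystem */
#define NMIR 2

/* good[s][m]: does mirror m hold a good copy of segment s? */
static bool same_mirror_retry_works(bool good[NSEG][NMIR])
{
	for (int m = 0; m < NMIR; m++) {
		bool all = true;

		for (int s = 0; s < NSEG; s++)
			all = all && good[s][m];
		if (all)
			return true;
	}
	return false;
}

/* Whole-buffer reads needed to try every mix of mirrors: mirrors^segments. */
static unsigned long mix_and_match_reads(unsigned mirrors, unsigned segments)
{
	unsigned long n = 1;

	while (segments--)
		n *= mirrors;
	return n;
}
```

With 2 mirrors and 2 segments that is only 4 combinations, but the count grows
exponentially with the number of segments in the buffer.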

> This is another reason I suggest that bi_rw_hint has a magic value
> for "block layer selects mirror" and separate the initial read from

(As mentioned in a previous reply of mine, setting rw_hint == 0 is the
magic value for "device picks mirror"...)

> the retry iterations. That allows us to let he block layer ot pick
> whatever leg it wants for the initial read, but if we get a failure
> we directly control the mirror we retry from and all bios in the
> buffer go to that same mirror.
> 
> > We're not planning to take over all 16 bits of the read hint field; just looking for
> > feedback about the sanity of the overall approach.
> 
> It seems conceptually simple enough - the biggest questions I have
> are:
> 
> 	- how does propagation through stacked layers work?

Right now it doesn't.  Once we work out how to make stacking work
through device mapper, my guess is that simple dm targets like linear
and crypt can set the mirror count to min(all underlying devices).
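
Sketched in plain C, that guess amounts to taking the minimum over the lower
devices. The function name and the "no lower devices means one copy" fallback
are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed policy: a pass-through target can only promise as many mirrors
 * as its weakest underlying device.
 */
static unsigned stacked_nr_mirrors(const unsigned *lower, size_t n)
{
	unsigned min = n ? lower[0] : 1;   /* no lower devices: one copy */

	for (size_t i = 1; i < n; i++)
		if (lower[i] < min)
			min = lower[i];
	return min;
}
```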

> 	- is it generic/abstract enough to be able to work with
> 	  RAID5/6 to trigger verification/recovery from the parity
> 	  information in the stripe?

In theory we could supply a raid5 implementation, wherein rw_hint == 0
lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
rw_hint == 2 forces stripe recovery for the given block.
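
Spelled out with hypothetical names (nothing like this exists in raid5.c
today), the three values could be dispatched as:

```c
#include <assert.h>

/* Hypothetical names for the three rw_hint values described above. */
enum toy_raid5_hint {
	R5_HINT_ANY     = 0,   /* raid does as it pleases */
	R5_HINT_STRIPE  = 1,   /* read the data block from the stripe */
	R5_HINT_RECOVER = 2,   /* force stripe recovery for the block */
};

enum toy_raid5_action { R5_DEFAULT, R5_READ_DATA, R5_REBUILD, R5_INVALID };

static enum toy_raid5_action toy_raid5_dispatch(int rw_hint)
{
	switch (rw_hint) {
	case R5_HINT_ANY:     return R5_DEFAULT;
	case R5_HINT_STRIPE:  return R5_READ_DATA;
	case R5_HINT_RECOVER: return R5_REBUILD;
	default:              return R5_INVALID;   /* hint out of range */
	}
}
```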

A trickier scenario that I have no idea how to solve is the question of
how to handle dynamic redundancy levels.  We don't have a standard bio
error value that means "this mirror is temporarily offline", so if you
have a raid1 of two disks and disk 0 goes offline, the retry loop in xfs
will hit the EIO and abort without even asking disk 1.  It's also
unclear if we need to designate a second bio error value to mean "this
mirror is permanently gone".

[Also insert handwaving about whether or not online fsck will want to
control retries and automatic rewrite; I suspect the answer is that it
doesn't care.]

[[Also insert severe handwaving about do we expose this to userspace so
that xfs_repair can use it?]]

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Dave Chinner Nov. 28, 2018, 6:30 a.m. UTC | #3

On Tue, Nov 27, 2018 at 09:49:23PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> > On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> > > Motivation:
> > > When fs data/metadata checksum mismatch, lower block devices may have other
> > > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > > decides that the metadata is garbage, today it will shut down the entire
> > > filesystem without trying any of the other mirrors.  This is a severe
> > > loss of service, and we propose these patches to have XFS try harder to
> > > avoid failure.
> > > 
> > > This patch prototype this mirror retry idea by:
> > > * Adding @nr_mirrors to struct request_queue which is similar as
> > >   blk_queue_nonrot(), filesystem can grab device request queue and check max
> > >   mirrors this block device has.
> > >   Helper functions were also added to get/set the nr_mirrors.
> > > 
> > > * Expanding bi_write_hint to bi_rw_hint, now @bi_rw_hint has three meanings.
> > >  1.Original write_hint.
> > >  2.end_io() will update @bi_rw_hint to reflect which mirror this i/o really happened.
> > >  3.Fs set @bi_rw_hint to force driver e.g raid1 read from a specific mirror.
> > > 
> > > * Modify md/raid1 to support this retry feature.
> > > 
> > > * Add b_rw_hint to xfs_buf
> > >   This patch adds a new field b_rw_hint to xfs_buf.  We will use this to set the
> > >   new bio->bi_rw_hint when submitting the read request, and also to store the
> > >   returned mirror when the read compleates
> > 
> > One thing that is going to make this more complex at the XFS layer
> > is discontiguous buffers. They require multiple IOs (and therefore
> > bios) and so we are going to need to ensure that all the bios use
> > the same bi_rw_hint.
> 
> Hmm, we hadn't thought about that.  What happens if we have a
> discontiguous buffer mapped to multiple blocks, and there's only one
> good copy of each block on separate disks in the whole array?
> 
> e.g. we have 8k directory blocks on a 4k block filesystem, only disk 0
> has a good copy of block 0 and only disk 1 has a good copy of block 1?

Then the user has a disaster on their hands because they have
multiple failing disks. 

> I think we're just stuck with failing the whole thing because we can't
> check the halves of the 8k block independently and there's too much of a
> combinatoric explosion potential to try to mix and match.

Yup, user needs to fix their storage before the filesystem can
attempt recovery.

> > > We're not planning to take over all 16 bits of the read hint field; just looking for
> > > feedback about the sanity of the overall approach.
> > 
> > It seems conceptually simple enough - the biggest questions I have
> > are:
> > 
> > 	- how does propagation through stacked layers work?
> 
> Right now it doesn't, though once we work out how to make stacking work
> through device mapper (my guess is that simple dm targets like linear
> and crypt can set the mirror count to min(all underlying devices).
> 
> > 	- is it generic/abstract enough to be able to work with
> > 	  RAID5/6 to trigger verification/recovery from the parity
> > 	  information in the stripe?
> 
> In theory we could supply a raid5 implementation, wherein rw_hint == 0
> lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> rw_hint == 2 forces stripe recovery for the given block.

So more magic numbers to define complex behaviours? :P

> A trickier scenario that I have no idea how to solve is the question of
> how to handle dynamic redundancy levels.  We don't have a standard bio
> error value that means "this mirror is temporarily offline", so if you

We can get ETIMEDOUT, ENOLINK, EBUSY and EAGAIN from the block layer
which all indicate temporary errors (see blk_errors[]). Whether the
specific storage layers are actually using them is another matter...

> have a raid1 of two disks and disk 0 goes offline, the retry loop in xfs
> will hit the EIO and abort without even asking disk 1.  It's also
> unclear if we need to designate a second bio error value to mean "this
> mirror is permanently gone".

If we have mirror-based retries, we should probably consider EIO
as "try next mirror", not as a hard failure.
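
One way to picture that policy, as a user-space sketch using errno values
rather than the block layer's own status codes (the enum and the "anything
else is a hard failure" default are assumptions):

```c
#include <assert.h>
#include <errno.h>

enum read_err_class { ERR_TRANSIENT, ERR_TRY_NEXT_MIRROR, ERR_HARD };

/* Classify a read error for a mirror-retry loop. */
static enum read_err_class classify_read_error(int err)
{
	switch (err) {
	case ETIMEDOUT:
	case ENOLINK:
	case EBUSY:
	case EAGAIN:
		return ERR_TRANSIENT;         /* temporary condition */
	case EIO:
		return ERR_TRY_NEXT_MIRROR;   /* not a hard failure while mirrors remain */
	default:
		return ERR_HARD;              /* assumed: anything else aborts the loop */
	}
}
```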

> [Also insert handwaving about whether or not online fsck will want to
> control retries and automatic rewrite; I suspect the answer is that it
> doesn't care.]

Don't care - have the storage fix itself, then check what comes
back and fix it from there.

> [[Also insert severe handwaving about do we expose this to userspace so
> that xfs_repair can use it?]]

I suspect the answer there is through the AIO interfaces....

Cheers,

Dave.

Darrick J. Wong Nov. 28, 2018, 7:15 a.m. UTC | #4

On Wed, Nov 28, 2018 at 05:30:46PM +1100, Dave Chinner wrote:
> On Tue, Nov 27, 2018 at 09:49:23PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
> > > > Motivation:
> > > > When fs data/metadata checksum mismatch, lower block devices may have other
> > > > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > > > decides that the metadata is garbage, today it will shut down the entire
> > > > filesystem without trying any of the other mirrors.  This is a severe
> > > > loss of service, and we propose these patches to have XFS try harder to
> > > > avoid failure.
> > > > 
> > > > This patch prototype this mirror retry idea by:
> > > > * Adding @nr_mirrors to struct request_queue which is similar as
> > > >   blk_queue_nonrot(), filesystem can grab device request queue and check max
> > > >   mirrors this block device has.
> > > >   Helper functions were also added to get/set the nr_mirrors.
> > > > 
> > > > * Expanding bi_write_hint to bi_rw_hint, now @bi_rw_hint has three meanings.
> > > >  1.Original write_hint.
> > > >  2.end_io() will update @bi_rw_hint to reflect which mirror this i/o really happened.
> > > >  3.Fs set @bi_rw_hint to force driver e.g raid1 read from a specific mirror.
> > > > 
> > > > * Modify md/raid1 to support this retry feature.
> > > > 
> > > > * Add b_rw_hint to xfs_buf
> > > >   This patch adds a new field b_rw_hint to xfs_buf.  We will use this to set the
> > > >   new bio->bi_rw_hint when submitting the read request, and also to store the
> > > >   returned mirror when the read compleates
> > > 
> > > One thing that is going to make this more complex at the XFS layer
> > > is discontiguous buffers. They require multiple IOs (and therefore
> > > bios) and so we are going to need to ensure that all the bios use
> > > the same bi_rw_hint.
> > 
> > Hmm, we hadn't thought about that.  What happens if we have a
> > discontiguous buffer mapped to multiple blocks, and there's only one
> > good copy of each block on separate disks in the whole array?
> > 
> > e.g. we have 8k directory blocks on a 4k block filesystem, only disk 0
> > has a good copy of block 0 and only disk 1 has a good copy of block 1?
> 
> Then the user has a disaster on their hands because they have
> multiple failing disks. 

Or lives in the crazy modern age, where we have rapidly autodegrading
flash storage and hard disks whose heads pop off with no warning. :D

(But seriously, ugh.)

> > I think we're just stuck with failing the whole thing because we can't
> > check the halves of the 8k block independently and there's too much of a
> > combinatoric explosion potential to try to mix and match.
> 
> Yup, user needs to fix their storage before the filesystem can
> attempt recovery.
> 
> > > > We're not planning to take over all 16 bits of the read hint field; just looking for
> > > > feedback about the sanity of the overall approach.
> > > 
> > > It seems conceptually simple enough - the biggest questions I have
> > > are:
> > > 
> > > 	- how does propagation through stacked layers work?
> > 
> > Right now it doesn't, though once we work out how to make stacking work
> > through device mapper (my guess is that simple dm targets like linear
> > and crypt can set the mirror count to min(all underlying devices).
> > 
> > > 	- is it generic/abstract enough to be able to work with
> > > 	  RAID5/6 to trigger verification/recovery from the parity
> > > 	  information in the stripe?
> > 
> > In theory we could supply a raid5 implementation, wherein rw_hint == 0
> > lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> > rw_hint == 2 forces stripe recovery for the given block.
> 
> So more magic numbers to define complex behaviours? :P

Yes!!!

I mean... you /could/ allow devices more expansive reporting of their
redundancy capabilities so that xfs could look at its read-retry-time
budget and try mirrors in decreasing order of likelihood of a good
response:

struct blkdev_redundancy_level {
	unsigned		latency;		/* ms */
	unsigned		chance_of_success;	/* 0 to 100 */
} redundancy_levels[blk_queue_get_mirrors()] = {
	{ 10,	    90 }, /* tries another mirror */
	{ 300,      85 }, /* erasure decoding */
	{ 7000,	    30 }, /* long slow disk scraping via SCT ERC */
	{ 1000000,   5 }, /* boils the oceans looking for data */
};

So at least the indices wouldn't be *completely* magic.  But now we have
the question of how do you populate this table?  And how many callers
are going to do something smarter than the dumb loop that it's worth the
extra code?
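
For what "something smarter than the dumb loop" might look like, here is a
compilable sketch that reuses the illustrative numbers from the table above
and walks the levels in order until a read-retry time budget is exhausted.
The struct and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_redundancy_level {
	unsigned latency_ms;
	unsigned chance_of_success;   /* 0 to 100 */
};

/* Same illustrative numbers as the table above. */
static const struct toy_redundancy_level levels[] = {
	{ 10,      90 },   /* tries another mirror */
	{ 300,     85 },   /* erasure decoding */
	{ 7000,    30 },   /* long slow disk scraping */
	{ 1000000,  5 },   /* last resort */
};

/* How many levels, taken in table order, fit in the latency budget. */
static size_t levels_within_budget(unsigned budget_ms)
{
	const size_t count = sizeof(levels) / sizeof(levels[0]);
	unsigned long spent = 0;
	size_t n = 0;

	while (n < count && spent + levels[n].latency_ms <= budget_ms) {
		spent += levels[n].latency_ms;
		n++;
	}
	return n;
}
```

A caller would then issue retries 1..n and give up, which still leaves open
the question of who fills in the table.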

(Anyone?  Now would be a great time to pipe up.)

> > A trickier scenario that I have no idea how to solve is the question of
> > how to handle dynamic redundancy levels.  We don't have a standard bio
> > error value that means "this mirror is temporarily offline", so if you
> 
> We can get ETIMEDOUT, ENOLINK, EBUSY and EAGAIN from the block layer
> which all indicate temporary errors (see blk_errors[]). Whether the
> specific storage layers are actually using them is another matter...

<nod>

> > have a raid1 of two disks and disk 0 goes offline, the retry loop in xfs
> > will hit the EIO and abort without even asking disk 1.  It's also
> > unclear if we need to designate a second bio error value to mean "this
> > mirror is permanently gone".
> 
> If we have a mirror based retries, we should probably consider EIO
> as "try next mirror", not as a hard failure.

Yeah.

> > [Also insert handwaving about whether or not online fsck will want to
> > control retries and automatic rewrite; I suspect the answer is that it
> > doesn't care.]
> 
> Don't care - have the storage fix itself, then check what comes
> back and fix it from there.

<nod> Admittedly, the auto retry and rewrite are dependent solely on the
lack of EIO and the verifiers giving their blessing, and for the most
part online fsck doesn't go digging through buffers that don't pass the
verifiers, so it'll likely never see any of this anyway.

> > [[Also insert severe handwaving about do we expose this to userspace so
> > that xfs_repair can use it?]]
> 
> I suspect the answer there is through the AIO interfaces....

Y{ay,uck}...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Christoph Hellwig Nov. 28, 2018, 7:37 a.m. UTC | #5

On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> One thing that is going to make this more complex at the XFS layer
> is discontiguous buffers. They require multiple IOs (and therefore
> bios) and so we are going to need to ensure that all the bios use
> the same bi_rw_hint.

Well, in case of raid 1 the load balancing code might actually
map different bios to different initial legs.   What we really need
is to keep the 'index' of each bio.  One good way to achieve that
is to just reuse the bio for the retry instead of allocating a new one.

Christoph Hellwig Nov. 28, 2018, 7:45 a.m. UTC | #6

On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> 	- how does propagation through stacked layers work?

The only way it works is by each layer driving it.  Thus my
recommendation above building on your earlier one to use an index
that is filled by the driver at I/O completion time.

E.g.

	bio_init:		bi_leg = -1

	raid1:			submit bio to lower driver
	raid 1 completion:	set bi_leg to 0 or 1

Now if we want to allow stacking we need to save/restore bi_leg
before submitting to the underlying device.  Which is possible,
but quite a bit of work in the drivers.
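
A toy model of that save/restore, with a made-up struct standing in for the
bio and plain function calls standing in for submission and completion:

```c
#include <assert.h>

#define LEG_UNSET (-1)

struct leg_bio { int bi_leg; };

static void leg_bio_init(struct leg_bio *bio)
{
	bio->bi_leg = LEG_UNSET;
}

/* A raid1 completes and records which of its legs served the read. */
static void leg_raid1_complete(struct leg_bio *bio, int leg)
{
	bio->bi_leg = leg;
}

/*
 * Upper raid1 stacked on a lower raid1: remember the leg this layer chose,
 * let the lower driver use the field, then restore the upper value.
 */
static void leg_stacked_read(struct leg_bio *bio, int upper_leg, int lower_leg)
{
	int saved = upper_leg;

	bio->bi_leg = LEG_UNSET;              /* lower driver starts fresh */
	leg_raid1_complete(bio, lower_leg);   /* lower driver fills in its leg */
	bio->bi_leg = saved;                  /* caller only sees the upper leg */
}
```

Note that in this model the lower layer's leg is discarded on restore, so the
submitter learns nothing about which lower device served the read.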

> 	- is it generic/abstract enough to be able to work with
> 	  RAID5/6 to trigger verification/recovery from the parity
> 	  information in the stripe?

If we get a non -1 bi_leg for parity raid, this is an indicator
that parity rebuild needs to happen.  For multi-parity setups we could
also use different levels there.

Dave Chinner Nov. 28, 2018, 7:46 a.m. UTC | #7

On Tue, Nov 27, 2018 at 11:37:22PM -0800, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> > One thing that is going to make this more complex at the XFS layer
> > is discontiguous buffers. They require multiple IOs (and therefore
> > bios) and so we are going to need to ensure that all the bios use
> > the same bi_rw_hint.
> 
> Well, in case of raid 1 the load balancing code might actually
> map different bios to different initial legs.   What we really need
> is to keep the 'index' or each bio.  One good way to archive that
> is to just reuse the bio for the retry instead of allocating a new one.

Not sure that is practical, because by the time we run the verifier
that discovers the error we've already released and freed all the
bios. And we don't know when we complete the individual bios whether
to keep it or not as the failure may occur in a bio that has not yet
completed.

Maybe we should be chaining bios for discontig buffers rather than
submitting them individually - that keeps the whole chain around
until all bios in the chain have completed, right?

Cheers,

Dave.

Christoph Hellwig Nov. 28, 2018, 7:51 a.m. UTC | #8

On Wed, Nov 28, 2018 at 06:46:13PM +1100, Dave Chinner wrote:
> Maybe we should be chaining bios for discontig buffers rather than
> submitting them individually - that keeps the whole chain around
> until all bios in the chain have completed, right?

No, it doesn't.  It just keeps the head of the chain around.

But we generally submit one buffer per map, only if each map was
bigger than BIO_MAX_PAGES * PAGE_SIZE we'd submit multiple bios.
That should always be bigger than our buffer sizes.

We also have the additional problem that a single bio submitted
by the file system can be split into multiple by the block layer,
which happens for raid 5 at least, but at least that splitting
is driven by the drivers make_request function, so it can do
smarts there.

Andreas Dilger Nov. 28, 2018, 7:38 p.m. UTC | #9

On Nov 27, 2018, at 10:49 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
>> On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
>>> Motivation:
>>> When fs data/metadata checksum mismatch, lower block devices may have other
>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1
>>> but decides that the metadata is garbage, today it will shut down the entire
>>> filesystem without trying any of the other mirrors.  This is a severe
>>> loss of service, and we propose these patches to have XFS try harder to
>>> avoid failure.
>>> 
>>> This patch prototype this mirror retry idea by:
>>> * Adding @nr_mirrors to struct request_queue which is similar as
>>>  blk_queue_nonrot(), filesystem can grab device request queue and check max
>>>  mirrors this block device has.
>>>  Helper functions were also added to get/set the nr_mirrors.
>>> 
>>> * Expanding bi_write_hint to bi_rw_hint, now @bi_rw_hint has three meanings.
>>> 1.Original write_hint.
>>> 2.end_io() will update @bi_rw_hint to reflect which mirror this i/o really happened.
>>> 3.Fs set @bi_rw_hint to force driver e.g raid1 read from a specific mirror.
>>> 
>>> * Modify md/raid1 to support this retry feature.
>>> 
>>> * Add b_rw_hint to xfs_buf
>>>  This patch adds a new field b_rw_hint to xfs_buf.  We will use this to set the
>>>  new bio->bi_rw_hint when submitting the read request, and also to store the
>>>  returned mirror when the read completes
> 
>> the retry iterations. That allows us to let he block layer ot pick
>> whatever leg it wants for the initial read, but if we get a failure
>> we directly control the mirror we retry from and all bios in the
>> buffer go to that same mirror.
>> 	- is it generic/abstract enough to be able to work with
>> 	  RAID5/6 to trigger verification/recovery from the parity
>> 	  information in the stripe?
> 
> In theory we could supply a raid5 implementation, wherein rw_hint == 0
> lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> rw_hint == 2 forces stripe recovery for the given block.

Definitely this API needs to be useful for RAID-5/6 storage as well, and
I don't think that needs too complex an interface to achieve.

Basically, the "nr_mirrors" parameter would instead be "nr_retries" or
similar, so that the caller knows how many possible data combinations
there are to try and validate.  For mirrors this is easy, and as it is
currently implemented.  For RAID-5/6 this would essentially be the
number of data rebuild combinations in the RAID group (e.g. 8 in a
RAID-5 8+1 setup, and 16 in a RAID-6 8+2).

For each call with nr_retries != 0, the MD RAID-5/6 driver would skip
one of the data drives, and rebuild that part of the data from parity.
This wouldn't take too long, since the blocks are already in memory,
they just need the parity to be recomputed in a few different ways to
try and find a combination that returns valid data (e.g. if a drive
failed and the parity also has a latent corrupt sector, not uncommon).

The next step is to have an API that says "retry=N returned the correct
data, rebuild the parity/drive with that combination of devices" so
that the corrupt parity sector isn't used during the rebuild.
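
As a sketch of the single-parity case only (a hypothetical 4+1 stripe with one
byte per drive; the function is illustrative and not md code): retry 0 returns
the data as read, and retry n distrusts data drive n-1 and rebuilds it from
the parity and the remaining data drives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDATA 4   /* hypothetical 4+1 RAID-5 stripe, one byte per drive */

/*
 * retry == 0: return the data as read.
 * retry == n (1..NDATA): skip data drive n-1 and rebuild it by XORing
 * the parity with the other data drives.
 */
static void toy_raid5_read(const uint8_t data[NDATA], uint8_t parity,
			   unsigned retry, uint8_t out[NDATA])
{
	for (size_t i = 0; i < NDATA; i++)
		out[i] = data[i];
	if (retry == 0 || retry > NDATA)
		return;

	uint8_t rebuilt = parity;

	for (size_t i = 0; i < NDATA; i++)
		if (i != retry - 1)
			rebuilt ^= data[i];
	out[retry - 1] = rebuilt;
}
```

The caller would run its verifier on each result and stop at the first retry
value that yields valid data; that value is what a later "rebuild with this
combination" call would need to carry.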

Cheers, Andreas

Bob Liu Dec. 8, 2018, 2:49 p.m. UTC | #10

On 11/28/18 3:45 PM, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
>> 	- how does propagation through stacked layers work?
> 
> The only way it works is by each layering driving it.  Thus my
> recommendation above bilding on your earlier one to use an index
> that is filled by the driver at I/O completion time.
> 
> E.g.
> 
> 	bio_init:		bi_leg = -1
> 
> 	raid1:			submit bio to lower driver
> 	raid 1 completion:	set bi_leg to 0 or 1
> 
> Now if we want to allow stacking we need to save/restore bi_leg
> before submitting to the underlying device.  Which is possible,
> but quite a bit of work in the drivers.
> 

I found this is still very challenging while writing the code.
Saving/restoring bi_leg may not be enough because the drivers don't know how to do fs-metadata verification.

E.g two layer raid1 stacking

fs:                  md0(copies:2)
                     /          \
layer1/raid1   md1(copies:2)    md2(copies:2)
                  /    \          /     \
layer2/raid1   dev0   dev1      dev2    dev3

Assume dev2 is corrupted
 => md2: don't know how to do fs-metadata verify. 
   => md0: fs verify fail, retry md1(preserve md2).
Then md2 will never be retried, even though dev3 may also have the right copy,
unless the upper layer device (md0) can know that the number of copies is 4 instead of 2.
That would also need a way to handle the mapping.
Did I miss something? Thanks!

-Bob

>> 	- is it generic/abstract enough to be able to work with
>> 	  RAID5/6 to trigger verification/recovery from the parity
>> 	  information in the stripe?
> 
> If we get the non -1 bi_leg for paritity raid this is an inidicator
> that parity rebuild needs to happen.  For multi-parity setups we could
> also use different levels there.
>

Darrick J. Wong Dec. 10, 2018, 4:30 a.m. UTC | #11

On Sat, Dec 08, 2018 at 10:49:44PM +0800, Bob Liu wrote:
> On 11/28/18 3:45 PM, Christoph Hellwig wrote:
> > On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
> >> 	- how does propagation through stacked layers work?
> > 
> > The only way it works is by each layering driving it.  Thus my
> > recommendation above bilding on your earlier one to use an index
> > that is filled by the driver at I/O completion time.
> > 
> > E.g.
> > 
> > 	bio_init:		bi_leg = -1
> > 
> > 	raid1:			submit bio to lower driver
> > 	raid 1 completion:	set bi_leg to 0 or 1
> > 
> > Now if we want to allow stacking we need to save/restore bi_leg
> > before submitting to the underlying device.  Which is possible,
> > but quite a bit of work in the drivers.
> > 
> 
> I found it's still very challenge while writing the code.
> save/restore bi_leg may not enough because the drivers don't know how to do fs-metadata verify.
> 
> E.g two layer raid1 stacking
> 
> fs:                  md0(copies:2)
>                      /          \
> layer1/raid1   md1(copies:2)    md2(copies:2)
>                   /    \          /     \
> layer2/raid1   dev0   dev1      dev2    dev3
> 
> Assume dev2 is corrupted
>  => md2: don't know how to do fs-metadata verify. 
>    => md0: fs verify fail, retry md1(preserve md2).
> Then md2 will never be retried even dev3 may also has the right copy.
> Unless the upper layer device(md0) can know the amount of copy is 4 instead of 2? 
> And need a way to handle the mapping.
> Did I miss something? Thanks!

<shrug> It seems reasonable to me that the raid1 layer should set the
number of retries to (number of raid1 mirrors) * min(retry count of all
mirrors) so that the upper layer device (md0) would advertise 4 retry
possibilities instead of 2.
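
A sketch of that arithmetic, plus one possible way to handle the mapping Bob
asks about. The flat 0-based retry index and the divide/modulo routing are
assumptions for illustration, not something proposed in the thread:

```c
#include <assert.h>

struct toy_route {
	unsigned leg;           /* which mirror of this raid1 */
	unsigned lower_retry;   /* retry index passed down to that mirror */
};

/* Retries advertised by a raid1: its legs times the smallest lower count. */
static unsigned toy_raid1_nr_retries(const unsigned *lower_retries,
				     unsigned nr_legs)
{
	unsigned min = lower_retries[0];   /* caller guarantees nr_legs >= 1 */

	for (unsigned i = 1; i < nr_legs; i++)
		if (lower_retries[i] < min)
			min = lower_retries[i];
	return nr_legs * min;
}

/* Assumed mapping: flat 0-based retry index to (leg, lower retry index). */
static struct toy_route toy_raid1_route(unsigned retry, unsigned min_lower)
{
	struct toy_route r = { retry / min_lower, retry % min_lower };

	return r;
}
```

For the md0 example, two legs with two copies each give 4 retries, and retry
index 3 routes to the second leg (md2) and its second copy (dev3).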

--D


> -Bob
> 
> >> 	- is it generic/abstract enough to be able to work with
> >> 	  RAID5/6 to trigger verification/recovery from the parity
> >> 	  information in the stripe?
> > 
> > If we get the non -1 bi_leg for paritity raid this is an inidicator
> > that parity rebuild needs to happen.  For multi-parity setups we could
> > also use different levels there.
> > 
>
