diff mbox series

[03/15] btrfs: look at full bi_io_vec for repair decision

Message ID c0f65f07b18eee7cef4e0b0b439a45ae437a11c6.1583789410.git.osandov@fb.com (mailing list archive)
State New, archived
Headers show
Series btrfs: read repair/direct I/O improvements | expand

Commit Message

Omar Sandoval March 9, 2020, 9:32 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

Read repair does two things: it finds a good copy of data to return to
the reader, and it corrects the bad copy on disk. If a read of multiple
sectors has an I/O error, repair does an extra "validation" step that
issues a separate read for each sector. This allows us to find the exact
failing sectors and only rewrite those.

This heuristic is implemented in
bio_readpage_error()/btrfs_check_repairable() as:

	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
	if (failed_bio_pages > 1)
		do validation

However, at this point, bi_iter may have already been advanced. This
means that we'll skip the validation step and rewrite the entire failed
read.

Fix it by getting the actual size from the biovec (which we can do
because this is only called for non-cloned bios, although that will
change in a later commit).

Fixes: 8a2ee44a371c ("btrfs: look at bi_size for repair decisions")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c | 28 ++++++++++++++++++++++------
 fs/btrfs/extent_io.h |  5 +++--
 2 files changed, 25 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig March 10, 2020, 4:33 p.m. UTC | #1
On Mon, Mar 09, 2020 at 02:32:29PM -0700, Omar Sandoval wrote:
> +	/*
> +	 * We need to validate each sector individually if the I/O was for
> +	 * multiple sectors.
> +	 */
> +	len = 0;
> +	for (i = 0; i < failed_bio->bi_vcnt; i++) {
> +		len += failed_bio->bi_io_vec[i].bv_len;
> +		if (len > inode->i_sb->s_blocksize) {
> +			need_validation = true;
> +			break;
> +		}

So given that btrfs is the I/O submitter iterating over all bio_vecs
should probably be fine.  That being said I deeply dislike the idea
of having code like this inside the file system.  Maybe we can add
a helper like:

u32 bio_size_all(struct bio *bio)
{
	u32 len, i;

	for (i = 0; i < bio->bi_vcnt; i++)
		len += bio->bi_io_vec[i].bv_len;
	return len;
}

in the core block code, with a kerneldoc comment documenting the
exact use cases, and then use that?
Omar Sandoval March 11, 2020, 9:07 a.m. UTC | #2
On Tue, Mar 10, 2020 at 05:33:19PM +0100, Christoph Hellwig wrote:
> On Mon, Mar 09, 2020 at 02:32:29PM -0700, Omar Sandoval wrote:
> > +	/*
> > +	 * We need to validate each sector individually if the I/O was for
> > +	 * multiple sectors.
> > +	 */
> > +	len = 0;
> > +	for (i = 0; i < failed_bio->bi_vcnt; i++) {
> > +		len += failed_bio->bi_io_vec[i].bv_len;
> > +		if (len > inode->i_sb->s_blocksize) {
> > +			need_validation = true;
> > +			break;
> > +		}
> 
> So given that btrfs is the I/O submitter iterating over all bio_vecs
> should probably be fine.  That being said I deeply dislike the idea
> of having code like this inside the file system.  Maybe we can add
> a helper like:
> 
> u32 bio_size_all(struct bio *bio)
> {
> 	u32 len, i;
> 
> 	for (i = 0; i < bio->bi_vcnt; i++)
> 		len += bio->bi_io_vec[i].bv_len;
> 	return len;
> }
> 
> in the core block code, with a kerneldoc comment documenting the
> exact use cases, and then use that?

That works for me. I was microoptimizing here since I can stop iterating
once I know that the bio is greater than one sector, but since this is
for the rare repair case, it really doesn't matter.

On the other hand, a bio_for_each_bvec_all() helper would feel less
intrusive into the bio guts and also support bailing early. I'm fine
either way. Thoughts?
Christoph Hellwig March 16, 2020, 10:48 a.m. UTC | #3
On Wed, Mar 11, 2020 at 02:07:47AM -0700, Omar Sandoval wrote:
> That works for me. I was microoptimizing here since I can stop iterating
> once I know that the bio is greater than one sector, but since this is
> for the rare repair case, it really doesn't matter.
> 
> On the other hand, a bio_for_each_bvec_all() helper would feel less
> intrusive into the bio guts and also support bailing early. I'm fine
> either way. Thoughts?

I don't really care too much as long as we don't open code the bi_io_vec
access.
Nikolay Borisov March 17, 2020, 2:38 p.m. UTC | #4
On 9.03.20 г. 23:32 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Read repair does two things: it finds a good copy of data to return to
> the reader, and it corrects the bad copy on disk. If a read of multiple
> sectors has an I/O error, repair does an extra "validation" step that
> issues a separate read for each sector. This allows us to find the exact
> failing sectors and only rewrite those.
> 
> This heuristic is implemented in
> bio_readpage_error()/btrfs_check_repairable() as:
> 
> 	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
> 	if (failed_bio_pages > 1)
> 		do validation
> 
> However, at this point, bi_iter may have already been advanced. This
> means that we'll skip the validation step and rewrite the entire failed
> read.
> 
> Fix it by getting the actual size from the biovec (which we can do
> because this is only called for non-cloned bios, although that will
> change in a later commit).
> 
> Fixes: 8a2ee44a371c ("btrfs: look at bi_size for repair decisions")
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/extent_io.c | 28 ++++++++++++++++++++++------
>  fs/btrfs/extent_io.h |  5 +++--
>  2 files changed, 25 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 837262d54e28..279731bff0a8 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2528,8 +2528,9 @@ int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
>  	return 0;
>  }
>  
> -bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
> -			   struct io_failure_record *failrec, int failed_mirror)
> +bool btrfs_check_repairable(struct inode *inode, bool need_validation,

nit: While at it this function can be made static. It's only used in
extent_io.c and it's defined before its sole caller - bio_readpage_error.
diff mbox series

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 837262d54e28..279731bff0a8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2528,8 +2528,9 @@  int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 	return 0;
 }
 
-bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
-			   struct io_failure_record *failrec, int failed_mirror)
+bool btrfs_check_repairable(struct inode *inode, bool need_validation,
+			    struct io_failure_record *failrec,
+			    int failed_mirror)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	int num_copies;
@@ -2552,7 +2553,7 @@  bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
 	 *	a) deliver good data to the caller
 	 *	b) correct the bad sectors on disk
 	 */
-	if (failed_bio_pages > 1) {
+	if (need_validation) {
 		/*
 		 * to fulfill b), we need to know the exact failing sectors, as
 		 * we don't want to rewrite any more than the failed ones. thus,
@@ -2638,11 +2639,13 @@  static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	bool need_validation = false;
+	u64 len;
+	int i;
 	struct bio *bio;
 	int read_mode = 0;
 	blk_status_t status;
 	int ret;
-	unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
 
 	BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
@@ -2650,13 +2653,26 @@  static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	if (ret)
 		return ret;
 
-	if (!btrfs_check_repairable(inode, failed_bio_pages, failrec,
+	/*
+	 * We need to validate each sector individually if the I/O was for
+	 * multiple sectors.
+	 */
+	len = 0;
+	for (i = 0; i < failed_bio->bi_vcnt; i++) {
+		len += failed_bio->bi_io_vec[i].bv_len;
+		if (len > inode->i_sb->s_blocksize) {
+			need_validation = true;
+			break;
+		}
+	}
+
+	if (!btrfs_check_repairable(inode, need_validation, failrec,
 				    failed_mirror)) {
 		free_io_failure(failure_tree, tree, failrec);
 		return -EIO;
 	}
 
-	if (failed_bio_pages > 1)
+	if (need_validation)
 		read_mode |= REQ_FAILFAST_DEV;
 
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 234622101230..64e176995af2 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -312,8 +312,9 @@  struct io_failure_record {
 };
 
 
-bool btrfs_check_repairable(struct inode *inode, unsigned failed_bio_pages,
-			    struct io_failure_record *failrec, int fail_mirror);
+bool btrfs_check_repairable(struct inode *inode, bool need_validation,
+			    struct io_failure_record *failrec,
+			    int failed_mirror);
 struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
 				    struct io_failure_record *failrec,
 				    struct page *page, int pg_offset, int icsum,