diff mbox

[3/6] Btrfs: fix race setting block group readonly during device replace

Message ID 1463719487-1056-1-git-send-email-fdmanana@kernel.org (mailing list archive)
State Accepted
Headers show

Commit Message

Filipe Manana May 20, 2016, 4:44 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However just before we set
the block group to readonly mode some concurrent task might have already
allocated an extent from it or decided it could perform a nocow write
into one of its extents, which can make the device replace process to
miss copying an extent since it uses the extent tree's commit root to
search for extents and only once it finishes searching for all extents
belonging to the block group it does set the left cursor to the logical
end address of the block group - this is a problem if the respective
ordered extents finish while we are searching for extents using the
extent tree's commit root and no transaction commit happens while we
are iterating the tree, since it's the delayed references created by the
ordered extents (when they complete) that insert the extent items into
the extent tree (using the non-commit root of course).
Example:

          CPU 1                                            CPU 2

 btrfs_dev_replace_start()
   btrfs_scrub_dev()
     scrub_enumerate_chunks()
       --> finds device extent belonging
           to block group X

                               <transaction N starts>

                                                      starts buffered write
                                                      against some inode

                                                      writepages is run against
                                                      that inode forcing dellaloc
                                                      to run

                                                      btrfs_writepages()
                                                        extent_writepages()
                                                          extent_write_cache_pages()
                                                            __extent_writepage()
                                                              writepage_delalloc()
                                                                run_delalloc_range()
                                                                  cow_file_range()
                                                                    btrfs_reserve_extent()
                                                                      --> allocates an extent
                                                                          from block group X
                                                                          (which is not yet
                                                                           in RO mode)
                                                                    btrfs_add_ordered_extent()
                                                                      --> creates ordered extent Y
                                                        flush_epd_write_bio()
                                                          --> bio against the extent from
                                                              block group X is submitted

       btrfs_inc_block_group_ro(bg X)
         --> sets block group X to readonly

       scrub_chunk(bg X)
         scrub_stripe(device extent from srcdev)
           --> keeps searching for extent items
               belonging to the block group using
               the extent tree's commit root
           --> it never blocks due to
               fs_info->scrub_pause_req as no
               one tries to commit transaction N
           --> copies all extents found from the
               source device into the target device
           --> finishes search loop

                                                        bio completes

                                                        ordered extent Y completes
                                                        and creates delayed data
                                                        reference which will add an
                                                        extent item to the extent
                                                        tree when run (typically
                                                        at transaction commit time)

                                                          --> so the task doing the
                                                              scrub/device replace
                                                              at CPU 1 misses this
                                                              and does not copy this
                                                              extent into the new/target
                                                              device

       btrfs_dec_block_group_ro(bg X)
         --> turns block group X back to RW mode

       dev_replace->cursor_left is set to the
       logical end offset of block group X

So fix this by waiting for all cow and nocow writes after setting a block
group to readonly mode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c |  6 +++++-
 fs/btrfs/ordered-data.h |  2 +-
 fs/btrfs/scrub.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

Comments

Josef Bacik May 20, 2016, 3:20 p.m. UTC | #1
On Fri, May 20, 2016 at 12:44 AM,  <fdmanana@kernel.org> wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When we do a device replace, for each device extent we find from the
> source device, we set the corresponding block group to readonly mode to
> prevent writes into it from happening while we are copying the device
> extent from the source to the target device. However just before we set
> the block group to readonly mode some concurrent task might have already
> allocated an extent from it or decided it could perform a nocow write
> into one of its extents, which can make the device replace process to
> miss copying an extent since it uses the extent tree's commit root to
> search for extents and only once it finishes searching for all extents
> belonging to the block group it does set the left cursor to the logical
> end address of the block group - this is a problem if the respective
> ordered extents finish while we are searching for extents using the
> extent tree's commit root and no transaction commit happens while we
> are iterating the tree, since it's the delayed references created by the
> ordered extents (when they complete) that insert the extent items into
> the extent tree (using the non-commit root of course).
> Example:
>
>           CPU 1                                            CPU 2
>
>  btrfs_dev_replace_start()
>    btrfs_scrub_dev()
>      scrub_enumerate_chunks()
>        --> finds device extent belonging
>            to block group X
>
>                                <transaction N starts>
>
>                                                       starts buffered write
>                                                       against some inode
>
>                                                       writepages is run against
>                                                       that inode forcing dellaloc
>                                                       to run
>
>                                                       btrfs_writepages()
>                                                         extent_writepages()
>                                                           extent_write_cache_pages()
>                                                             __extent_writepage()
>                                                               writepage_delalloc()
>                                                                 run_delalloc_range()
>                                                                   cow_file_range()
>                                                                     btrfs_reserve_extent()
>                                                                       --> allocates an extent
>                                                                           from block group X
>                                                                           (which is not yet
>                                                                            in RO mode)
>                                                                     btrfs_add_ordered_extent()
>                                                                       --> creates ordered extent Y
>                                                         flush_epd_write_bio()
>                                                           --> bio against the extent from
>                                                               block group X is submitted
>
>        btrfs_inc_block_group_ro(bg X)
>          --> sets block group X to readonly
>
>        scrub_chunk(bg X)
>          scrub_stripe(device extent from srcdev)
>            --> keeps searching for extent items
>                belonging to the block group using
>                the extent tree's commit root
>            --> it never blocks due to
>                fs_info->scrub_pause_req as no
>                one tries to commit transaction N
>            --> copies all extents found from the
>                source device into the target device
>            --> finishes search loop
>
>                                                         bio completes
>
>                                                         ordered extent Y completes
>                                                         and creates delayed data
>                                                         reference which will add an
>                                                         extent item to the extent
>                                                         tree when run (typically
>                                                         at transaction commit time)
>
>                                                           --> so the task doing the
>                                                               scrub/device replace
>                                                               at CPU 1 misses this
>                                                               and does not copy this
>                                                               extent into the new/target
>                                                               device
>
>        btrfs_dec_block_group_ro(bg X)
>          --> turns block group X back to RW mode
>
>        dev_replace->cursor_left is set to the
>        logical end offset of block group X
>
> So fix this by waiting for all cow and nocow writes after setting a block
> group to readonly mode.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks!

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 5591704..e96634a 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -718,12 +718,13 @@  int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr,
 	return count;
 }
 
-void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr,
+int btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr,
 			      const u64 range_start, const u64 range_len)
 {
 	struct btrfs_root *root;
 	struct list_head splice;
 	int done;
+	int total_done = 0;
 
 	INIT_LIST_HEAD(&splice);
 
@@ -742,6 +743,7 @@  void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr,
 		done = btrfs_wait_ordered_extents(root, nr,
 						  range_start, range_len);
 		btrfs_put_fs_root(root);
+		total_done += done;
 
 		spin_lock(&fs_info->ordered_root_lock);
 		if (nr != -1) {
@@ -752,6 +754,8 @@  void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr,
 	list_splice_tail(&splice, &fs_info->ordered_roots);
 	spin_unlock(&fs_info->ordered_root_lock);
 	mutex_unlock(&fs_info->ordered_operations_mutex);
+
+	return total_done;
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 8ef1262..700a594 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -199,7 +199,7 @@  int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr,
 			   u32 *sum, int len);
 int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr,
 			       const u64 range_start, const u64 range_len);
-void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr,
+int btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr,
 			      const u64 range_start, const u64 range_len);
 void btrfs_get_logged_extents(struct inode *inode,
 			      struct list_head *logged_list,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 4678f03..a181b52 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3580,6 +3580,46 @@  int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 */
 		scrub_pause_on(fs_info);
 		ret = btrfs_inc_block_group_ro(root, cache);
+		if (!ret && is_dev_replace) {
+			/*
+			 * If we are doing a device replace wait for any tasks
+			 * that started dellaloc right before we set the block
+			 * group to RO mode, as they might have just allocated
+			 * an extent from it or decided they could do a nocow
+			 * write. And if any such tasks did that, wait for their
+			 * ordered extents to complete and then commit the
+			 * current transaction, so that we can later see the new
+			 * extent items in the extent tree - the ordered extents
+			 * create delayed data references (for cow writes) when
+			 * they complete, which will be run and insert the
+			 * corresponding extent items into the extent tree when
+			 * we commit the transaction they used when running
+			 * inode.c:btrfs_finish_ordered_io(). We later use
+			 * the commit root of the extent tree to find extents
+			 * to copy from the srcdev into the tgtdev, and we don't
+			 * want to miss any new extents.
+			 */
+			btrfs_wait_block_group_reservations(cache);
+			btrfs_wait_nocow_writers(cache);
+			ret = btrfs_wait_ordered_roots(fs_info, -1,
+						       cache->key.objectid,
+						       cache->key.offset);
+			if (ret > 0) {
+				struct btrfs_trans_handle *trans;
+
+				trans = btrfs_join_transaction(root);
+				if (IS_ERR(trans))
+					ret = PTR_ERR(trans);
+				else
+					ret = btrfs_commit_transaction(trans,
+								       root);
+				if (ret) {
+					scrub_pause_off(fs_info);
+					btrfs_put_block_group(cache);
+					break;
+				}
+			}
+		}
 		scrub_pause_off(fs_info);
 
 		if (ret == 0) {