
[2/8] btrfs: zoned: defer advancing meta_write_pointer

Message ID 0c1e65736a8263e514ffb6f7ce8dd1047fbb916a.1690171333.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series btrfs: zoned: write-time activation of metadata block group

Commit Message

Naohiro Aota July 24, 2023, 4:18 a.m. UTC
We currently advance the meta_write_pointer in
btrfs_check_meta_write_pointer(). That makes it necessary to revert it
when locking the buffer fails. Instead, we can advance it just before
sending the buffer.

This is also necessary for the following commit, which needs to release
the zoned_meta_io_lock to allow IOs to come in and wait for them to fill
the currently active block group. If we advanced the meta_write_pointer
before locking the extent buffer, the following extent buffer could pass
the meta_write_pointer check, resulting in an unaligned write failure.

Advancing the pointer is still thread-safe as the extent buffer is locked.
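
With this, the submit path becomes as follows (an abridged sketch of the
extent_io.c hunk below):

	if (!lock_extent_buffer_for_io(eb, wbc)) {
		/* Nothing to revert: the write pointer is untouched so far. */
		free_extent_buffer(eb);
		return 0;
	}
	/* The eb is locked, so advancing the pointer here is race-free. */
	bg->meta_write_pointer += eb->len;
	write_one_eb(eb, wbc);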

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 11 ++++++-----
 fs/btrfs/zoned.c     | 14 +++-----------
 fs/btrfs/zoned.h     |  8 --------
 3 files changed, 9 insertions(+), 24 deletions(-)

Comments

Christoph Hellwig July 24, 2023, 3:04 p.m. UTC | #1
> @@ -1773,23 +1773,15 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
>  		*cache_ret = cache;
>  	}
>  
> +	/* Someone already started writing this eb. */
> +	if (cache->meta_write_pointer > eb->start)
> +		return true;

This is actually a behavior change and not mentioned in the commit log.
How is this supposed to work?  If eb->start is before the write pointer
we're going to get a write reordering problem.

The other bits in the patch look like a very nice improvement, though.
Naohiro Aota July 25, 2023, 8:59 a.m. UTC | #2
On Mon, Jul 24, 2023 at 08:04:10AM -0700, Christoph Hellwig wrote:
> > @@ -1773,23 +1773,15 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
> >  		*cache_ret = cache;
> >  	}
> >  
> > +	/* Someone already started writing this eb. */
> > +	if (cache->meta_write_pointer > eb->start)
> > +		return true;
> 
> This is actually a behavior change and not mentioned in the commit log.
> How is this supposed to work?  If eb->start is before the write pointer
> we're going to get a write reordering problem.

This won't happen at this point, but it can happen after the write-time
activation patch. When one process waits for the current metadata writeback
to finish, another process can pick up the same contiguous extent buffers.
That region will be processed by the first process, and the second process
then sees a write pointer larger than the start of the extent buffer it
holds.

In detail, consider that we have two extent buffers (eb#0 and eb#1) to
write, and that they are going to a new, non-active block group.

Process A                                         Process B

btree_write_cache_pages()
  btrfs_zoned_meta_io_lock()
  filemap_get_folios_tag() returns pages for the two ebs.
  submit_eb_page()
    btrfs_check_meta_write_pointer()
      # unlock and wait for the writeback
      btrfs_zoned_meta_io_unlock()
      wait_eb_writebacks()
                                                  btree_write_cache_pages()
                                                    btrfs_zoned_meta_io_lock()
                                                    filemap_get_folios_tag() returns pages for the two ebs.
                                                    submit_eb_page()
                                                      btrfs_check_meta_write_pointer()
                                                        # We may still need to wait.
                                                        btrfs_zoned_meta_io_unlock()
                                                        wait_eb_writebacks()
      # Now, writeback finished.
      btrfs_zoned_meta_io_lock()
    lock_extent_buffer_for_io(eb#0)
    # write_pointer == end of eb#0
    write_one_eb(eb#0)
    ...
    # and write eb#1 as well
    # write_pointer == end of eb#1
    write_one_eb(eb#1)
  btrfs_zoned_meta_io_unlock()
                                                  btrfs_zoned_meta_io_lock()
                                                    lock_extent_buffer_for_io(eb#0) fails, return
                                                  # and proceed to eb#1
                                                  submit_eb_page(eb#1)
                                                    btrfs_check_meta_write_pointer(eb#1)
                                                      # Now, the write pointer is larger than
                                                      # eb#1->start, hitting the new condition.

Returning true in this case should itself be OK, because that eb is
rejected by lock_extent_buffer_for_io() in the end. We cannot simply return
false here, because that may lead to returning -EAGAIN in certain cases. As
for the return value, we can move the wbc check from submit_eb_page() to
btrfs_check_meta_write_pointer() and return a proper int value, e.g.,
returning -EBUSY here and having submit_eb_page() convert it to
free_extent_buffer(eb) and return 0.
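
A rough sketch of that direction (illustrative only; the final names and
plumbing may differ):

	/* Sketch: make the check return an int instead of bool. */
	static int check_meta_write_pointer(struct btrfs_block_group *cache,
					    struct extent_buffer *eb)
	{
		if (cache->meta_write_pointer > eb->start)
			return -EBUSY;	/* someone already writes this eb */
		if (cache->meta_write_pointer < eb->start)
			return -EAGAIN;	/* hole before this eb, not yet */
		return 0;		/* aligned with the write pointer */
	}

	/* Then, in submit_eb_page(), roughly: */
	ret = check_meta_write_pointer(cache, eb);
	if (ret == -EBUSY) {
		/* Already handled by another process, just skip it. */
		free_extent_buffer(eb);
		return 0;
	}
	if (ret)
		return ret;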

But, yeah, these lines should go with the write-time activation patch.

> 
> The other bits in the patch look like a very nice improvement, though.
>
Christoph Hellwig July 26, 2023, 1:06 p.m. UTC | #3
On Tue, Jul 25, 2023 at 08:59:17AM +0000, Naohiro Aota wrote:
> This won't happen at this point, but it can happen after the write-time
> activation patch.

Let's move the check there and very clearly document it in that patch
instead of slipping it in here.

> Returning true in this case should itself be OK, because that eb is
> rejected by lock_extent_buffer_for_io() in the end. We cannot simply return
> false here, because that may lead to returning -EAGAIN in certain cases. As
> for the return value, we can move the wbc check from submit_eb_page() to
> btrfs_check_meta_write_pointer() and return a proper int value, e.g.,
> returning -EBUSY here and having submit_eb_page() convert it to
> free_extent_buffer(eb) and return 0.

That feels like the right thing to do.  What I think would be really
good in the long run is to actually avoid the write_pointer holes
entirely by never cancelling buffers for zoned file systems, but instead
move the zeroing that's currently done in btrfs_redirty_list_add() into
btrfs_clear_buffer_dirty().  I tried this in a naive way and it failed,
but I plan to get back to it eventually.
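
Roughly, the idea would look like this (an untested sketch, not what I
actually tried):

	/*
	 * Sketch: instead of redirtying cancelled buffers via
	 * btrfs_redirty_list_add(), never really cancel them on zoned
	 * filesystems.  Zero the buffer and keep it dirty so it is still
	 * written out in order and leaves no hole at the write pointer.
	 */
	void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
				      struct extent_buffer *eb)
	{
		if (btrfs_is_zoned(eb->fs_info)) {
			/* The buffer stays dirty, only its content is zeroed. */
			memzero_extent_buffer(eb, 0, eb->len);
			return;
		}

		/* Non-zoned: clear the dirty bits as today. */
		...
	}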

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c7a88d2b5555..46a0b5357009 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1910,15 +1910,16 @@ static int submit_eb_page(struct page *page, struct writeback_control *wbc,
 	*eb_context = eb;
 
 	if (!lock_extent_buffer_for_io(eb, wbc)) {
-		btrfs_revert_meta_write_pointer(*bg_context, eb);
 		free_extent_buffer(eb);
 		return 0;
 	}
 	if (*bg_context) {
-		/*
-		 * Implies write in zoned mode. Mark the last eb in a block group.
-		 */
-		btrfs_schedule_zone_finish_bg(*bg_context, eb);
+		/* Implies write in zoned mode. */
+		struct btrfs_block_group *bg = *bg_context;
+
+		/* Mark the last eb in the block group. */
+		btrfs_schedule_zone_finish_bg(bg, eb);
+		bg->meta_write_pointer += eb->len;
 	}
 	write_one_eb(eb, wbc);
 	free_extent_buffer(eb);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 58bd2de4026d..3f8f5e8c28a9 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1773,23 +1773,15 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 		*cache_ret = cache;
 	}
 
+	/* Someone already started writing this eb. */
+	if (cache->meta_write_pointer > eb->start)
+		return true;
 	if (cache->meta_write_pointer != eb->start)
 		return false;
-	cache->meta_write_pointer = eb->start + eb->len;
 
 	return true;
 }
 
-void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
-				     struct extent_buffer *eb)
-{
-	if (!btrfs_is_zoned(eb->fs_info) || !cache)
-		return;
-
-	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
-	cache->meta_write_pointer = eb->start;
-}
-
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length)
 {
 	if (!btrfs_dev_is_sequential(device, physical))
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 27322b926038..42a4df94dc29 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -61,8 +61,6 @@ void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb,
 				    struct btrfs_block_group **cache_ret);
-void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
-				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
 int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
 				  u64 physical_start, u64 physical_pos);
@@ -196,12 +194,6 @@ static inline bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 	return true;
 }
 
-static inline void btrfs_revert_meta_write_pointer(
-						struct btrfs_block_group *cache,
-						struct extent_buffer *eb)
-{
-}
-
 static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
 					    u64 physical, u64 length)
 {