[RFC,2/4] iomap: optional zero range dirty folio processing

Message ID 20241119154656.774395-3-bfoster@redhat.com (mailing list archive)
State New
Series iomap: zero range folio batch processing prototype

Commit Message

Brian Foster Nov. 19, 2024, 3:46 p.m. UTC
The only way zero range can currently process unwritten mappings
with dirty pagecache is to check whether the range is dirty before
mapping lookup and then flush when at least one underlying mapping
is unwritten. This ordering is required to prevent iomap lookup from
racing with folio writeback and reclaim.

Since zero range can skip ranges of unwritten mappings that are
clean in cache, this operation can be improved by allowing the
filesystem to provide the set of folios backed by such mappings that
require zeroing. In turn, rather than flushing or iterating file
offsets, zero range can process each folio as normal and skip any
clean or uncached ranges in between.

As a first pass prototype solution, stuff a folio_batch in struct
iomap, provide a helper that the fs can use to populate the batch at
lookup time, and define a flag to indicate the mapping was checked.
Note that since the helper is intended for use under internal fs
locks, it trylocks folios in order to filter out clean folios. This
loosely follows the logic from filemap_range_has_writeback().
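
As a rough illustration of the intended usage (hypothetical call site,
not part of this patch; the fs-side names and locking are
assumptions), a filesystem lookup path could populate the batch for a
dirty unwritten mapping and trim the request to the returned offset:

        /* hypothetical ->iomap_begin fragment for a zero range op */
        if (imap_is_unwritten && (flags & IOMAP_ZERO)) {
                loff_t batch_end;

                batch_end = iomap_fill_dirty_folios(inode, iomap, offset,
                                                    length);
                iomap->flags |= IOMAP_F_HAS_FOLIOS;
                /* the batch may stop short if it fills before the end */
                length = min_t(loff_t, length, batch_end - offset);
        }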

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 73 ++++++++++++++++++++++++++++++++++++++++--
 fs/iomap/iter.c        |  2 ++
 include/linux/iomap.h  |  5 +++
 3 files changed, 78 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Nov. 20, 2024, 8:43 a.m. UTC | #1
On Tue, Nov 19, 2024 at 10:46:54AM -0500, Brian Foster wrote:
> +loff_t
> +iomap_fill_dirty_folios(
> +	struct inode		*inode,
> +	struct iomap		*iomap,

If you pass in the batch directly instead of the iomap, this is
completely generic and could go into filemap.c.  Adding willy
and linux-mm for these kinds of things also tends to help get
good review feedback and often improvements.
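
Something along these lines (just a sketch of the signature, not
tested) would keep struct iomap out of it:

        loff_t filemap_fill_dirty_folios(struct address_space *mapping,
                        struct folio_batch *fbatch, loff_t offset,
                        loff_t length);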

> +	loff_t			offset,
> +	loff_t			length)
> +{
> +	struct address_space	*mapping = inode->i_mapping;
> +	struct folio_batch	fbatch;
> +	pgoff_t			start, end;
> +	loff_t			end_pos;
> +
> +	folio_batch_init(&fbatch);
> +	folio_batch_init(&iomap->fbatch);
> +
> +	end_pos = offset + length;
> +	start = offset >> PAGE_SHIFT;
> +	end = (end_pos - 1) >> PAGE_SHIFT;

Nit: initializing these at declaration time makes the code easier to
read (at least for me :)).
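
E.g. (untested, just the same values moved to their declarations):

        loff_t end_pos = offset + length;
        pgoff_t start = offset >> PAGE_SHIFT;
        pgoff_t end = (end_pos - 1) >> PAGE_SHIFT;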

> +
> +	while (filemap_get_folios(mapping, &start, end, &fbatch) &&
> +	       folio_batch_space(&iomap->fbatch)) {
> +		struct folio *folio;
> +		while ((folio = folio_batch_next(&fbatch))) {
> +			if (folio_trylock(folio)) {
> +				bool clean = !folio_test_dirty(folio) &&
> +					     !folio_test_writeback(folio);
> +				folio_unlock(folio);
> +				if (clean)
> +					continue;

What does the lock protect here given that it can become stale as soon
as we unlock?

Note that there also is a filemap_get_folios_tag that only looks up
folios with the right tag (dirty or writeback).  Currently it only
supports a single tag, which would not be helpful here, but this might
be worth talking to the pagecache and xarray maintainer about.
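
For reference, the current single-tag lookup is declared roughly as:

        unsigned filemap_get_folios_tag(struct address_space *mapping,
                        pgoff_t *start, pgoff_t end, xa_mark_t tag,
                        struct folio_batch *fbatch);
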
Brian Foster Nov. 20, 2024, 2:34 p.m. UTC | #2
On Wed, Nov 20, 2024 at 12:43:47AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 19, 2024 at 10:46:54AM -0500, Brian Foster wrote:
> > +loff_t
> > +iomap_fill_dirty_folios(
> > +	struct inode		*inode,
> > +	struct iomap		*iomap,
> 
> If you pass in the batch directly instead of the iomap this is
> completely generic and could go into filemap.c.  Adding willy
> and linux-mm for these kinds of things also tends to help to
> get good review feedback and often improvements.
> 

That is a good idea, but I'm not sure this will remain generic. One of
the tradeoffs the current patch makes is that for sub-folio block
granularity (whether due to small blocks or large folios), we zero any
subrange so long as the high-level folio is dirty.

This causes something like generic/445 to fail on small block sizes
because we explicitly zero a subrange that technically wasn't written
to (rather, some other part of the same folio was), and therefore
SEEK_DATA finds data where the test workload expected a hole. I was
thinking of improving this by using ifs_find_dirty_range() here to
further filter out unwritten subranges based on the target range
being zeroed, but I haven't done that yet.
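
Roughly what I had in mind (untested sketch; it assumes the dirty
range helper used by writeback can be reused from the fill loop, and
the exact helper and signature are from memory):

        /* inside the fill loop: skip folios with no dirty blocks in range */
        u64 rstart = max_t(u64, folio_pos(folio), offset);
        u64 rend = min_t(u64, folio_pos(folio) + folio_size(folio), end_pos);

        if (!iomap_find_dirty_range(folio, &rstart, rend))
                continue;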

That said, I suspect that still won't be perfect and some spurious
zeroing may still be possible. Personally, I think it might be fine to
leave this as is and just fix up the fstests test to be a little less
strict about SEEK_DATA at block granularity. We have to write back the
folio anyway, so I'm not sure it makes much practical difference. So
if there's a preference to keep this generic and accept that
functional tradeoff, I'm certainly fine with going that route.
Thoughts?

(Hmm, thinking more about it, it may also be possible to fix this up
via a post-process pass on the first/last folios in the batch, so
maybe this doesn't technically have to be an either/or choice.)

> > +	loff_t			offset,
> > +	loff_t			length)
> > +{
> > +	struct address_space	*mapping = inode->i_mapping;
> > +	struct folio_batch	fbatch;
> > +	pgoff_t			start, end;
> > +	loff_t			end_pos;
> > +
> > +	folio_batch_init(&fbatch);
> > +	folio_batch_init(&iomap->fbatch);
> > +
> > +	end_pos = offset + length;
> > +	start = offset >> PAGE_SHIFT;
> > +	end = (end_pos - 1) >> PAGE_SHIFT;
> 
> Nit: initializing these at declaration time makes the code easier to
> read (at least for me :)).
> 

Ok.

> > +
> > +	while (filemap_get_folios(mapping, &start, end, &fbatch) &&
> > +	       folio_batch_space(&iomap->fbatch)) {
> > +		struct folio *folio;
> > +		while ((folio = folio_batch_next(&fbatch))) {
> > +			if (folio_trylock(folio)) {
> > +				bool clean = !folio_test_dirty(folio) &&
> > +					     !folio_test_writeback(folio);
> > +				folio_unlock(folio);
> > +				if (clean)
> > +					continue;
> 
> What does the lock protect here given that it can become stale as soon
> as we unlock?
> 

I thought the lock was needed to correctly test for dirty and writeback,
because dirty is cleared before writeback is set under folio lock. That
said, I suppose this could also just do the folio_test_locked() thing
that filemap_range_has_writeback() does.
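
I.e., drop the trylock and treat locked folios as possibly dirty,
something like (untested):

        if (!folio_test_dirty(folio) && !folio_test_locked(folio) &&
            !folio_test_writeback(folio))
                continue;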

Just note that this is possibly affected by how we decide to handle the
sub-folio case above, as I think we'd also need the lock for looking at
the iomap_folio_state.

> Note that there also is a filemap_get_folios_tag that only looks up
> folios with the right tag (dirty or writeback).  Currently it only
> supports a single tag, which would not be helpful here, but this might
> be worth talking to the pagecache and xarray maintainer about.
> 

Yeah, that was one of the first things I looked into before writing the
fill helper, but atm it didn't look possible to check for an OR
combination of xarray tags. That might be a nice future improvement if
reasonably possible, or if somebody who knows that code better wants to
cook something up faster than I can, but otherwise I don't really want
to make that a hard requirement.

Brian

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d1a86aea1a7a..6ed8dc8dcdd9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1363,6 +1363,10 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
+	bool has_folios = !!(iter->iomap.flags & IOMAP_F_HAS_FOLIOS);
+
+	if (WARN_ON_ONCE(has_folios && srcmap->type != IOMAP_UNWRITTEN))
+		return -EIO;
 
 	/*
 	 * We must zero subranges of unwritten mappings that might be dirty in
@@ -1381,7 +1385,8 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	 * post-eof ranges can be dirtied via mapped write and the flush
 	 * triggers writeback time post-eof zeroing.
 	 */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
+	if (!has_folios &&
+	    (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)) {
 		if (*range_dirty) {
 			*range_dirty = false;
 			return iomap_zero_iter_flush_and_stale(iter);
@@ -1395,8 +1400,28 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 		int status;
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
+		size_t skipped = 0;
 		bool ret;
 
+		if (has_folios) {
+			folio = folio_batch_next(&iter->iomap.fbatch);
+			if (!folio) {
+				written += length;
+				break;
+			}
+			folio_get(folio);
+			folio_lock(folio);
+			if (pos < folio_pos(folio)) {
+				skipped = folio_pos(folio) - pos;
+				if (WARN_ON_ONCE(skipped >= length))
+					break;
+				pos += skipped;
+				length -= skipped;
+				bytes = min_t(u64, SIZE_MAX, length);
+			} else
+				WARN_ON_ONCE(pos >= folio_pos(folio) + folio_size(folio));
+		}
+
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (status)
 			return status;
@@ -1417,7 +1442,7 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 
 		pos += bytes;
 		length -= bytes;
-		written += bytes;
+		written += bytes + skipped;
 	} while (length > 0);
 
 	if (did_zero)
@@ -1425,6 +1450,50 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	return written;
 }
 
+loff_t
+iomap_fill_dirty_folios(
+	struct inode		*inode,
+	struct iomap		*iomap,
+	loff_t			offset,
+	loff_t			length)
+{
+	struct address_space	*mapping = inode->i_mapping;
+	struct folio_batch	fbatch;
+	pgoff_t			start, end;
+	loff_t			end_pos;
+
+	folio_batch_init(&fbatch);
+	folio_batch_init(&iomap->fbatch);
+
+	end_pos = offset + length;
+	start = offset >> PAGE_SHIFT;
+	end = (end_pos - 1) >> PAGE_SHIFT;
+
+	while (filemap_get_folios(mapping, &start, end, &fbatch) &&
+	       folio_batch_space(&iomap->fbatch)) {
+		struct folio *folio;
+		while ((folio = folio_batch_next(&fbatch))) {
+			if (folio_trylock(folio)) {
+				bool clean = !folio_test_dirty(folio) &&
+					     !folio_test_writeback(folio);
+				folio_unlock(folio);
+				if (clean)
+					continue;
+			}
+
+			folio_get(folio);
+			if (!folio_batch_add(&iomap->fbatch, folio)) {
+				end_pos = folio_pos(folio) + folio_size(folio);
+				break;
+			}
+		}
+		folio_batch_release(&fbatch);
+	}
+
+	return end_pos;
+}
+EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios);
+
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops)
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index 79a0614eaab7..3c87b5c2b1d9 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -25,6 +25,8 @@  static inline int iomap_iter_advance(struct iomap_iter *iter)
 
 	/* handle the previous iteration (if any) */
 	if (iter->iomap.length) {
+		if (iter->iomap.flags & IOMAP_F_HAS_FOLIOS)
+			folio_batch_release(&iter->iomap.fbatch);
 		if (iter->processed < 0)
 			return iter->processed;
 		if (!iter->processed && !stale)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 27048ec10e1c..7e9d86c3defa 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -9,6 +9,7 @@ 
 #include <linux/types.h>
 #include <linux/mm_types.h>
 #include <linux/blkdev.h>
+#include <linux/pagevec.h>
 
 struct address_space;
 struct fiemap_extent_info;
@@ -77,6 +78,7 @@  struct vm_fault;
  */
 #define IOMAP_F_SIZE_CHANGED	(1U << 8)
 #define IOMAP_F_STALE		(1U << 9)
+#define IOMAP_F_HAS_FOLIOS	(1U << 10)
 
 /*
  * Flags from 0x1000 up are for file system specific usage:
@@ -102,6 +104,7 @@  struct iomap {
 	void			*inline_data;
 	void			*private; /* filesystem private */
 	const struct iomap_folio_ops *folio_ops;
+	struct folio_batch	fbatch;
 	u64			validity_cookie; /* used with .iomap_valid() */
 };
 
@@ -301,6 +304,8 @@  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
+loff_t iomap_fill_dirty_folios(struct inode *inode, struct iomap *iomap,
+		loff_t offset, loff_t length);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops);
 int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,