diff mbox series

[v5,07/11] iomap: fix iomap_dio_zero() for fs bs > system page size

Message ID 20240503095353.3798063-8-mcgrof@kernel.org (mailing list archive)
State New, archived
Headers show
Series enable bs > ps in XFS | expand

Commit Message

Luis Chamberlain May 3, 2024, 9:53 a.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page
next to zero page(as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox May 7, 2024, 4 p.m. UTC | #1
On Fri, May 03, 2024 at 02:53:49AM -0700, Luis Chamberlain wrote:
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);

If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
will fail silently.  I hate this interface.

You should be doing something like ...

	while (len) {
		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);

		while (!bio || bio_add_page() < io_len) {
			if (bio)
				iomap_dio_submit_bio(iter, dio, bio, pos);
			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
		 	fscrypt_set_bio_crypt_ctx(bio, inode,
					pos >> inode->i_blkbits, GFP_KERNEL);
		}
	}
Christoph Hellwig May 7, 2024, 4:10 p.m. UTC | #2
On Tue, May 07, 2024 at 05:00:28PM +0100, Matthew Wilcox wrote:
> If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
> will fail silently.  I hate this interface.

No, it won't.  You can pass an arbitray len to it.

> 
> You should be doing something like ...
> 
> 	while (len) {
> 		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> 
> 		while (!bio || bio_add_page() < io_len) {
> 			if (bio)
> 				iomap_dio_submit_bio(iter, dio, bio, pos);
> 			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> 					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 		 	fscrypt_set_bio_crypt_ctx(bio, inode,
> 					pos >> inode->i_blkbits, GFP_KERNEL);
> 		}
> 	}

Wee, no.  The right way is:

	bio = iomap_dio_alloc_bio(iter, dio, 1,
			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
	__bio_add_page(bio, page, len, 0);

	fscrypt_set_bio_crypt_ctx(bio, inode,
			pos >> inode->i_blkbits, GFP_KERNEL);

(or even better the folio version)
Matthew Wilcox May 7, 2024, 4:11 p.m. UTC | #3
On Tue, May 07, 2024 at 09:10:15AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 05:00:28PM +0100, Matthew Wilcox wrote:
> > If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
> > will fail silently.  I hate this interface.
> 
> No, it won't.  You can pass an arbitray len to it.
> 
> > 
> > You should be doing something like ...
> > 
> > 	while (len) {
> > 		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> > 
> > 		while (!bio || bio_add_page() < io_len) {
> > 			if (bio)
> > 				iomap_dio_submit_bio(iter, dio, bio, pos);
> > 			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > 					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > 		 	fscrypt_set_bio_crypt_ctx(bio, inode,
> > 					pos >> inode->i_blkbits, GFP_KERNEL);
> > 		}
> > 	}
> 
> Wee, no.  The right way is:
> 
> 	bio = iomap_dio_alloc_bio(iter, dio, 1,
> 			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 	__bio_add_page(bio, page, len, 0);

no?  len can be > PAGE_SIZE.  and that can be true in the folio version
too, because we won't necessarily be able to allocate the THP.

> 	fscrypt_set_bio_crypt_ctx(bio, inode,
> 			pos >> inode->i_blkbits, GFP_KERNEL);
> 
> (or even better the folio version)
Christoph Hellwig May 7, 2024, 4:13 p.m. UTC | #4
On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > 	__bio_add_page(bio, page, len, 0);
> 
> no?  len can be > PAGE_SIZE.

Yes. So what?
Matthew Wilcox May 8, 2024, 4:24 a.m. UTC | #5
On Tue, May 07, 2024 at 09:13:17AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > > 	__bio_add_page(bio, page, len, 0);
> > 
> > no?  len can be > PAGE_SIZE.
> 
> Yes. So what?

the zero_page is only PAGE_SIZE bytes long.  so you'd be writing
from the page that's after the zero page, whatever contents that has.
Pankaj Raghav (Samsung) May 8, 2024, 11:20 a.m. UTC | #6
On Tue, May 07, 2024 at 05:00:28PM +0100, Matthew Wilcox wrote:
> On Fri, May 03, 2024 at 02:53:49AM -0700, Luis Chamberlain wrote:
> > +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> >  				  GFP_KERNEL);
> > +
> >  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> >  	bio->bi_private = dio;
> >  	bio->bi_end_io = iomap_dio_bio_end_io;
> >  
> > -	__bio_add_page(bio, page, len, 0);
> > +	while (len) {
> > +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> > +
> > +		__bio_add_page(bio, page, io_len, 0);
> > +		len -= io_len;
> > +	}
> >  	iomap_dio_submit_bio(iter, dio, bio, pos);
> 
> If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
> will fail silently.  I hate this interface.

I added a WARN_ON_ONCE() at the start of the function so that it does
not silently fail:
	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));

This function is used to do only sub block zeroing, and I don't think we will
cross 1MB block size in the forseeable future, and even if we do, we have
this to warn us about so that it can be changed?

> 
> You should be doing something like ...
> 
> 	while (len) {
> 		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> 
> 		while (!bio || bio_add_page() < io_len) {
> 			if (bio)
> 				iomap_dio_submit_bio(iter, dio, bio, pos);
> 			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> 					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 		 	fscrypt_set_bio_crypt_ctx(bio, inode,
> 					pos >> inode->i_blkbits, GFP_KERNEL);
> 		}
> 	}
Pankaj Raghav (Samsung) May 8, 2024, 11:22 a.m. UTC | #7
On Wed, May 08, 2024 at 05:24:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 09:13:17AM -0700, Christoph Hellwig wrote:
> > On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > > > 	__bio_add_page(bio, page, len, 0);
> > > 
> > > no?  len can be > PAGE_SIZE.
> > 
> > Yes. So what?
> 
> the zero_page is only PAGE_SIZE bytes long.  so you'd be writing
> from the page that's after the zero page, whatever contents that has.

This is the exact reason this patch was added. We were writing the
garbage value past the PAGE_SIZE for LBS that led to FS corruption. Not
an issue where FS block size <= page size.
Christoph Hellwig May 8, 2024, 11:36 a.m. UTC | #8
On Wed, May 08, 2024 at 05:24:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 09:13:17AM -0700, Christoph Hellwig wrote:
> > On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > > > 	__bio_add_page(bio, page, len, 0);
> > > 
> > > no?  len can be > PAGE_SIZE.
> > 
> > Yes. So what?
> 
> the zero_page is only PAGE_SIZE bytes long.  so you'd be writing
> from the page that's after the zero page, whatever contents that has.

Except that the whole point of the exercise is to use the huge folio
so that we don't run past the end of the zero page.  Yes, if we use
ZERO_PAGE we need to chunk things up and use bio_add_page instead
bio_page, check the return value and potentially deal with multiple
bios.  I'd rather avoid that, though.
diff mbox series

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..5f481068de5b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@  static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }