[RFC] iomap: use huge zero folio in iomap_dio_zero

Message ID	20240507145811.52987-1-kernel@pankajraghav.com (mailing list archive)
State	New, archived
Headers	show Received: from mout-p-102.mailbox.org (mout-p-102.mailbox.org [80.241.56.152]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C2C9716132B; Tue, 7 May 2024 14:58:49 +0000 (UTC) From: "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> To: hch@lst.de, willy@infradead.org Cc: mcgrof@kernel.org, akpm@linux-foundation.org, brauner@kernel.org, chandan.babu@oracle.com, david@fromorbit.com, djwong@kernel.org, gost.dev@samsung.com, hare@suse.de, john.g.garry@oracle.com, kernel@pankajraghav.com, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-xfs@vger.kernel.org, p.raghav@samsung.com, ritesh.list@gmail.com, ziy@nvidia.com Subject: [RFC] iomap: use huge zero folio in iomap_dio_zero Date: Tue, 7 May 2024 16:58:12 +0200 Message-Id: <20240507145811.52987-1-kernel@pankajraghav.com> In-Reply-To: <20240503095353.3798063-8-mcgrof@kernel.org> References: <20240503095353.3798063-8-mcgrof@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[RFC] iomap: use huge zero folio in iomap_dio_zero \| expand [RFC] iomap: use huge zero folio in iomap_dio_zero

Pankaj Raghav (Samsung) May 7, 2024, 2:58 p.m. UTC

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
I rebased on top of mm-unstable to get mm_get_huge_zero_folio().

@Christoph is this inline with what you had in mind?

 fs/iomap/direct-io.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

Zi Yan May 7, 2024, 3:11 p.m. UTC | #1

On 7 May 2024, at 10:58, Pankaj Raghav (Samsung) wrote:

> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> I rebased on top of mm-unstable to get mm_get_huge_zero_folio().
>
> @Christoph is this inline with what you had in mind?
>
>  fs/iomap/direct-io.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5f481068de5b..7f584f9ff2c5 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,11 +236,18 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  		loff_t pos, unsigned len)
>  {
>  	struct inode *inode = file_inode(dio->iocb->ki_filp);
> -	struct page *page = ZERO_PAGE(0);
> +	struct folio *zero_page_folio = page_folio(ZERO_PAGE(0));
> +	struct folio *folio = zero_page_folio;
>  	struct bio *bio;
>
>  	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
>
> +	if (len > PAGE_SIZE) {
> +		folio = mm_get_huge_zero_folio(current->mm);

The huge zero folio is order-9, but it seems here you only require len to be
bigger than PAGE_SIZE. I am not sure if it works when len is smaller than
HPAGE_PMD_SIZE.

> +		if (!folio)
> +			folio = zero_page_folio;
> +	}
> +
>  	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
>  				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> @@ -251,10 +258,10 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>
>  	while (len) {
> -		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +		unsigned int size = min(len, folio_size(folio));
>
> -		__bio_add_page(bio, page, io_len, 0);
> -		len -= io_len;
> +		bio_add_folio_nofail(bio, folio, size, 0);
> +		len -= size;

Maybe use huge zero folio when len > HPAGE_PMD_SIZE and use ZERO_PAGE(0)
for len % HPAGE_PMD_SIZE.

>  	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>  }
> -- 
> 2.34.1


--
Best Regards,
Yan, Zi

Christoph Hellwig May 7, 2024, 4:11 p.m. UTC | #2

On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> +	if (len > PAGE_SIZE) {
> +		folio = mm_get_huge_zero_folio(current->mm);

I don't think the mm_struct based interfaces work well here, as I/O
completions don't come in through the same mm.  You'll want to use
lower level interfaces like get_huge_zero_page and use them at
mount time.

> +		if (!folio)
> +			folio = zero_page_folio;

And then don't bother with a fallback.

Pankaj Raghav (Samsung) May 8, 2024, 11:39 a.m. UTC | #3

On Tue, May 07, 2024 at 09:11:51AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > +	if (len > PAGE_SIZE) {
> > +		folio = mm_get_huge_zero_folio(current->mm);
> 
> I don't think the mm_struct based interfaces work well here, as I/O
> completions don't come in through the same mm.  You'll want to use
> lower level interfaces like get_huge_zero_page and use them at
> mount time.

At the moment, we can get a reference to the huge zero folio only through
the mm interface. 

Even if change the lower level interface to return THP, it can still fail
at the mount time and we will need the fallback right?

> 
> > +		if (!folio)
> > +			folio = zero_page_folio;
> 
> And then don't bother with a fallback.
>

Christoph Hellwig May 8, 2024, 11:43 a.m. UTC | #4

On Wed, May 08, 2024 at 11:39:49AM +0000, Pankaj Raghav (Samsung) wrote:
> At the moment, we can get a reference to the huge zero folio only through
> the mm interface. 
> 
> Even if change the lower level interface to return THP, it can still fail
> at the mount time and we will need the fallback right?

Well, that's why I suggest doing it at mount time.  Asking for it deep
down in the write code is certainly going to be a bit problematic.

Pankaj Raghav (Samsung) May 9, 2024, 12:31 p.m. UTC | #5

On Wed, May 08, 2024 at 04:43:54AM -0700, Christoph Hellwig wrote:
> On Wed, May 08, 2024 at 11:39:49AM +0000, Pankaj Raghav (Samsung) wrote:
> > At the moment, we can get a reference to the huge zero folio only through
> > the mm interface. 
> > 
> > Even if change the lower level interface to return THP, it can still fail
> > at the mount time and we will need the fallback right?
> 
> Well, that's why I suggest doing it at mount time.  Asking for it deep
> down in the write code is certainly going to be a bit problematic.

Makes sense. But failing to mount because we can't get a huge zero folio
seems wrong as we still can't guarantee it even at mount time.

With the current infrastructure I don't see anyway of geting a huge zero
folio that is guaranteed so that we don't need any fallback.

Let me know what you think.

Christoph Hellwig May 9, 2024, 12:46 p.m. UTC | #6

On Thu, May 09, 2024 at 12:31:07PM +0000, Pankaj Raghav (Samsung) wrote:
> > Well, that's why I suggest doing it at mount time.  Asking for it deep
> > down in the write code is certainly going to be a bit problematic.
> 
> Makes sense. But failing to mount because we can't get a huge zero folio
> seems wrong as we still can't guarantee it even at mount time.
> 
> With the current infrastructure I don't see anyway of geting a huge zero
> folio that is guaranteed so that we don't need any fallback.

You export get_huge_zero_page, put_huge_zero_page (they might need a
rename and kerneldoc for the final version) and huge_zero_folio or a
wrapper to get it, and then call get_huge_zero_page from mount,
from unmount and just use huge_zero_folio which is guaranteed to
exist once get_huge_zero_page succeeded.

Pankaj Raghav (Samsung) May 9, 2024, 12:55 p.m. UTC | #7

On Thu, May 09, 2024 at 05:46:55AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 12:31:07PM +0000, Pankaj Raghav (Samsung) wrote:
> > > Well, that's why I suggest doing it at mount time.  Asking for it deep
> > > down in the write code is certainly going to be a bit problematic.
> > 
> > Makes sense. But failing to mount because we can't get a huge zero folio
> > seems wrong as we still can't guarantee it even at mount time.
> > 
> > With the current infrastructure I don't see anyway of geting a huge zero
> > folio that is guaranteed so that we don't need any fallback.
> 
> You export get_huge_zero_page, put_huge_zero_page (they might need a
> rename and kerneldoc for the final version) and huge_zero_folio or a
> wrapper to get it, and then call get_huge_zero_page from mount,

static bool get_huge_zero_page(void)
{
	struct folio *zero_folio;
retry:
	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
		return true;

	zero_folio = folio_alloc((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
			HPAGE_PMD_ORDER);
	if (!zero_folio) {

We might still fail here during mount. My question is: do we also fail
the mount if folio_alloc fails?

		count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
		return false;
	}

...
> from unmount and just use huge_zero_folio which is guaranteed to
> exist once get_huge_zero_page succeeded.
>

Christoph Hellwig May 9, 2024, 12:58 p.m. UTC | #8

On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> We might still fail here during mount. My question is: do we also fail
> the mount if folio_alloc fails?

Yes.  Like any other allocation that fails at mount time.

Darrick J. Wong May 9, 2024, 2:32 p.m. UTC | #9

On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > We might still fail here during mount. My question is: do we also fail
> > the mount if folio_alloc fails?
> 
> Yes.  Like any other allocation that fails at mount time.

How hard is it to fallback to regular zero-page if you can't allocate
the zero-hugepage?  I think most sysadmins would rather mount with
reduced zeroing performance than get an ENOMEM.

--D

Christoph Hellwig May 9, 2024, 3:05 p.m. UTC | #10

On Thu, May 09, 2024 at 07:32:50AM -0700, Darrick J. Wong wrote:
> On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> > On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > > We might still fail here during mount. My question is: do we also fail
> > > the mount if folio_alloc fails?
> > 
> > Yes.  Like any other allocation that fails at mount time.
> 
> How hard is it to fallback to regular zero-page if you can't allocate
> the zero-hugepage?

We'd need the bio allocation and bio_add_page loop.  Not the end
of the world, but also a bit annoying.  If we do that we might as
well just do it unconditionally.

> I think most sysadmins would rather mount with
> reduced zeroing performance than get an ENOMEM.

If you don't have 2MB free for the zero huge folio, are you going to
do useful things with your large block size XFS file system which
only makes sense for giant storage sizes?

Darrick J. Wong May 9, 2024, 3:08 p.m. UTC | #11

On Thu, May 09, 2024 at 08:05:24AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 07:32:50AM -0700, Darrick J. Wong wrote:
> > On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> > > On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > We might still fail here during mount. My question is: do we also fail
> > > > the mount if folio_alloc fails?
> > > 
> > > Yes.  Like any other allocation that fails at mount time.
> > 
> > How hard is it to fallback to regular zero-page if you can't allocate
> > the zero-hugepage?
> 
> We'd need the bio allocation and bio_add_page loop.  Not the end
> of the world, but also a bit annoying.  If we do that we might as
> well just do it unconditionally.
> 
> > I think most sysadmins would rather mount with
> > reduced zeroing performance than get an ENOMEM.
> 
> If you don't have 2MB free for the zero huge folio, are you going to
> do useful things with your large block size XFS file system which
> only makes sense for giant storage sizes?

Oh.  Right, this is for bs>ps.  For that case it makes sense to fail the
mount.  I was only thinking about bs<=ps with large folios, where it
doesn't.

(Would we use the zero-hugepage for large folios on a 4k fsblock fs?)

--D

Christoph Hellwig May 9, 2024, 3:09 p.m. UTC | #12

On Thu, May 09, 2024 at 08:08:28AM -0700, Darrick J. Wong wrote:
> Oh.  Right, this is for bs>ps.  For that case it makes sense to fail the
> mount.  I was only thinking about bs<=ps with large folios, where it
> doesn't.
> 
> (Would we use the zero-hugepage for large folios on a 4k fsblock fs?)

The direct I/O zeroing code always deals with sub-blocksize amounts.
The buffered zeroing path can deal with larger amounts in a few
cases, but that goes through the page cache anyway.

Matthew Wilcox May 15, 2024, 12:50 a.m. UTC | #13

On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.

So the block people say we're doing this all wrong.  We should be
issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
using the ZERO_PAGE if the hardware doesn't natively support
WRITE_ZEROES or a DISCARD that zeroes or ...


I suspect all these places should be checked to see if they can use
WRITE_ZEROES too:

fs/bcachefs/checksum.c:                         page_address(ZERO_PAGE(0)), page_len);
fs/bcachefs/io_write.c:         if (bv->bv_page != ZERO_PAGE(0))
fs/bcachefs/quota.c:            if (memcmp(mq, page_address(ZERO_PAGE(0)), sizeof(*mq))) {
fs/cramfs/inode.c:              return page_address(ZERO_PAGE(0));
fs/crypto/bio.c:                ret = bio_add_page(bio, ZERO_PAGE(0), bytes_this_page, 0);
fs/crypto/bio.c:                                                      ZERO_PAGE(0), pages[i],
fs/direct-io.c:         dio->pages[0] = ZERO_PAGE(0);
fs/direct-io.c: page = ZERO_PAGE(0);
fs/iomap/direct-io.c:   struct page *page = ZERO_PAGE(0);

Keith Busch May 15, 2024, 2:34 a.m. UTC | #14

On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.
> 
> So the block people say we're doing this all wrong.  We should be
> issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> using the ZERO_PAGE if the hardware doesn't natively support
> WRITE_ZEROES or a DISCARD that zeroes or ...

Wait a second, I think you've gone too far if you're setting the bio op
to REQ_OP_WRITE_ZEROES. The block layer handles the difference only
through the blkdev_issue_zeroout() helper. If you actually submit a bio
with that op to a block device that doesn't support it, you'll just get
a BLK_STS_NOTSUPP error from submit_bio_noacct().

Matthew Wilcox May 15, 2024, 4:04 a.m. UTC | #15

On Tue, May 14, 2024 at 08:34:09PM -0600, Keith Busch wrote:
> On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> > On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > > block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.
> > 
> > So the block people say we're doing this all wrong.  We should be
> > issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> > using the ZERO_PAGE if the hardware doesn't natively support
> > WRITE_ZEROES or a DISCARD that zeroes or ...
> 
> Wait a second, I think you've gone too far if you're setting the bio op
> to REQ_OP_WRITE_ZEROES. The block layer handles the difference only
> through the blkdev_issue_zeroout() helper. If you actually submit a bio
> with that op to a block device that doesn't support it, you'll just get
> a BLK_STS_NOTSUPP error from submit_bio_noacct().

Ohh.  This is a bit awkward, because this is the iomap direct IO path.
I don't see an obvious way to get the semantics we want with the current
blkdev_issue_zeroout().  For reference, here's the current function:

static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
                loff_t pos, unsigned len)
{
        struct inode *inode = file_inode(dio->iocb->ki_filp);
        struct page *page = ZERO_PAGE(0);
        struct bio *bio;

        bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
        fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
                                  GFP_KERNEL);
        bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
        bio->bi_private = dio;
        bio->bi_end_io = iomap_dio_bio_end_io;

        __bio_add_page(bio, page, len, 0);
        iomap_dio_submit_bio(iter, dio, bio, pos);
}

and then:

static void iomap_dio_submit_bio(const struct iomap_iter *iter,
                struct iomap_dio *dio, struct bio *bio, loff_t pos)
{
        struct kiocb *iocb = dio->iocb;

        atomic_inc(&dio->ref);

        /* Sync dio can't be polled reliably */
        if ((iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(iocb)) {
                bio_set_polled(bio, iocb);
                WRITE_ONCE(iocb->private, bio);
        }

        if (dio->dops && dio->dops->submit_io)
                dio->dops->submit_io(iter, bio, pos);
        else
                submit_bio(bio);
}

so unless submit_bio() can handle the fallback to "create a new bio
full of zeroes and resubmit it to the device" if the original fails,
we're a little mismatched.  I'm not really familiar with either part of
this code, so I don't have much in the way of bright ideas.  Perhaps
we go back to the "allocate a large folio at filesystem mount" plan.

Christoph Hellwig May 15, 2024, 11:48 a.m. UTC | #16

On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.
> 
> So the block people say we're doing this all wrong.  We should be
> issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> using the ZERO_PAGE if the hardware doesn't natively support
> WRITE_ZEROES or a DISCARD that zeroes or ...

Not sure who "the block people" are, but while this sounds smart
it actually is a really bad idea.

Think about what we are doing here, we zero parts of a file system
block as part of a direct I/O write operation.  So the amount is
relatively small and it is part of a fast path I/O operation.  It
also will most likely land on the indirection entry on the device.

If you use a write zeroes it will go down a separate slow path in
the device instead of using the highly optimized write path and
slow the whole operation down.  Even worse there are chances that
it will increase write amplification because there are two separate
operations now instead of one merged one (either a block layer or
device merge).

And I'm not sure what "block layer person" still doesn't understand
that discard do not zero data, but maybe we'll need yet another
education campaign there.

Pankaj Raghav (Samsung) May 15, 2024, 3:59 p.m. UTC | #17

> so unless submit_bio() can handle the fallback to "create a new bio
> full of zeroes and resubmit it to the device" if the original fails,
> we're a little mismatched.  I'm not really familiar with either part of
> this code, so I don't have much in the way of bright ideas.  Perhaps
> we go back to the "allocate a large folio at filesystem mount" plan.

So one thing that became clear after yesterday's discussion was to
**not** use a PMD page for sub block zeroing as in some architectures
we will be using a lot of memory (such as ARM) to zero out a 64k FS block.

So Chinner proposed the idea of using iomap_init function to alloc
large zero folio that could be used in iomap_dio_zero().

The general agreement was 64k large folio is enough for now. We could
always increase it and optimize it in the future when required.

This is a rough prototype of what it might look like:

diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..dad5734b2f75 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,12 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/buffered-io.c
+ */
+
+extern struct folio *zero_fsb_folio;
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..48235765df7a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+struct folio *zero_fsb_folio;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
                struct iomap_folio_state *ifs)
@@ -1985,8 +1986,15 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
+
 static int __init iomap_init(void)
 {
+       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
+
+       if (!addr)
+               return -ENOMEM;
+
+       zero_fsb_folio = virt_to_folio(addr);
        return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
                           offsetof(struct iomap_ioend, io_bio),
                           BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..59a65c3ccf13 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
                loff_t pos, unsigned len)
 {
        struct inode *inode = file_inode(dio->iocb->ki_filp);
-       struct page *page = ZERO_PAGE(0);
        struct bio *bio;
 
-       bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+       /*
+        * The zero folio used is 64k.
+        */
+       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
+
+       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
        fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
                                  GFP_KERNEL);
+
        bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
        bio->bi_private = dio;
        bio->bi_end_io = iomap_dio_bio_end_io;
 
-       __bio_add_page(bio, page, len, 0);
+       bio_add_folio_nofail(bio, zero_fsb_folio, len, 0);
        iomap_dio_submit_bio(iter, dio, bio, pos);
 }

Matthew Wilcox May 15, 2024, 6:03 p.m. UTC | #18

On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
>  static int __init iomap_init(void)
>  {
> +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);

Don't use XFS coding style outside XFS.

kzalloc() does not guarantee page alignment much less alignment to
a folio.  It happens to work today, but that is an implementation
artefact.

> +
> +       if (!addr)
> +               return -ENOMEM;
> +
> +       zero_fsb_folio = virt_to_folio(addr);

We also don't guarantee that calling kzalloc() gives you a virtual
address that can be converted to a folio.  You need to allocate a folio
to be sure that you get a folio.

Of course, you don't actually need a folio.  You don't need any of the
folio metadata and can just use raw pages.

> +       /*
> +        * The zero folio used is 64k.
> +        */
> +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));

PAGE_SIZE is not necessarily 4KiB.

> +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);

The point was that we now only need one biovec, not MAX.

Pankaj Raghav (Samsung) May 16, 2024, 3:02 p.m. UTC | #19

On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
> On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
> >  static int __init iomap_init(void)
> >  {
> > +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
> 
> Don't use XFS coding style outside XFS.
> 
> kzalloc() does not guarantee page alignment much less alignment to
> a folio.  It happens to work today, but that is an implementation
> artefact.
> 
> > +
> > +       if (!addr)
> > +               return -ENOMEM;
> > +
> > +       zero_fsb_folio = virt_to_folio(addr);
> 
> We also don't guarantee that calling kzalloc() gives you a virtual
> address that can be converted to a folio.  You need to allocate a folio
> to be sure that you get a folio.
> 
> Of course, you don't actually need a folio.  You don't need any of the
> folio metadata and can just use raw pages.
> 
> > +       /*
> > +        * The zero folio used is 64k.
> > +        */
> > +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
> 
> PAGE_SIZE is not necessarily 4KiB.
> 
> > +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 
> The point was that we now only need one biovec, not MAX.
> 

Thanks for the comments. I think it all makes sense:

diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..e152b77a77e4 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/buffered-io.c
+ */
+
+#define ZERO_FSB_SIZE (65536)
+#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
+extern struct page *zero_fs_block;
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..36d2f7edd310 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+struct page *zero_fs_block;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
                struct iomap_folio_state *ifs)
@@ -1985,8 +1986,13 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
+
 static int __init iomap_init(void)
 {
+       zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
+       if (!zero_fs_block)
+               return -ENOMEM;
+
        return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
                           offsetof(struct iomap_ioend, io_bio),
                           BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..50c2bca8a347 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
                loff_t pos, unsigned len)
 {
        struct inode *inode = file_inode(dio->iocb->ki_filp);
-       struct page *page = ZERO_PAGE(0);
        struct bio *bio;
 
+       /*
+        * Max block size supported is 64k
+        */
+       WARN_ON_ONCE(len > ZERO_FSB_SIZE);
+
        bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
        fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
                                  GFP_KERNEL);
+
        bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
        bio->bi_private = dio;
        bio->bi_end_io = iomap_dio_bio_end_io;
 
-       __bio_add_page(bio, page, len, 0);
+       __bio_add_page(bio, zero_fs_block, len, 0);
        iomap_dio_submit_bio(iter, dio, bio, pos);
 }

Hannes Reinecke May 17, 2024, 12:36 p.m. UTC | #20

On 5/16/24 17:02, Pankaj Raghav (Samsung) wrote:
> On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
>> On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
>>>   static int __init iomap_init(void)
>>>   {
>>> +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
>>
>> Don't use XFS coding style outside XFS.
>>
>> kzalloc() does not guarantee page alignment much less alignment to
>> a folio.  It happens to work today, but that is an implementation
>> artefact.
>>
>>> +
>>> +       if (!addr)
>>> +               return -ENOMEM;
>>> +
>>> +       zero_fsb_folio = virt_to_folio(addr);
>>
>> We also don't guarantee that calling kzalloc() gives you a virtual
>> address that can be converted to a folio.  You need to allocate a folio
>> to be sure that you get a folio.
>>
>> Of course, you don't actually need a folio.  You don't need any of the
>> folio metadata and can just use raw pages.
>>
>>> +       /*
>>> +        * The zero folio used is 64k.
>>> +        */
>>> +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
>>
>> PAGE_SIZE is not necessarily 4KiB.
>>
>>> +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
>>> +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>>
>> The point was that we now only need one biovec, not MAX.
>>
> 
> Thanks for the comments. I think it all makes sense:
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index 7ca738904e34..e152b77a77e4 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
>   int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
>                  get_block_t *get_block, const struct iomap *iomap);
>   
> +/*
> + * iomap/buffered-io.c
> + */
> +
> +#define ZERO_FSB_SIZE (65536)
> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> +extern struct page *zero_fs_block;
> +
>   /*
>    * char_dev.c
>    */
But why?
We already have a perfectly fine hugepage zero page in huge_memory.c. 
Shouldn't we rather export that one and use it?
(Actually I have some patches for doing so...)
We might allocate folios
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4e8e41c8b3c0..36d2f7edd310 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -42,6 +42,7 @@ struct iomap_folio_state {
>   };
>   
>   static struct bio_set iomap_ioend_bioset;
> +struct page *zero_fs_block;
>   
>   static inline bool ifs_is_fully_uptodate(struct folio *folio,
>                  struct iomap_folio_state *ifs)
> @@ -1985,8 +1986,13 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
>   }
>   EXPORT_SYMBOL_GPL(iomap_writepages);
>   
> +
>   static int __init iomap_init(void)
>   {
> +       zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
> +       if (!zero_fs_block)
> +               return -ENOMEM;
> +
>          return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
>                             offsetof(struct iomap_ioend, io_bio),
>                             BIOSET_NEED_BVECS);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..50c2bca8a347 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,17 +236,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>                  loff_t pos, unsigned len)
>   {
>          struct inode *inode = file_inode(dio->iocb->ki_filp);
> -       struct page *page = ZERO_PAGE(0);
>          struct bio *bio;
>   
> +       /*
> +        * Max block size supported is 64k
> +        */
> +       WARN_ON_ONCE(len > ZERO_FSB_SIZE);
> +
>          bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>          fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>                                    GFP_KERNEL);
> +
>          bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>          bio->bi_private = dio;
>          bio->bi_end_io = iomap_dio_bio_end_io;
>   
> -       __bio_add_page(bio, page, len, 0);
> +       __bio_add_page(bio, zero_fs_block, len, 0);
>          iomap_dio_submit_bio(iter, dio, bio, pos);
>   }
>

Hannes Reinecke May 17, 2024, 12:56 p.m. UTC | #21

On 5/17/24 14:36, Hannes Reinecke wrote:
> On 5/16/24 17:02, Pankaj Raghav (Samsung) wrote:
>> On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
>>> On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
>>>>   static int __init iomap_init(void)
>>>>   {
>>>> +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
>>>
>>> Don't use XFS coding style outside XFS.
>>>
>>> kzalloc() does not guarantee page alignment much less alignment to
>>> a folio.  It happens to work today, but that is an implementation
>>> artefact.
>>>
>>>> +
>>>> +       if (!addr)
>>>> +               return -ENOMEM;
>>>> +
>>>> +       zero_fsb_folio = virt_to_folio(addr);
>>>
>>> We also don't guarantee that calling kzalloc() gives you a virtual
>>> address that can be converted to a folio.  You need to allocate a folio
>>> to be sure that you get a folio.
>>>
>>> Of course, you don't actually need a folio.  You don't need any of the
>>> folio metadata and can just use raw pages.
>>>
>>>> +       /*
>>>> +        * The zero folio used is 64k.
>>>> +        */
>>>> +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
>>>
>>> PAGE_SIZE is not necessarily 4KiB.
>>>
>>>> +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
>>>> +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>>>
>>> The point was that we now only need one biovec, not MAX.
>>>
>>
>> Thanks for the comments. I think it all makes sense:
>>
>> diff --git a/fs/internal.h b/fs/internal.h
>> index 7ca738904e34..e152b77a77e4 100644
>> --- a/fs/internal.h
>> +++ b/fs/internal.h
>> @@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
>>   int __block_write_begin_int(struct folio *folio, loff_t pos, 
>> unsigned len,
>>                  get_block_t *get_block, const struct iomap *iomap);
>> +/*
>> + * iomap/buffered-io.c
>> + */
>> +
>> +#define ZERO_FSB_SIZE (65536)
>> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
>> +extern struct page *zero_fs_block;
>> +
>>   /*
>>    * char_dev.c
>>    */
> But why?
> We already have a perfectly fine hugepage zero page in huge_memory.c. 
> Shouldn't we rather export that one and use it?
> (Actually I have some patches for doing so...)
> We might allocate folios

Bah. Hit 'enter' too soon.

We might allocate a zero folio as a fallback if the huge zero page is 
not available, but first we should try to use that.

Cheers,

Hannes

Matthew Wilcox May 17, 2024, 1:30 p.m. UTC | #22

On Fri, May 17, 2024 at 02:36:29PM +0200, Hannes Reinecke wrote:
> > +#define ZERO_FSB_SIZE (65536)
> > +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> > +extern struct page *zero_fs_block;
> > +
> >   /*
> >    * char_dev.c
> >    */
> But why?
> We already have a perfectly fine hugepage zero page in huge_memory.c.
> Shouldn't we rather export that one and use it?
> (Actually I have some patches for doing so...)

But we don't necessarily.  We only have it if
do_huge_pmd_anonymous_page() satisfies:

        if (!(vmf->flags & FAULT_FLAG_WRITE) &&
                        !mm_forbids_zeropage(vma->vm_mm) &&
                        transparent_hugepage_use_zero_page()) {

ie we've taken a page fault on a PMD hole in a VMA, that VMA permits
PMD mappings to exist, the page fault was for read, the
forbids-huge-zeropage isn't set for this vma, and using the hugetlb zero
page isn't forbidden.

I'd like to see stats for how much the PMD-zero-page is actually used,
because I suspect it's not really used very much.

[RFC] iomap: use huge zero folio in iomap_dio_zero

Commit Message

Comments

Patch