Message ID: 20240507145811.52987-1-kernel@pankajraghav.com (mailing list archive)
State: New, archived
Series: [RFC] iomap: use huge zero folio in iomap_dio_zero
On 7 May 2024, at 10:58, Pankaj Raghav (Samsung) wrote:

> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> block. Fall back to ZERO_PAGE if mm_get_huge_zero_folio() fails.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> I rebased on top of mm-unstable to get mm_get_huge_zero_folio().
>
> @Christoph is this in line with what you had in mind?
>
>  fs/iomap/direct-io.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5f481068de5b..7f584f9ff2c5 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,11 +236,18 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  		loff_t pos, unsigned len)
>  {
>  	struct inode *inode = file_inode(dio->iocb->ki_filp);
> -	struct page *page = ZERO_PAGE(0);
> +	struct folio *zero_page_folio = page_folio(ZERO_PAGE(0));
> +	struct folio *folio = zero_page_folio;
>  	struct bio *bio;
>
>  	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
>
> +	if (len > PAGE_SIZE) {
> +		folio = mm_get_huge_zero_folio(current->mm);

The huge zero folio is order-9, but it seems here you only require len to
be bigger than PAGE_SIZE. I am not sure it works when len is smaller than
HPAGE_PMD_SIZE.

> +		if (!folio)
> +			folio = zero_page_folio;
> +	}
> +
>  	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
>  			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> @@ -251,10 +258,10 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>
>  	while (len) {
> -		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +		unsigned int size = min(len, folio_size(folio));
>
> -		__bio_add_page(bio, page, io_len, 0);
> -		len -= io_len;
> +		bio_add_folio_nofail(bio, folio, size, 0);
> +		len -= size;

Maybe use the huge zero folio when len > HPAGE_PMD_SIZE and use
ZERO_PAGE(0) for len % HPAGE_PMD_SIZE.

>  	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>  }
> --
> 2.34.1

--
Best Regards,
Yan, Zi
On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> +	if (len > PAGE_SIZE) {
> +		folio = mm_get_huge_zero_folio(current->mm);

I don't think the mm_struct based interfaces work well here, as I/O
completions don't come in through the same mm. You'll want to use
lower level interfaces like get_huge_zero_page and use them at
mount time.

> +		if (!folio)
> +			folio = zero_page_folio;

And then don't bother with a fallback.
On Tue, May 07, 2024 at 09:11:51AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > +	if (len > PAGE_SIZE) {
> > +		folio = mm_get_huge_zero_folio(current->mm);
>
> I don't think the mm_struct based interfaces work well here, as I/O
> completions don't come in through the same mm. You'll want to use
> lower level interfaces like get_huge_zero_page and use them at
> mount time.

At the moment, we can get a reference to the huge zero folio only through
the mm interface.

Even if we change the lower level interface to return a THP, it can still
fail at mount time and we will need the fallback, right?

> > +		if (!folio)
> > +			folio = zero_page_folio;
>
> And then don't bother with a fallback.
On Wed, May 08, 2024 at 11:39:49AM +0000, Pankaj Raghav (Samsung) wrote:
> At the moment, we can get a reference to the huge zero folio only through
> the mm interface.
>
> Even if change the lower level interface to return THP, it can still fail
> at the mount time and we will need the fallback right?

Well, that's why I suggest doing it at mount time. Asking for it deep
down in the write code is certainly going to be a bit problematic.
On Wed, May 08, 2024 at 04:43:54AM -0700, Christoph Hellwig wrote:
> On Wed, May 08, 2024 at 11:39:49AM +0000, Pankaj Raghav (Samsung) wrote:
> > At the moment, we can get a reference to the huge zero folio only through
> > the mm interface.
> >
> > Even if change the lower level interface to return THP, it can still fail
> > at the mount time and we will need the fallback right?
>
> Well, that's why I suggest doing it at mount time. Asking for it deep
> down in the write code is certainly going to be a bit problematic.

Makes sense. But failing to mount because we can't get a huge zero folio
seems wrong, as we still can't guarantee it even at mount time.

With the current infrastructure I don't see any way of getting a huge
zero folio that is guaranteed, so that we don't need any fallback. Let me
know what you think.
On Thu, May 09, 2024 at 12:31:07PM +0000, Pankaj Raghav (Samsung) wrote:
> > Well, that's why I suggest doing it at mount time. Asking for it deep
> > down in the write code is certainly going to be a bit problematic.
>
> Makes sense. But failing to mount because we can't get a huge zero folio
> seems wrong as we still can't guarantee it even at mount time.
>
> With the current infrastructure I don't see anyway of geting a huge zero
> folio that is guaranteed so that we don't need any fallback.

You export get_huge_zero_page, put_huge_zero_page (they might need a
rename and kerneldoc for the final version) and huge_zero_folio or a
wrapper to get it, and then call get_huge_zero_page from mount, from
unmount and just use huge_zero_folio which is guaranteed to exist once
get_huge_zero_page succeeded.
On Thu, May 09, 2024 at 05:46:55AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 12:31:07PM +0000, Pankaj Raghav (Samsung) wrote:
> > > Well, that's why I suggest doing it at mount time. Asking for it deep
> > > down in the write code is certainly going to be a bit problematic.
> >
> > Makes sense. But failing to mount because we can't get a huge zero folio
> > seems wrong as we still can't guarantee it even at mount time.
> >
> > With the current infrastructure I don't see anyway of geting a huge zero
> > folio that is guaranteed so that we don't need any fallback.
>
> You export get_huge_zero_page, put_huge_zero_page (they might need a
> rename and kerneldoc for the final version) and huge_zero_folio or a
> wrapper to get it, and then call get_huge_zero_page from mount,

static bool get_huge_zero_page(void)
{
	struct folio *zero_folio;
retry:
	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
		return true;

	zero_folio = folio_alloc((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
			HPAGE_PMD_ORDER);
	if (!zero_folio) {

We might still fail here during mount. My question is: do we also fail
the mount if folio_alloc fails?

		count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
		return false;
	}
	...

> from unmount and just use huge_zero_folio which is guaranteed to
> exist once get_huge_zero_page succeeded.
On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> We might still fail here during mount. My question is: do we also fail
> the mount if folio_alloc fails?

Yes. Like any other allocation that fails at mount time.
On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > We might still fail here during mount. My question is: do we also fail
> > the mount if folio_alloc fails?
>
> Yes. Like any other allocation that fails at mount time.

How hard is it to fall back to the regular zero-page if you can't
allocate the zero-hugepage? I think most sysadmins would rather mount
with reduced zeroing performance than get an ENOMEM.

--D
On Thu, May 09, 2024 at 07:32:50AM -0700, Darrick J. Wong wrote:
> On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> > On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > > We might still fail here during mount. My question is: do we also fail
> > > the mount if folio_alloc fails?
> >
> > Yes. Like any other allocation that fails at mount time.
>
> How hard is it to fallback to regular zero-page if you can't allocate
> the zero-hugepage?

We'd need the bio allocation and bio_add_page loop. Not the end of the
world, but also a bit annoying. If we do that we might as well just do
it unconditionally.

> I think most sysadmins would rather mount with
> reduced zeroing performance than get an ENOMEM.

If you don't have 2MB free for the zero huge folio, are you going to
do useful things with your large block size XFS file system, which
only makes sense for giant storage sizes?
On Thu, May 09, 2024 at 08:05:24AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 07:32:50AM -0700, Darrick J. Wong wrote:
> > How hard is it to fallback to regular zero-page if you can't allocate
> > the zero-hugepage?
>
> We'd need the bio allocation and bio_add_page loop. Not the end
> of the world, but also a bit annoying. If we do that we might as
> well just do it unconditionally.
>
> > I think most sysadmins would rather mount with
> > reduced zeroing performance than get an ENOMEM.
>
> If you don't have 2MB free for the zero huge folio, are you going to
> do useful things with your large block size XFS file system which
> only makes sense for giant storage sizes?

Oh. Right, this is for bs>ps. For that case it makes sense to fail the
mount. I was only thinking about bs<=ps with large folios, where it
doesn't.

(Would we use the zero-hugepage for large folios on a 4k fsblock fs?)

--D
On Thu, May 09, 2024 at 08:08:28AM -0700, Darrick J. Wong wrote:
> Oh. Right, this is for bs>ps. For that case it makes sense to fail the
> mount. I was only thinking about bs<=ps with large folios, where it
> doesn't.
>
> (Would we use the zero-hugepage for large folios on a 4k fsblock fs?)

The direct I/O zeroing code always deals with sub-blocksize amounts.
The buffered zeroing path can deal with larger amounts in a few cases,
but that goes through the page cache anyway.
On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.

So the block people say we're doing this all wrong. We should be
issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
using the ZERO_PAGE if the hardware doesn't natively support
WRITE_ZEROES or a DISCARD that zeroes or ...

I suspect all these places should be checked to see if they can use
WRITE_ZEROES too:

fs/bcachefs/checksum.c:		page_address(ZERO_PAGE(0)), page_len);
fs/bcachefs/io_write.c:		if (bv->bv_page != ZERO_PAGE(0))
fs/bcachefs/quota.c:	if (memcmp(mq, page_address(ZERO_PAGE(0)), sizeof(*mq))) {
fs/cramfs/inode.c:		return page_address(ZERO_PAGE(0));
fs/crypto/bio.c:	ret = bio_add_page(bio, ZERO_PAGE(0), bytes_this_page, 0);
fs/crypto/bio.c:		ZERO_PAGE(0), pages[i],
fs/direct-io.c:		dio->pages[0] = ZERO_PAGE(0);
fs/direct-io.c:		page = ZERO_PAGE(0);
fs/iomap/direct-io.c:	struct page *page = ZERO_PAGE(0);
On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.
>
> So the block people say we're doing this all wrong. We should be
> issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> using the ZERO_PAGE if the hardware doesn't natively support
> WRITE_ZEROES or a DISCARD that zeroes or ...

Wait a second, I think you've gone too far if you're setting the bio op
to REQ_OP_WRITE_ZEROES. The block layer handles the difference only
through the blkdev_issue_zeroout() helper. If you actually submit a bio
with that op to a block device that doesn't support it, you'll just get
a BLK_STS_NOTSUPP error from submit_bio_noacct().
On Tue, May 14, 2024 at 08:34:09PM -0600, Keith Busch wrote:
> On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> > So the block people say we're doing this all wrong. We should be
> > issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> > using the ZERO_PAGE if the hardware doesn't natively support
> > WRITE_ZEROES or a DISCARD that zeroes or ...
>
> Wait a second, I think you've gone too far if you're setting the bio op
> to REQ_OP_WRITE_ZEROES. The block layer handles the difference only
> through the blkdev_issue_zeroout() helper. If you actually submit a bio
> with that op to a block device that doesn't support it, you'll just get
> a BLK_STS_NOTSUPP error from submit_bio_noacct().

Ohh. This is a bit awkward, because this is the iomap direct IO path.
I don't see an obvious way to get the semantics we want with the
current blkdev_issue_zeroout().

For reference, here's the current function:

static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
		loff_t pos, unsigned len)
{
	struct inode *inode = file_inode(dio->iocb->ki_filp);
	struct page *page = ZERO_PAGE(0);
	struct bio *bio;

	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
				  GFP_KERNEL);
	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
	bio->bi_private = dio;
	bio->bi_end_io = iomap_dio_bio_end_io;

	__bio_add_page(bio, page, len, 0);
	iomap_dio_submit_bio(iter, dio, bio, pos);
}

and then:

static void iomap_dio_submit_bio(const struct iomap_iter *iter,
		struct iomap_dio *dio, struct bio *bio, loff_t pos)
{
	struct kiocb *iocb = dio->iocb;

	atomic_inc(&dio->ref);

	/* Sync dio can't be polled reliably */
	if ((iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(iocb)) {
		bio_set_polled(bio, iocb);
		WRITE_ONCE(iocb->private, bio);
	}

	if (dio->dops && dio->dops->submit_io)
		dio->dops->submit_io(iter, bio, pos);
	else
		submit_bio(bio);
}

so unless submit_bio() can handle the fallback to "create a new bio
full of zeroes and resubmit it to the device" if the original fails,
we're a little mismatched. I'm not really familiar with either part of
this code, so I don't have much in the way of bright ideas. Perhaps
we go back to the "allocate a large folio at filesystem mount" plan.
On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > block. Fallback to ZERO_PAGE if mm_get_huge_zero_folio() fails.
>
> So the block people say we're doing this all wrong. We should be
> issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> using the ZERO_PAGE if the hardware doesn't natively support
> WRITE_ZEROES or a DISCARD that zeroes or ...

Not sure who "the block people" are, but while this sounds smart it
actually is a really bad idea.

Think about what we are doing here: we zero parts of a file system
block as part of a direct I/O write operation. So the amount is
relatively small, and it is part of a fast path I/O operation. It also
will most likely land on the indirection entry on the device. If you
use a write zeroes it will go down a separate slow path in the device
instead of using the highly optimized write path and slow the whole
operation down. Even worse, there is a chance that it will increase
write amplification because there are now two separate operations
instead of one merged one (either a block layer or device merge).

And I'm not sure which "block layer person" still doesn't understand
that discards do not zero data, but maybe we'll need yet another
education campaign there.
> so unless submit_bio() can handle the fallback to "create a new bio
> full of zeroes and resubmit it to the device" if the original fails,
> we're a little mismatched. I'm not really familiar with either part of
> this code, so I don't have much in the way of bright ideas. Perhaps
> we go back to the "allocate a large folio at filesystem mount" plan.

So one thing that became clear after yesterday's discussion was to *not*
use a PMD page for sub block zeroing, as on some architectures (such as
ARM) we would be using a lot of memory to zero out a 64k FS block.

So Chinner proposed the idea of using the iomap_init function to
allocate a large zero folio that could be used in iomap_dio_zero().

The general agreement was that a 64k large folio is enough for now. We
could always increase it and optimize it in the future when required.
This is a rough prototype of what it might look like:

diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..dad5734b2f75 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,12 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 		get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/buffered-io.c
+ */
+
+extern struct folio *zero_fsb_folio;
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..48235765df7a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+struct folio *zero_fsb_folio;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs)
@@ -1985,8 +1986,15 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
 static int __init iomap_init(void)
 {
+	void *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
+
+	if (!addr)
+		return -ENOMEM;
+
+	zero_fsb_folio = virt_to_folio(addr);
+
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..59a65c3ccf13 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	/*
+	 * The zero folio used is 64k.
+	 */
+	WARN_ON_ONCE(len > (16 * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	bio_add_folio_nofail(bio, zero_fsb_folio, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
> static int __init iomap_init(void)
> {
> +	void *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);

Don't use XFS coding style outside XFS.

kzalloc() does not guarantee page alignment, much less alignment to
a folio. It happens to work today, but that is an implementation
artefact.

> +
> +	if (!addr)
> +		return -ENOMEM;
> +
> +	zero_fsb_folio = virt_to_folio(addr);

We also don't guarantee that calling kzalloc() gives you a virtual
address that can be converted to a folio. You need to allocate a folio
to be sure that you get a folio.

Of course, you don't actually need a folio. You don't need any of the
folio metadata and can just use raw pages.

> +	/*
> +	 * The zero folio used is 64k.
> +	 */
> +	WARN_ON_ONCE(len > (16 * PAGE_SIZE));

PAGE_SIZE is not necessarily 4KiB.

> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);

The point was that we now only need one biovec, not MAX.
On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
> On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
> > +	void *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
>
> Don't use XFS coding style outside XFS.
>
> kzalloc() does not guarantee page alignment much less alignment to
> a folio. It happens to work today, but that is an implementation
> artefact.
>
> > +	zero_fsb_folio = virt_to_folio(addr);
>
> We also don't guarantee that calling kzalloc() gives you a virtual
> address that can be converted to a folio. You need to allocate a folio
> to be sure that you get a folio.
>
> Of course, you don't actually need a folio. You don't need any of the
> folio metadata and can just use raw pages.
>
> > +	WARN_ON_ONCE(len > (16 * PAGE_SIZE));
>
> PAGE_SIZE is not necessarily 4KiB.
>
> > +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > +			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>
> The point was that we now only need one biovec, not MAX.

Thanks for the comments. I think it all makes sense:

diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..e152b77a77e4 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 		get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/buffered-io.c
+ */
+
+#define ZERO_FSB_SIZE (65536)
+#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
+extern struct page *zero_fs_block;
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..36d2f7edd310 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+struct page *zero_fs_block;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs)
@@ -1985,8 +1986,13 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
 static int __init iomap_init(void)
 {
+	zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
+	if (!zero_fs_block)
+		return -ENOMEM;
+
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..50c2bca8a347 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
+	/*
+	 * Max block size supported is 64k
+	 */
+	WARN_ON_ONCE(len > ZERO_FSB_SIZE);
+
 	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	__bio_add_page(bio, zero_fs_block, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
On 5/16/24 17:02, Pankaj Raghav (Samsung) wrote:
> On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
[... review comments snipped ...]
> Thanks for the comments. I think it all makes sense:
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 7ca738904e34..e152b77a77e4 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
>  int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
>  		get_block_t *get_block, const struct iomap *iomap);
>
> +/*
> + * iomap/buffered-io.c
> + */
> +
> +#define ZERO_FSB_SIZE (65536)
> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> +extern struct page *zero_fs_block;
> +
>  /*
>   * char_dev.c
>   */

But why?
We already have a perfectly fine hugepage zero page in huge_memory.c.
Shouldn't we rather export that one and use it?
(Actually I have some patches for doing so...)
We might allocate folios

> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
[... rest of the quoted patch snipped ...]
On 5/17/24 14:36, Hannes Reinecke wrote:
> On 5/16/24 17:02, Pankaj Raghav (Samsung) wrote:
[...]
> But why?
> We already have a perfectly fine hugepage zero page in huge_memory.c.
> Shouldn't we rather export that one and use it?
> (Actually I have some patches for doing so...)
> We might allocate folios

Bah. Hit 'enter' too soon.

We might allocate a zero folio as a fallback if the huge zero page is
not available, but first we should try to use that.

Cheers,

Hannes
On Fri, May 17, 2024 at 02:36:29PM +0200, Hannes Reinecke wrote:
> > +#define ZERO_FSB_SIZE (65536)
> > +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> > +extern struct page *zero_fs_block;
> > +
> >  /*
> >   * char_dev.c
> >   */
> But why?
> We already have a perfectly fine hugepage zero page in huge_memory.c.
> Shouldn't we rather export that one and use it?
> (Actually I have some patches for doing so...)

But we don't necessarily. We only have it if
do_huge_pmd_anonymous_page() satisfies:

	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
			!mm_forbids_zeropage(vma->vm_mm) &&
			transparent_hugepage_use_zero_page()) {

ie we've taken a page fault on a PMD hole in a VMA, that VMA permits
PMD mappings to exist, the page fault was for read, the
forbids-huge-zeropage isn't set for this vma, and using the hugetlb
zero page isn't forbidden.

I'd like to see stats for how much the PMD-zero-page is actually used,
because I suspect it's not really used very much.