Message ID | 20240425113746.335530-8-kernel@pankajraghav.com (mailing list archive)
---|---
State | Superseded, archived |
Series | enable bs > ps in XFS
On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs
> block size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size, this will send the contents of the page
> next to the zero page (as len > PAGE_SIZE) to the underlying block
> device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.

So what happened to the plan to make huge_zero_page a folio and have
it available for non-hugetlb setups?  Not only would this be cleaner
and more efficient, but it would actually work for the case where you'd
have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
seem impossible with 2MB folios.

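For context, this is roughly the pre-patch iomap_dio_zero() under discussion, reconstructed from the diff hunk at the end of this thread (exact whitespace and surrounding code may differ from the tree). The comment marks where a single bio_vec backed by ZERO_PAGE(0) overruns the zero page once len exceeds PAGE_SIZE:

static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
		loff_t pos, unsigned len)
{
	struct inode *inode = file_inode(dio->iocb->ki_filp);
	struct page *page = ZERO_PAGE(0);
	struct bio *bio;

	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
				  GFP_KERNEL);
	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
	bio->bi_private = dio;
	bio->bi_end_io = iomap_dio_bio_end_io;

	/*
	 * ZERO_PAGE(0) is a single PAGE_SIZE page.  With a 16k block size on
	 * a 4k page system, a sub-block write may need len = 12k of padding,
	 * so this bio_vec reaches 8k past the zero page and whatever memory
	 * happens to follow it is written to disk.
	 */
	__bio_add_page(bio, page, len, 0);
	iomap_dio_submit_bio(iter, dio, bio, pos);
}
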
On Thu, Apr 25, 2024 at 11:22:35PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs
> > block size < page_size. This is true for most filesystems at the moment.
> >
> > If the block size > page size, this will send the contents of the page
> > next to the zero page (as len > PAGE_SIZE) to the underlying block
> > device, causing FS corruption.
> >
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
>
> So what happened to the plan to make huge_zero_page a folio and have
> it available for non-hugetlb setups?  Not only would this be cleaner
> and more efficient, but it would actually work for the case where you'd
> have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> seem impossible with 2MB folios.

I mentioned to Darrick in one of the older series[1] that it was proving
to be a bit complicated (at least for me) to add that support.

Currently, we reserve the ZERO_PAGE during kernel startup
(arch/x86/kernel/head_64.S). Do we go about doing the same by reserving
1 PMD (512 PTEs with base page size) at kernel startup if we want a
zeroed 2MB (for x86) always at our disposal to use for zeroing out?
Because allocating it during runtime will defeat the purpose. Let me
know what you think.

In any case, I would like to pursue the huge_zero_page folio separately
from this series. Also, iomap_dio_zero() only pads a fs block with
zeroes, which should never be > 64k for XFS.

[1] https://lore.kernel.org/linux-fsdevel/5kodxnrvjq5dsjgjfeps6wte774c2sl75bn3fg3hh46q3wkwk5@2tru4htvqmrq/

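A minimal sketch of the "set the zeroed memory aside once, rather than allocating it in the IO path" idea, done from iomap's own init path instead of arch startup code. The helper name, the init hook, and the 64k size (picked because the padding should never exceed 64k for XFS, per the message above) are all hypothetical and not part of the posted series:

#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/sizes.h>

#define IOMAP_ZERO_FSB_SIZE	SZ_64K	/* hypothetical upper bound on fs block padding */

/* hypothetical: zeroed pages set aside once, never allocated in the IO path */
static struct page *iomap_zero_fsb_page;

static int __init iomap_dio_zero_init(void)
{
	/* __GFP_ZERO zeroes the whole higher-order allocation up front */
	iomap_zero_fsb_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
					  get_order(IOMAP_ZERO_FSB_SIZE));
	return iomap_zero_fsb_page ? 0 : -ENOMEM;
}
fs_initcall(iomap_dio_zero_init);

iomap_dio_zero() could then add this pre-zeroed region in a single bio_vec instead of looping over ZERO_PAGE(0).
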
On Thu, Apr 25, 2024 at 11:22:35PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs
> > block size < page_size. This is true for most filesystems at the moment.
> >
> > If the block size > page size, this will send the contents of the page
> > next to the zero page (as len > PAGE_SIZE) to the underlying block
> > device, causing FS corruption.
> >
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
>
> So what happened to the plan to make huge_zero_page a folio and have

There's a series of commits in linux-mm with the titles:

sparc: use is_huge_zero_pmd()
mm: add is_huge_zero_folio()
mm: add pmd_folio()
mm: convert migrate_vma_collect_pmd to use a folio
mm: convert huge_zero_page to huge_zero_folio
mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
dax: use huge_zero_folio
mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio

> it available for non-hugetlb setups?  Not only would this be cleaner
> and more efficient, but it would actually work for the case where you'd
> have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> seem impossible with 2MB folios.

It is available for non-hugetlb setups.  It is however allocated on
demand, so it might not be available.

On Sat, Apr 27, 2024 at 04:26:44AM +0100, Matthew Wilcox wrote:
> There's a series of commits in linux-mm with the titles:
>
> sparc: use is_huge_zero_pmd()
> mm: add is_huge_zero_folio()
> mm: add pmd_folio()
> mm: convert migrate_vma_collect_pmd to use a folio
> mm: convert huge_zero_page to huge_zero_folio
> mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
> dax: use huge_zero_folio
> mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio
>
> > it available for non-hugetlb setups?  Not only would this be cleaner
> > and more efficient, but it would actually work for the case where you'd
> > have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> > seem impossible with 2MB folios.
>
> It is available for non-hugetlb setups.  It is however allocated on
> demand, so it might not be available.

We could just export get_huge_zero_page/put_huge_zero_page and make
sure it is available for block size > PAGE_SIZE file systems, or is
there a good argument against that?

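To make the suggestion concrete, here is a sketch of what iomap_dio_zero() might look like with get_huge_zero_page()/put_huge_zero_page() exported. This is not from the posted series: the bool return of get_huge_zero_page(), the visibility of the huge_zero_folio global from the rename series, and the fallback behaviour are all assumptions.

static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
		loff_t pos, unsigned len)
{
	struct inode *inode = file_inode(dio->iocb->ki_filp);
	struct folio *zero_folio = NULL;
	struct bio *bio;

	/*
	 * Try the (hypothetically exported) huge zero folio first.  It is
	 * allocated on demand and may not be available, so keep a per-page
	 * loop over ZERO_PAGE(0) as the fallback.
	 */
	if (len > PAGE_SIZE && get_huge_zero_page())
		zero_folio = huge_zero_folio;

	bio = iomap_dio_alloc_bio(iter, dio, zero_folio ? 1 : BIO_MAX_VECS,
				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
				  GFP_KERNEL);
	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
	bio->bi_private = dio;
	bio->bi_end_io = iomap_dio_bio_end_io;

	if (zero_folio) {
		/* one large zero extent; the reference must be dropped on bio completion */
		bio_add_folio_nofail(bio, zero_folio, len, 0);
	} else {
		while (len) {
			unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);

			__bio_add_page(bio, ZERO_PAGE(0), io_len, 0);
			len -= io_len;
		}
	}
	iomap_dio_submit_bio(iter, dio, bio, pos);
}
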
On Fri, Apr 26, 2024 at 11:43:01AM +0000, Pankaj Raghav (Samsung) wrote:
> Because allocating it during runtime will defeat the purpose.

Well, what runtime?  Either way it seems like we have the infrastructure
now based on the comment from willy.

> In any case, I would like to pursue the huge_zero_page folio separately
> from this series. Also, iomap_dio_zero() only pads a fs block with
> zeroes, which should never be > 64k for XFS.

Only if you are limited to a 64k block size.

On Fri, Apr 26, 2024 at 10:12:01PM -0700, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 11:43:01AM +0000, Pankaj Raghav (Samsung) wrote:
> > Because allocating it during runtime will defeat the purpose.
>
> Well, what runtime?  Either way it seems like we have the infrastructure
> now based on the comment from willy.

As willy pointed out in that reply, it is allocated on demand, so it
might still fail and we would have to fall back to looping. And if we
end up using the huge zero page, we should also make sure to decrement
the reference in iomap_dio_bio_end_io(), which is not needed when we
use ZERO_PAGE.

FWIW, I did a prototype with mm_huge_zero_page() in one of the older
series [1] (I forgot to decrement the reference in end_io()), but I did
not get a final response on whether that is the direction we want to go.
Let me know your thoughts.

[1] https://lore.kernel.org/linux-xfs/3pqmgrlewo6ctcwakdvbvjqixac5en6irlipe5aiz6vkylfyni@2luhrs36ke5r/#r

> > In any case, I would like to pursue the huge_zero_page folio separately
> > from this series. Also, iomap_dio_zero() only pads a fs block with
> > zeroes, which should never be > 64k for XFS.
>
> Only if you are limited to a 64k block size.

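A sketch of the completion-side change mentioned above: unlike ZERO_PAGE(0), which is never refcounted, a huge zero page pinned in iomap_dio_zero() would need its reference dropped when the bio completes. is_huge_zero_folio() and bio_first_page_all() exist today; an exported put_huge_zero_page() and the placement inside iomap_dio_bio_end_io() are assumptions, and the rest of the function is elided.

static void iomap_dio_bio_end_io(struct bio *bio)
{
	/* drop the reference taken when the zeroing bio was built */
	if (bio->bi_vcnt &&
	    is_huge_zero_folio(page_folio(bio_first_page_all(bio))))
		put_huge_zero_page();	/* hypothetical export from mm/huge_memory.c */

	/* ... existing completion handling of the struct iomap_dio and bio_put() ... */
}
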
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..5f481068de5b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }