| Message ID | 20240605092455.20435-2-kundan.kumar@samsung.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | block: add larger order folio instead of pages |
On Wed, Jun 05, 2024 at 02:54:54PM +0530, Kundan Kumar wrote:
> @@ -1214,27 +1214,27 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
>
> 	if (bio->bi_vcnt > 0 &&
> 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
> -				page, len, offset, &same_page)) {
> +				&folio->page, len, offset, &same_page)) {

Let's make that folio_page(folio, 0)

> +static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
> 		unsigned int len, unsigned int offset)
> {
> 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> 	bool same_page = false;
>
> -	if (bio_add_hw_page(q, bio, page, len, offset,
> +	if (bio_add_hw_page(q, bio, &folio->page, len, offset,

Likewise.

> @@ -1301,15 +1301,49 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>
> 	for (left = size, i = 0; left > 0; left -= len, i++) {
> 		struct page *page = pages[i];
> +		struct folio *folio = page_folio(page);
> +
> +		/* Calculate the offset of page in folio */
> +		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
> +				offset;

No it doesn't.  I'd drop the comment entirely, but if you must have a
comment, it turns the page offset into a folio offset.

> +		len = min_t(size_t, (folio_size(folio) - folio_offset), left);
> +
> +		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
> +
> +		if (num_pages > 1) {
> +			ssize_t bytes = left;
> +			size_t contig_sz = min_t(size_t, PAGE_SIZE - offset,
> +						 bytes);
> +
> +			/*
> +			 * Check if pages are contiguous and belong to the
> +			 * same folio.
> +			 */

You do like writing obvious comments, don't you?

> +			for (j = i + 1; j < i + num_pages; j++) {
> +				size_t next = min_t(size_t, PAGE_SIZE, bytes);
> +
> +				if (page_folio(pages[j]) != folio ||
> +				    pages[j] != pages[j - 1] + 1) {
> +					break;
> +				}
> +				contig_sz += next;
> +				bytes -= next;
> +			}
> +			num_pages = j - i;
> +			len = contig_sz;
> +		}
On Wed, Jun 05, 2024 at 02:54:54PM +0530, Kundan Kumar wrote:
> +			/*
> +			 * Check if pages are contiguous and belong to the
> +			 * same folio.
> +			 */

Oh, a useful comment here would be *why* we do this, not *what* we do.

	/*
	 * We might COW a single page in the middle of
	 * a large folio, so we have to check that all
	 * pages belong to the same folio.
	 */

for example.

> +			for (j = i + 1; j < i + num_pages; j++) {
> +				size_t next = min_t(size_t, PAGE_SIZE, bytes);
> +
> +				if (page_folio(pages[j]) != folio ||
> +				    pages[j] != pages[j - 1] + 1) {
> +					break;
> +				}
> +				contig_sz += next;
> +				bytes -= next;
> +			}
> +			num_pages = j - i;
> +			len = contig_sz;
> +		}
On Wed, Jun 05, 2024 at 07:45:23PM +0100, Matthew Wilcox wrote:
> > 	if (bio->bi_vcnt > 0 &&
> > 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
> > -				page, len, offset, &same_page)) {
> > +				&folio->page, len, offset, &same_page)) {
>
> Let's make that folio_page(folio, 0)

Or just pass a folio to bvec_try_merge_page (and rename it) if we want
to go all the way, but given that that will go down quite a bit into
xen and zone device code maybe that's not worth it for now as we should
eventually just convert it to a plain phys_addr_t anyway.

> > 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > 	bool same_page = false;
> >
> > -	if (bio_add_hw_page(q, bio, page, len, offset,
> > +	if (bio_add_hw_page(q, bio, &folio->page, len, offset,
>
> Likewise.

OTOH replacing / augmenting bio_add_hw_page with a bio_add_hw_folio
should be worthwhile, so I'd love to see that as a prep patch.
> @@ -1301,15 +1301,49 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>
> 	for (left = size, i = 0; left > 0; left -= len, i++) {
> 		struct page *page = pages[i];
> +		struct folio *folio = page_folio(page);
> +
> +		/* Calculate the offset of page in folio */
> +		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
> +				offset;
> +
> +		len = min_t(size_t, (folio_size(folio) - folio_offset), left);
> +
> +		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
> +
> +		if (num_pages > 1) {

I still hate having this logic in the block layer.  Even if it is just
a dumb wrapper with the same logic as here I'd much prefer to have
a iov_iter_extract_folios, which can then shift down into a
pin_user_folios_fast and into the MM slowly rather than adding this
logic to the caller.
On 06/06/24 06:56AM, Christoph Hellwig wrote:
>> @@ -1301,15 +1301,49 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>
>> 	for (left = size, i = 0; left > 0; left -= len, i++) {
>> 		struct page *page = pages[i];
>> +		struct folio *folio = page_folio(page);
>> +
>> +		/* Calculate the offset of page in folio */
>> +		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
>> +				offset;
>> +
>> +		len = min_t(size_t, (folio_size(folio) - folio_offset), left);
>> +
>> +		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
>> +
>> +		if (num_pages > 1) {
>
> I still hate having this logic in the block layer. Even if it is just
> a dumb wrapper with the same logic as here I'd much prefer to have
> a iov_iter_extract_folios, which can then shift down into a
> pin_user_folios_fast and into the MM slowly rather than adding this
> logic to the caller.

iov_iter_extract_folios will require allocation of a folio_vec array
which then will be filled after processing the pages[] array. This will
introduce extra allocation cost in the hot path.

--
Kundan
Hello,

kernel test robot noticed "WARNING:at_lib/iov_iter.c:#iov_iter_revert" on:

commit: d245fbe9b4ca466a625094e538ce363e4dbfde94 ("[PATCH v4 1/2] block: add folio awareness instead of looping through pages")
url: https://github.com/intel-lab-lkp/linux/commits/Kundan-Kumar/block-add-folio-awareness-instead-of-looping-through-pages/20240605-182718
base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/all/20240605092455.20435-2-kundan.kumar@samsung.com/
patch subject: [PATCH v4 1/2] block: add folio awareness instead of looping through pages

in testcase: ltp
version: ltp-x86_64-14c1f76-1_20240608
with following parameters:

	test: lvm.local-00

compiler: gcc-13
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (Kaby Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)

If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202406111633.69b1950e-lkp@intel.com

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240611/202406111633.69b1950e-lkp@intel.com

[ 130.031598][ T3261] ------------[ cut here ]------------
[ 130.037219][ T3261] WARNING: CPU: 3 PID: 3261 at lib/iov_iter.c:552 iov_iter_revert+0x1e7/0x250
[ 130.045995][ T3261] Modules linked in: vfat fat xfs ext2 btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal sd_mod intel_powerclamp t10_pi coretemp crc64_rocksoft_generic crc64_rocksoft crc64 kvm_intel sg ipmi_devintf ipmi_msghandler i915 kvm crct10dif_pclmul drm_buddy crc32_pclmul intel_gtt crc32c_intel ghash_clmulni_intel drm_display_helper sha512_ssse3 mei_wdt ttm rapl drm_kms_helper wmi_bmof ahci mei_me libahci video intel_cstate i2c_designware_platform i2c_designware_core intel_uncore idma64 i2c_i801 libata mei i2c_smbus pinctrl_sunrisepoint wmi acpi_pad binfmt_misc fuse loop drm dm_mod ip_tables
[ 130.104614][ T3261] CPU: 3 PID: 3261 Comm: pvcreate Tainted: G S 6.10.0-rc1-00017-gd245fbe9b4ca #1
[ 130.114995][ T3261] Hardware name: Dell Inc. OptiPlex 7050/062KRH, BIOS 1.2.0 12/22/2016
[ 130.123127][ T3261] RIP: 0010:iov_iter_revert+0x1e7/0x250
[ 130.128576][ T3261] Code: 48 89 f8 48 c1 e8 03 42 0f b6 04 30 84 c0 74 04 3c 03 7e 31 8b 45 08 49 83 c4 01 4d 89 65 20 48 39 d8 72 d1 49 89 6d 10 eb 89 <0f> 0b eb 8c 4c 89 ef e8 ed 5b 6a ff e9 74 fe ff ff e8 73 5c 6a ff
[ 130.148044][ T3261] RSP: 0018:ffffc90000f4f8e8 EFLAGS: 00010286
[ 130.154025][ T3261] RAX: 0000000000000000 RBX: fffffffffffff200 RCX: 1ffff1103df41cfd
[ 130.161898][ T3261] RDX: 1ffff1103df41d04 RSI: fffffffffffff200 RDI: ffffc90000f4fbf0
[ 130.169767][ T3261] RBP: 0000000000000002 R08: 0000000000099000 R09: 0000000000001200
[ 130.177636][ T3261] R10: 0000000000000002 R11: ffffea000f6a0000 R12: dffffc0000000000
[ 130.185502][ T3261] R13: ffff8881efa0e7c0 R14: ffffc90000f4fbf0 R15: fffffffffffff200
[ 130.193371][ T3261] FS:  00007fc2df84fd40(0000) GS:ffff88876c180000(0000) knlGS:0000000000000000
[ 130.202190][ T3261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 130.208672][ T3261] CR2: 000055bf19c62500 CR3: 0000000807f5c001 CR4: 00000000003706f0
[ 130.216537][ T3261] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 130.224413][ T3261] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 130.232292][ T3261] Call Trace:
[ 130.235486][ T3261]  <TASK>
[ 130.238335][ T3261]  ? __warn+0xcc/0x260
[ 130.242311][ T3261]  ? iov_iter_revert+0x1e7/0x250
[ 130.247138][ T3261]  ? report_bug+0x261/0x2c0
[ 130.251500][ T3261]  ? handle_bug+0x6d/0x90
[ 130.255683][ T3261]  ? exc_invalid_op+0x17/0x40
[ 130.260210][ T3261]  ? asm_exc_invalid_op+0x1a/0x20
[ 130.265085][ T3261]  ? iov_iter_revert+0x1e7/0x250
[ 130.269887][ T3261]  __bio_iov_iter_get_pages+0x868/0xcb0
[ 130.275282][ T3261]  ? bio_associate_blkg_from_css+0x2f5/0xbb0
[ 130.281111][ T3261]  ? __pfx___bio_iov_iter_get_pages+0x10/0x10
[ 130.287021][ T3261]  ? bio_init+0x34a/0x5c0
[ 130.291199][ T3261]  ? bio_alloc_bioset+0x28f/0x950
[ 130.296073][ T3261]  bio_iov_iter_get_pages+0xb6/0x470
[ 130.301205][ T3261]  __blkdev_direct_IO_async+0x3be/0x7f0
[ 130.306597][ T3261]  ? blkdev_direct_IO+0x1d7/0x310
[ 130.311467][ T3261]  blkdev_write_iter+0x5cd/0xac0
[ 130.316254][ T3261]  aio_write+0x344/0x730
[ 130.320346][ T3261]  ? __pfx_aio_write+0x10/0x10
[ 130.324961][ T3261]  ? _raw_spin_lock_irqsave+0x8b/0xf0
[ 130.330182][ T3261]  ? io_submit_one+0x213/0x9a0
[ 130.334794][ T3261]  io_submit_one+0x213/0x9a0
[ 130.339231][ T3261]  ? __pfx_io_submit_one+0x10/0x10
[ 130.344190][ T3261]  ? __pfx___might_resched+0x10/0x10
[ 130.349319][ T3261]  __x64_sys_io_submit+0x167/0x380
[ 130.354279][ T3261]  ? __pfx___x64_sys_io_submit+0x10/0x10
[ 130.359762][ T3261]  ? fpregs_restore_userregs+0xe3/0x1f0
[ 130.365153][ T3261]  do_syscall_64+0x5f/0x170
[ 130.369506][ T3261]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 130.375243][ T3261] RIP: 0033:0x7fc2dfd41719
[ 130.379508][ T3261] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
[ 130.398931][ T3261] RSP: 002b:00007ffc819ab548 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[ 130.407175][ T3261] RAX: ffffffffffffffda RBX: 00007fc2df84fb10 RCX: 00007fc2dfd41719
[ 130.414988][ T3261] RDX: 00007ffc819ab5f0 RSI: 0000000000000001 RDI: 00007fc2e00cf000
[ 130.422803][ T3261] RBP: 00007fc2e00cf000 R08: 00007ffc819ab490 R09: 0000000000000008
[ 130.430616][ T3261] R10: 000055bf19955608 R11: 0000000000000246 R12: 0000000000000001
[ 130.438425][ T3261] R13: 0000000000000002 R14: 00007ffc819ab5f0 R15: 00007fc2e010e020
[ 130.446238][ T3261]  </TASK>
[ 130.449114][ T3261] ---[ end trace 0000000000000000 ]---
[ 130.455322][ T3261] ------------[ cut here ]------------
[ 130.460954][ T3261] WARNING: CPU: 7 PID: 3261 at lib/iov_iter.c:552 iov_iter_revert+0x1e7/0x250
diff --git a/block/bio.c b/block/bio.c
index e9e809a63c59..7857b9ca5957 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1204,7 +1204,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static int bio_iov_add_page(struct bio *bio, struct page *page,
+static int bio_iov_add_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
@@ -1214,27 +1214,27 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 
 	if (bio->bi_vcnt > 0 &&
 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset, &same_page)) {
+				&folio->page, len, offset, &same_page)) {
 		bio->bi_iter.bi_size += len;
 		if (same_page)
-			bio_release_page(bio, page);
+			bio_release_page(bio, &folio->page);
 		return 0;
 	}
-	__bio_add_page(bio, page, len, offset);
+	bio_add_folio_nofail(bio, folio, len, offset);
 	return 0;
 }
 
-static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
+static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	bool same_page = false;
 
-	if (bio_add_hw_page(q, bio, page, len, offset,
+	if (bio_add_hw_page(q, bio, &folio->page, len, offset,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		bio_release_page(bio, page);
+		bio_release_page(bio, &folio->page);
 	return 0;
 }
 
@@ -1258,9 +1258,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
-	unsigned len, i = 0;
-	size_t offset;
-	int ret = 0;
+	unsigned int len, i = 0, j;
+	size_t offset, folio_offset;
+	int ret = 0, num_pages;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1301,15 +1301,49 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
+		struct folio *folio = page_folio(page);
+
+		/* Calculate the offset of page in folio */
+		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+				offset;
+
+		len = min_t(size_t, (folio_size(folio) - folio_offset), left);
+
+		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
+
+		if (num_pages > 1) {
+			ssize_t bytes = left;
+			size_t contig_sz = min_t(size_t, PAGE_SIZE - offset,
+						 bytes);
+
+			/*
+			 * Check if pages are contiguous and belong to the
+			 * same folio.
+			 */
+			for (j = i + 1; j < i + num_pages; j++) {
+				size_t next = min_t(size_t, PAGE_SIZE, bytes);
+
+				if (page_folio(pages[j]) != folio ||
+				    pages[j] != pages[j - 1] + 1) {
+					break;
+				}
+				contig_sz += next;
+				bytes -= next;
+			}
+			num_pages = j - i;
+			len = contig_sz;
+		}
 
-		len = min_t(size_t, PAGE_SIZE - offset, left);
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			ret = bio_iov_add_zone_append_page(bio, page, len,
-					offset);
+			ret = bio_iov_add_zone_append_folio(bio, folio, len,
+					folio_offset);
 			if (ret)
 				break;
 		} else
-			bio_iov_add_page(bio, page, len, offset);
+			bio_iov_add_folio(bio, folio, len, folio_offset);
+
+		/* Skip the pages which got added */
+		i = i + (num_pages - 1);
 
 		offset = 0;
 	}
Add a larger size from the folio to the bio and skip merge processing
for the individual pages. Fetch the offset of the page within its folio.
Depending on the folio size and folio_offset, fetch a larger length;
this length may span multiple contiguous pages if the folio is
multi-order. Using this length, calculate the number of pages that will
be added to the bio and increment the loop counter to skip those pages.

There is also a check that the pages are contiguous and belong to the
same folio, since a COW may break the contiguous mapping of the folio's
pages.

This technique avoids the overhead of merging pages that belong to the
same large-order folio.

Also folio-ize the functions bio_iov_add_page() and
bio_iov_add_zone_append_page().

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 62 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 48 insertions(+), 14 deletions(-)