Message ID | 20240419091721.1790-1-kundan.kumar@samsung.com
---|---
State | New, archived
Series | block: add larger order folio size instead of pages
On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
> When mTHP is enabled, IO can contain larger folios instead of pages.
> In such cases add a larger size to the bio instead of looping through
> pages. This reduces the overhead of iterating through pages for larger
> block sizes. perf diff before and after this change:
>
> Perf diff for write I/O with 128K block size:
> 	1.22%    -0.97%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
> Perf diff for read I/O with 128K block size:
> 	4.13%    -3.26%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

I'm a bit confused by this to be honest.  We already merge adjacent
pages, and it doesn't look to be _that_ expensive.  Can you drill down
any further in the perf stats and show what the expensive part is?
On 19/04/24 03:16PM, Matthew Wilcox wrote:
> On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
>> When mTHP is enabled, IO can contain larger folios instead of pages.
>> In such cases add a larger size to the bio instead of looping through
>> pages. This reduces the overhead of iterating through pages for larger
>> block sizes. perf diff before and after this change:
>>
>> Perf diff for write I/O with 128K block size:
>> 	1.22%    -0.97%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>> Perf diff for read I/O with 128K block size:
>> 	4.13%    -3.26%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>
> I'm a bit confused by this to be honest.  We already merge adjacent
> pages, and it doesn't look to be _that_ expensive.  Can you drill down
> any further in the perf stats and show what the expensive part is?

The majority of the overhead comes from the repeated calls to
bvec_try_merge_page(). For a 128K I/O we call this function 32 times.
The comparisons and calculations done by bvec_try_merge_page() add to
the overall overhead[1].

bio_iov_iter_get_pages() shows a reduction of overhead at these
places[2].

This patch reduces the overhead, as evident from the perf diff:

    4.17%  -3.21%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

Also:

    5.54%          [kernel.kallsyms]  [k] bvec_try_merge_page

The above perf diff was obtained by running the fio command in [3].

Note: these experiments were done after enabling mTHP, where we get one
folio for a 128K I/O.

[1]
I.
         :   14    size_t bv_end = bv->bv_offset + bv->bv_len;
         :   15    phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
         :   16    phys_addr_t page_addr = page_to_phys(page);
         :   18    if (vec_end_addr + 1 != page_addr + off)
    3.21 : ffffffff817ac796:  mov    %ecx,%eax
         :   20    {
    1.40 : ffffffff817ac798:  mov    %rsp,%rbp
    3.14 : ffffffff817ac79b:  push   %r15
    2.35 : ffffffff817ac79d:  push   %r14
    3.13 : ffffffff817ac7a4:  push   %r12

II.
         :   113   if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
    1.84 : ffffffff817ac83e:  shr    $0xc,%ecx
    3.09 : ffffffff817ac841:  shr    $0xc,%r15
    8.52 : ffffffff817ac84d:  add    0x0(%r13),%r15
    0.65 : ffffffff817ac851:  add    %rcx,%r14
    0.62 : ffffffff817ac854:  cmp    %r14,%r15
    0.61 : ffffffff817ac857:  je     ffffffff817ac86e <bvec_try_merge_page+0xde>

[2]
I.
         :   206   struct page *page = pages[i];
    4.92 : ffffffff817af307:  mov    -0x40(%rbp),%rdx
    3.97 : ffffffff817af30b:  mov    %r13d,%eax

II.
         :   198   for (left = size, i = 0; left > 0; left -= len, i++) {
    0.95 : ffffffff817af2f0:  add    $0x1,%r13d
    4.80 : ffffffff817af2f6:  sub    %rbx,%r12
    2.87 : ffffffff817af2f9:  test   %r12,%r12

III.
         :   167   if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
    3.91 : ffffffff817af295:  jb     ffffffff817af547 <bio_iov_iter_get_pages+0x3e7>
         :   169   if (bio->bi_vcnt > 0 &&
    2.98 : ffffffff817af2a0:  test   %ax,%ax
         :   173   bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
    3.45 : ffffffff817af2a5:  movzwl %ax,%edi
    1.07 : ffffffff817af2ab:  lea    -0x41(%rbp),%r8
    3.08 : ffffffff817af2af:  mov    %r10,-0x50(%rbp)
    5.77 : ffffffff817af2b3:  sub    $0x1,%edi
    0.96 : ffffffff817af2b6:  mov    %ebx,-0x58(%rbp)
    0.95 : ffffffff817af2b9:  movslq %edi,%rdi
    2.88 : ffffffff817af2bc:  shl    $0x4,%rdi
    0.96 : ffffffff817af2c0:  add    0x70(%r15),%rdi

[3]
perf record -o fio_128k_block_read.data fio -iodepth=128 -iomem_align=128K \
	-iomem=mmap -rw=randread -direct=1 -ioengine=io_uring -bs=128K -numjobs=1 \
	-runtime=1m -group_reporting -filename=/dev/nvme1n1 -name=io_uring_test

--
Kundan
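For context, the hot comparisons annotated in [1] above are the physical
contiguity and page crossing checks that bvec_try_merge_page() has to
repeat for every 4K page. The fragment below is only a simplified sketch
of those checks (not the exact kernel source), included to illustrate the
per-page work that a 128K I/O currently pays for 32 times when the bio is
filled page by page:

/*
 * Simplified sketch of the checks bvec_try_merge_page() performs for
 * every page added to the bio (compare the annotated assembly in [1]);
 * illustrative only, not the exact kernel source.
 */
static bool try_merge_page_sketch(struct bio_vec *bv, struct page *page,
				  unsigned int len, unsigned int off)
{
	size_t bv_end = bv->bv_offset + bv->bv_len;
	phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
	phys_addr_t page_addr = page_to_phys(page);

	/* the new page must be physically contiguous with the last bvec */
	if (vec_end_addr + 1 != page_addr + off)
		return false;
	/* and must continue in the expected struct page */
	if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
		return false;

	bv->bv_len += len;
	return true;
}

Adding the whole large folio in a single bio_iov_add_page() call, as the
patch below does, attempts this merge at most once per folio.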
> +		folio = page_folio(page);
> +
> +		if (!folio_test_large(folio) ||
> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {

I don't understand why you need this branch.  All the arithmetic below
should also work just fine for non-large folios, and there, while the
same_page logic in bio_iov_add_zone_append_page probably needs to be
folio-ized first, it should be handled the same way here as well.

bio_iov_add_page should also be moved to take a folio before the
(otherwise nice) changes here.

And of course in the long run we really need a folio version of
pin_user_pages and iov_iter_extract_pages.
On 22/04/24 01:14PM, Christoph Hellwig wrote:
>> +		folio = page_folio(page);
>> +
>> +		if (!folio_test_large(folio) ||
>> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
>
> I don't understand why you need this branch.  All the arithmetic
> below should also work just fine for non-large folios

The branch helps to skip these calculations for a zero order folio:
A) folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) + offset;
B) folio_size(folio)

> and there, while the same_page logic in bio_iov_add_zone_append_page
> probably needs to be folio-ized first, it should be handled the same
> way here as well.

Regarding the same_page logic: if we add the same page twice, we
release the page on the second addition. It seemed to me that this
logic will work even if we merge large order folios. Please let me know
if I am missing something.

If we pass a large folio size to bio_iov_add_zone_append_page we fail
early due to the queue_max_zone_append_sectors limit. This can be
modified to add fewer pages which are part of the bigger folio. Let me
know if I shall proceed this way, or if it is fine not to add the
entire folio.

> bio_iov_add_page should also be moved to take a folio before the
> (otherwise nice) changes here.

If we convert bio_iov_add_page() to bio_iov_add_folio()/bio_add_folio(),
we see a decline of about 11% for 4K I/O. When mTHP is enabled we may
get a large order folio even for a 4K I/O. The folio_offset may become
larger than 4K and we end up using the expensive mempool_alloc during
nvme_map_data in the NVMe driver[1].

[1]
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
		struct nvme_command *cmnd)
{
	...
	...
	if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
		return nvme_setup_prp_simple(dev, req, &cmnd->rw, &bv);
	...
	...
	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
	...
	...
}

--
Kundan
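To make the regression described above concrete, consider a hypothetical
4K I/O whose buffer happens to sit in the third page of a 16K folio
(NVME_CTRL_PAGE_SIZE is 4096, so the fast-path limit in the check quoted
in [1] is 8192). This example is a sketch of the mechanism Kundan
describes, not a measured case:

/*
 * Hypothetical 4K I/O placed in the third page of a 16K folio:
 *
 *   bio_iov_add_page():  bv_page = that subpage,    bv_offset = 0
 *       0 + 4096 = 4096  <= 8192    -> nvme_setup_prp_simple()
 *
 *   bio_add_folio():     bv_page = folio head page, bv_offset = 8192
 *       8192 + 4096 = 12288 > 8192  -> falls through to mempool_alloc()
 */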
On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
> @@ -1289,16 +1291,33 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>
>  	for (left = size, i = 0; left > 0; left -= len, i++) {
>  		struct page *page = pages[i];
> +		folio = page_folio(page);
> +
> +		if (!folio_test_large(folio) ||
> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
> +			len = min_t(size_t, PAGE_SIZE - offset, left);
> +			if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +				ret = bio_iov_add_zone_append_page(bio, page,
> +						len, offset);
> +				if (ret)
> +					break;
> +			} else
> +				bio_iov_add_page(bio, page, len, offset);
> +		} else {
> +			/* See the offset of folio and the size */
> +			folio_offset = (folio_page_idx(folio, page)
> +					<< PAGE_SHIFT) + offset;
> +			size_folio = folio_size(folio);
>
> -		len = min_t(size_t, PAGE_SIZE - offset, left);
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -			ret = bio_iov_add_zone_append_page(bio, page, len,
> -						offset);
> -			if (ret)
> -				break;
> -		} else
> -			bio_iov_add_page(bio, page, len, offset);
> +			/* Calculate the length of folio to be added */
> +			len = min_t(size_t, (size_folio - folio_offset), left);
> +
> +			num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
>
> +			bio_iov_add_page(bio, page, len, offset);

I think there's another optimization to be had here. You only need one
reference on the folio for all its pages, so I believe you can safely
drop (num_pages - 1) references right here.

Then __bio_release_pages() can be further simplified by removing the
'do {...} while ()' loop releasing individual pages.

I tested this atop your patch, and it looks okay so far. This could be
more efficient if we had access to gup_put_folio() since we already
know all the pages are part of the same folio (unpin_user_pages()
recalculates that).

---
diff --git a/block/bio.c b/block/bio.c
index 469606494f8f7..9829c79494108 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1155,7 +1155,6 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 
 	bio_for_each_folio_all(fi, bio) {
 		struct page *page;
-		size_t nr_pages;
 
 		if (mark_dirty) {
 			folio_lock(fi.folio);
@@ -1163,11 +1162,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 			folio_unlock(fi.folio);
 		}
 		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
-		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
-			   fi.offset / PAGE_SIZE + 1;
-		do {
-			bio_release_page(bio, page++);
-		} while (--nr_pages != 0);
+		bio_release_page(bio, page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1315,6 +1310,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 			num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
 			bio_iov_add_page(bio, page, len, offset);
+			if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
+				unpin_user_pages(pages + i, num_pages - 1);
 			/* Skip the pages which got added */
 			i = i + (num_pages - 1);
 		}
--
On Wed, Apr 24, 2024 at 06:52:46PM +0530, Kundan Kumar wrote:
>> while the same_page logic in bio_iov_add_zone_append_page probably
>> needs to be folio-ized first, it should be handled the same way here
>> as well.
>
> Regarding the same_page logic: if we add the same page twice, we
> release the page on the second addition. It seemed to me that this
> logic will work even if we merge large order folios. Please let me
> know if I am missing something.

The 'same_page' logic should only happen with successive calls to
__bio_iov_iter_get_pages(), and only for the first page considered, so
I think the logic is handled the same with your change.
On Wed, Apr 24, 2024 at 06:52:46PM +0530, Kundan Kumar wrote:
> On 22/04/24 01:14PM, Christoph Hellwig wrote:
>>> +		folio = page_folio(page);
>>> +
>>> +		if (!folio_test_large(folio) ||
>>> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
>>
>> I don't understand why you need this branch.  All the arithmetic
>> below should also work just fine for non-large folios
>
> The branch helps to skip these calculations for a zero order folio:
> A) folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) + offset;
> B) folio_size(folio)

Well, we'll need to just handle folios and stop special casing order 0
ones eventually.

> If we convert bio_iov_add_page() to bio_iov_add_folio()/bio_add_folio(),
> we see a decline of about 11% for 4K I/O. When mTHP is enabled we may
> get a large order folio even for a 4K I/O. The folio_offset may become
> larger than 4K and we end up using the expensive mempool_alloc during
> nvme_map_data in the NVMe driver[1].
>
> [1]
> static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> 		struct nvme_command *cmnd)
> {
> 	...
> 	...
> 	if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)

We can replace this with:

	if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) + bv.bv_len <=
			NVME_CTRL_PAGE_SIZE * 2)

as nvme_setup_prp_simple just masks away the high bits anyway.
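Tracing the same hypothetical 4K-in-a-16K-folio example from the earlier
note through both forms of the check shows why the masked version
restores the fast path (again assuming NVME_CTRL_PAGE_SIZE == 4096):

/*
 * bv.bv_offset = 8192, bv.bv_len = 4096:
 *
 *   current check:  8192 + 4096 = 12288 > 8192          -> mempool_alloc()
 *   masked check:  (8192 & 4095) + 4096 = 4096 <= 8192  -> nvme_setup_prp_simple()
 *
 * This is safe because, as noted above, nvme_setup_prp_simple() only
 * uses the offset within the controller page when building the PRP
 * entries, so the high bits of bv_offset do not matter.
 */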
diff --git a/block/bio.c b/block/bio.c
index 38baedb39c6f..c507e47e3c46 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1247,8 +1247,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset;
+	size_t offset, folio_offset, size_folio;
 	int ret = 0;
+	unsigned short num_pages;
+	struct folio *folio;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1289,16 +1291,33 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
+		folio = page_folio(page);
+
+		if (!folio_test_large(folio) ||
+		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
+			len = min_t(size_t, PAGE_SIZE - offset, left);
+			if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+				ret = bio_iov_add_zone_append_page(bio, page,
+						len, offset);
+				if (ret)
+					break;
+			} else
+				bio_iov_add_page(bio, page, len, offset);
+		} else {
+			/* See the offset of folio and the size */
+			folio_offset = (folio_page_idx(folio, page)
+					<< PAGE_SHIFT) + offset;
+			size_folio = folio_size(folio);
 
-		len = min_t(size_t, PAGE_SIZE - offset, left);
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			ret = bio_iov_add_zone_append_page(bio, page, len,
-						offset);
-			if (ret)
-				break;
-		} else
-			bio_iov_add_page(bio, page, len, offset);
+			/* Calculate the length of folio to be added */
+			len = min_t(size_t, (size_folio - folio_offset), left);
+
+			num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
+			bio_iov_add_page(bio, page, len, offset);
+			/* Skip the pages which got added */
+			i = i + (num_pages - 1);
+		}
 
 		offset = 0;
 	}
When mTHP is enabled, IO can contain larger folios instead of pages. In
such cases add a larger size to the bio instead of looping through
pages. This reduces the overhead of iterating through pages for larger
block sizes. perf diff before and after this change:

Perf diff for write I/O with 128K block size:

    1.22%    -0.97%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

Perf diff for read I/O with 128K block size:

    4.13%    -3.26%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)
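To illustrate the folio branch added by the patch, here is a worked
example with hypothetical but representative values (assuming 4K pages,
i.e. PAGE_SHIFT == 12), matching the fio workload quoted earlier where a
folio-aligned 128K buffer is backed by a single 128K folio:

/*
 * First (and only) loop iteration for a folio-aligned 128K I/O backed
 * by one order-5 (128K) folio:
 *
 *   offset       = 0
 *   folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) + offset = 0
 *   size_folio   = folio_size(folio) = 131072
 *   len          = min(size_folio - folio_offset, left) = 131072
 *   num_pages    = DIV_ROUND_UP(offset + len, PAGE_SIZE) = 32
 *
 * A single bio_iov_add_page() call covers the whole folio, i is
 * advanced by num_pages - 1 = 31, and the loop body runs once instead
 * of 32 times.
 */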