Message ID | 20240419091721.1790-1-kundan.kumar@samsung.com
---|---
State | New, archived
Series | block: add larger order folio size instead of pages
On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
> When mTHP is enabled, IO can contain larger folios instead of pages.
> In such cases add a larger size to the bio instead of looping through
> pages. This reduces the overhead of iterating through pages for larger
> block sizes. perf diff before and after this change:
>
> Perf diff for write I/O with 128K block size:
> 	1.22%    -0.97%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
> Perf diff for read I/O with 128K block size:
> 	4.13%    -3.26%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

I'm a bit confused by this to be honest.  We already merge adjacent
pages, and it doesn't look to be _that_ expensive.  Can you drill down
any further in the perf stats and show what the expensive part is?
On 19/04/24 03:16PM, Matthew Wilcox wrote:
> On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
>> When mTHP is enabled, IO can contain larger folios instead of pages.
>> In such cases add a larger size to the bio instead of looping through
>> pages. This reduces the overhead of iterating through pages for larger
>> block sizes. perf diff before and after this change:
>>
>> Perf diff for write I/O with 128K block size:
>> 	1.22%    -0.97%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>> Perf diff for read I/O with 128K block size:
>> 	4.13%    -3.26%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>
> I'm a bit confused by this to be honest.  We already merge adjacent
> pages, and it doesn't look to be _that_ expensive.  Can you drill down
> any further in the perf stats and show what the expensive part is?

The majority of the overhead comes from the repeated calls to
bvec_try_merge_page(). For a 128K I/O we call this function 32 times.
The comparisons and calculations done by bvec_try_merge_page() add to
the overall overhead[1].

bio_iov_iter_get_pages() shows a reduction of overhead at these
places[2].

This patch reduces the overhead, as evident from the perf diff:

    4.17%  -3.21%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

Also:

    5.54%          [kernel.kallsyms]  [k] bvec_try_merge_page

The above perf diff was obtained by running the fio command in [3].

Note: these experiments were done after enabling mTHP, where we get one
folio for a 128K I/O.

[1]
I.
         :   14    size_t bv_end = bv->bv_offset + bv->bv_len;
         :   15    phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
         :   16    phys_addr_t page_addr = page_to_phys(page);
         :   18    if (vec_end_addr + 1 != page_addr + off)
    3.21 : ffffffff817ac796:  mov    %ecx,%eax
         :   20    {
    1.40 : ffffffff817ac798:  mov    %rsp,%rbp
    3.14 : ffffffff817ac79b:  push   %r15
    2.35 : ffffffff817ac79d:  push   %r14
    3.13 : ffffffff817ac7a4:  push   %r12

II.
         :   113   if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
    1.84 : ffffffff817ac83e:  shr    $0xc,%ecx
    3.09 : ffffffff817ac841:  shr    $0xc,%r15
    8.52 : ffffffff817ac84d:  add    0x0(%r13),%r15
    0.65 : ffffffff817ac851:  add    %rcx,%r14
    0.62 : ffffffff817ac854:  cmp    %r14,%r15
    0.61 : ffffffff817ac857:  je     ffffffff817ac86e <bvec_try_merge_page+0xde>

[2]
I.
         :   206   struct page *page = pages[i];
    4.92 : ffffffff817af307:  mov    -0x40(%rbp),%rdx
    3.97 : ffffffff817af30b:  mov    %r13d,%eax

II.
         :   198   for (left = size, i = 0; left > 0; left -= len, i++) {
    0.95 : ffffffff817af2f0:  add    $0x1,%r13d
    4.80 : ffffffff817af2f6:  sub    %rbx,%r12
    2.87 : ffffffff817af2f9:  test   %r12,%r12

III.
         :   167   if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
    3.91 : ffffffff817af295:  jb     ffffffff817af547 <bio_iov_iter_get_pages+0x3e7>
         :   169   if (bio->bi_vcnt > 0 &&
    2.98 : ffffffff817af2a0:  test   %ax,%ax
         :   173   bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
    3.45 : ffffffff817af2a5:  movzwl %ax,%edi
    1.07 : ffffffff817af2ab:  lea    -0x41(%rbp),%r8
    3.08 : ffffffff817af2af:  mov    %r10,-0x50(%rbp)
    5.77 : ffffffff817af2b3:  sub    $0x1,%edi
    0.96 : ffffffff817af2b6:  mov    %ebx,-0x58(%rbp)
    0.95 : ffffffff817af2b9:  movslq %edi,%rdi
    2.88 : ffffffff817af2bc:  shl    $0x4,%rdi
    0.96 : ffffffff817af2c0:  add    0x70(%r15),%rdi

[3]
perf record -o fio_128k_block_read.data fio -iodepth=128 -iomem_align=128K \
	-iomem=mmap -rw=randread -direct=1 -ioengine=io_uring -bs=128K -numjobs=1 \
	-runtime=1m -group_reporting -filename=/dev/nvme1n1 -name=io_uring_test

--
Kundan
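For context, the hot comparisons annotated in [1] above are the physical
contiguity and page crossing checks that bvec_try_merge_page() has to
repeat for every 4K page. The fragment below is only a simplified sketch
of those checks (not the exact kernel source), included to illustrate the
per-page work that a 128K I/O currently pays for 32 times when the bio is
filled page by page:

/*
 * Simplified sketch of the checks bvec_try_merge_page() performs for
 * every page added to the bio (compare the annotated assembly in [1]);
 * illustrative only, not the exact kernel source.
 */
static bool try_merge_page_sketch(struct bio_vec *bv, struct page *page,
				  unsigned int len, unsigned int off)
{
	size_t bv_end = bv->bv_offset + bv->bv_len;
	phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) + bv_end - 1;
	phys_addr_t page_addr = page_to_phys(page);

	/* the new page must be physically contiguous with the last bvec */
	if (vec_end_addr + 1 != page_addr + off)
		return false;
	/* and must continue in the expected struct page */
	if (bv->bv_page + bv_end / PAGE_SIZE != page + off / PAGE_SIZE)
		return false;

	bv->bv_len += len;
	return true;
}

Adding the whole large folio in a single bio_iov_add_page() call, as the
patch below does, attempts this merge at most once per folio.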
> +		folio = page_folio(page);
> +
> +		if (!folio_test_large(folio) ||
> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {

I don't understand why you need this branch.  All the arithmetic below
should also work just fine for non-large folios, and there, while the
same_page logic in bio_iov_add_zone_append_page probably needs to be
folio-ized first, it should be handled the same way here as well.

bio_iov_add_page should also be moved to take a folio before the
(otherwise nice) changes here.

And of course in the long run we really need a folio version of
pin_user_pages and iov_iter_extract_pages.
On 22/04/24 01:14PM, Christoph Hellwig wrote:
>> +		folio = page_folio(page);
>> +
>> +		if (!folio_test_large(folio) ||
>> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
>
> I don't understand why you need this branch.  All the arithmetic
> below should also work just fine for non-large folios

The branch helps to skip these calculations for a zero order folio:
A) folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) + offset;
B) folio_size(folio)

> and there, while the same_page logic in bio_iov_add_zone_append_page
> probably needs to be folio-ized first, it should be handled the same
> way here as well.

Regarding the same_page logic: if we add the same page twice, we
release the page on the second addition. It seemed to me that this
logic will work even if we merge large order folios. Please let me know
if I am missing something.

If we pass a large folio size to bio_iov_add_zone_append_page we fail
early due to the queue_max_zone_append_sectors limit. This can be
modified to add fewer pages which are part of the bigger folio. Let me
know if I shall proceed this way, or if it is fine not to add the
entire folio.

> bio_iov_add_page should also be moved to take a folio before the
> (otherwise nice) changes here.

If we convert bio_iov_add_page() to bio_iov_add_folio()/bio_add_folio(),
we see a decline of about 11% for 4K I/O. When mTHP is enabled we may
get a large order folio even for a 4K I/O. The folio_offset may become
larger than 4K and we end up using the expensive mempool_alloc during
nvme_map_data in the NVMe driver[1].

[1]
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
		struct nvme_command *cmnd)
{
	...
	...
	if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
		return nvme_setup_prp_simple(dev, req, &cmnd->rw, &bv);
	...
	...
	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
	...
	...
}

--
Kundan
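To make the regression described above concrete, consider a hypothetical
4K I/O whose buffer happens to sit in the third page of a 16K folio
(NVME_CTRL_PAGE_SIZE is 4096, so the fast-path limit in the check quoted
in [1] is 8192). This example is a sketch of the mechanism Kundan
describes, not a measured case:

/*
 * Hypothetical 4K I/O placed in the third page of a 16K folio:
 *
 *   bio_iov_add_page():  bv_page = that subpage,    bv_offset = 0
 *       0 + 4096 = 4096  <= 8192    -> nvme_setup_prp_simple()
 *
 *   bio_add_folio():     bv_page = folio head page, bv_offset = 8192
 *       8192 + 4096 = 12288 > 8192  -> falls through to mempool_alloc()
 */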
On Fri, Apr 19, 2024 at 02:47:21PM +0530, Kundan Kumar wrote:
> @@ -1289,16 +1291,33 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>
>  	for (left = size, i = 0; left > 0; left -= len, i++) {
>  		struct page *page = pages[i];
> +		folio = page_folio(page);
> +
> +		if (!folio_test_large(folio) ||
> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
> +			len = min_t(size_t, PAGE_SIZE - offset, left);
> +			if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +				ret = bio_iov_add_zone_append_page(bio, page,
> +						len, offset);
> +				if (ret)
> +					break;
> +			} else
> +				bio_iov_add_page(bio, page, len, offset);
> +		} else {
> +			/* See the offset of folio and the size */
> +			folio_offset = (folio_page_idx(folio, page)
> +					<< PAGE_SHIFT) + offset;
> +			size_folio = folio_size(folio);
>
> -		len = min_t(size_t, PAGE_SIZE - offset, left);
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -			ret = bio_iov_add_zone_append_page(bio, page, len,
> -						offset);
> -			if (ret)
> -				break;
> -		} else
> -			bio_iov_add_page(bio, page, len, offset);
> +			/* Calculate the length of folio to be added */
> +			len = min_t(size_t, (size_folio - folio_offset), left);
> +
> +			num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
>
> +			bio_iov_add_page(bio, page, len, offset);

I think there's another optimization to be had here. You only need one
reference on the folio for all its pages, so I believe you can safely
drop (num_pages - 1) references right here.

Then __bio_release_pages() can be further simplified by removing the
'do {...} while ()' loop releasing individual pages.

I tested this atop your patch, and it looks okay so far. This could be
more efficient if we had access to gup_put_folio() since we already
know all the pages are part of the same folio (unpin_user_pages()
recalculates that).

---
diff --git a/block/bio.c b/block/bio.c
index 469606494f8f7..9829c79494108 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1155,7 +1155,6 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 
 	bio_for_each_folio_all(fi, bio) {
 		struct page *page;
-		size_t nr_pages;
 
 		if (mark_dirty) {
 			folio_lock(fi.folio);
@@ -1163,11 +1162,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 			folio_unlock(fi.folio);
 		}
 		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
-		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
-			   fi.offset / PAGE_SIZE + 1;
-		do {
-			bio_release_page(bio, page++);
-		} while (--nr_pages != 0);
+		bio_release_page(bio, page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1315,6 +1310,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 			num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
 			bio_iov_add_page(bio, page, len, offset);
+			if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
+				unpin_user_pages(pages + i, num_pages - 1);
 			/* Skip the pages which got added */
 			i = i + (num_pages - 1);
 		}
--
On Wed, Apr 24, 2024 at 06:52:46PM +0530, Kundan Kumar wrote:
>> while the same_page logic in bio_iov_add_zone_append_page probably
>> needs to be folio-ized first, it should be handled the same way here
>> as well.
>
> Regarding the same_page logic: if we add the same page twice, we
> release the page on the second addition. It seemed to me that this
> logic will work even if we merge large order folios. Please let me
> know if I am missing something.

The 'same_page' logic should only happen with successive calls to
__bio_iov_iter_get_pages(), and only for the first page considered, so
I think the logic is handled the same with your change.
On Wed, Apr 24, 2024 at 06:52:46PM +0530, Kundan Kumar wrote:
> On 22/04/24 01:14PM, Christoph Hellwig wrote:
>>> +		folio = page_folio(page);
>>> +
>>> +		if (!folio_test_large(folio) ||
>>> +		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
>>
>> I don't understand why you need this branch.  All the arithmetic
>> below should also work just fine for non-large folios
>
> The branch helps to skip these calculations for a zero order folio:
> A) folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) + offset;
> B) folio_size(folio)

Well, we'll need to just handle folios and stop special casing order 0
ones eventually.

> If we convert bio_iov_add_page() to bio_iov_add_folio()/bio_add_folio(),
> we see a decline of about 11% for 4K I/O. When mTHP is enabled we may
> get a large order folio even for a 4K I/O. The folio_offset may become
> larger than 4K and we end up using the expensive mempool_alloc during
> nvme_map_data in the NVMe driver[1].
>
> [1]
> static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> 		struct nvme_command *cmnd)
> {
> 	...
> 	...
> 	if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)

We can replace this with:

	if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) + bv.bv_len <=
			NVME_CTRL_PAGE_SIZE * 2)

as nvme_setup_prp_simple just masks away the high bits anyway.
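Tracing the same hypothetical 4K-in-a-16K-folio example from the earlier
note through both forms of the check shows why the masked version
restores the fast path (again assuming NVME_CTRL_PAGE_SIZE == 4096):

/*
 * bv.bv_offset = 8192, bv.bv_len = 4096:
 *
 *   current check:  8192 + 4096 = 12288 > 8192          -> mempool_alloc()
 *   masked check:  (8192 & 4095) + 4096 = 4096 <= 8192  -> nvme_setup_prp_simple()
 *
 * This is safe because, as noted above, nvme_setup_prp_simple() only
 * uses the offset within the controller page when building the PRP
 * entries, so the high bits of bv_offset do not matter.
 */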
diff --git a/block/bio.c b/block/bio.c
index 38baedb39c6f..c507e47e3c46 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1247,8 +1247,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset;
+	size_t offset, folio_offset, size_folio;
 	int ret = 0;
+	unsigned short num_pages;
+	struct folio *folio;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1289,16 +1291,33 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
+		folio = page_folio(page);
+
+		if (!folio_test_large(folio) ||
+		    (bio_op(bio) == REQ_OP_ZONE_APPEND)) {
+			len = min_t(size_t, PAGE_SIZE - offset, left);
+			if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+				ret = bio_iov_add_zone_append_page(bio, page,
+						len, offset);
+				if (ret)
+					break;
+			} else
+				bio_iov_add_page(bio, page, len, offset);
+		} else {
+			/* See the offset of folio and the size */
+			folio_offset = (folio_page_idx(folio, page)
+					<< PAGE_SHIFT) + offset;
+			size_folio = folio_size(folio);
 
-		len = min_t(size_t, PAGE_SIZE - offset, left);
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			ret = bio_iov_add_zone_append_page(bio, page, len,
-						offset);
-			if (ret)
-				break;
-		} else
-			bio_iov_add_page(bio, page, len, offset);
+			/* Calculate the length of folio to be added */
+			len = min_t(size_t, (size_folio - folio_offset), left);
+
+			num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
+			bio_iov_add_page(bio, page, len, offset);
+			/* Skip the pages which got added */
+			i = i + (num_pages - 1);
+		}
 
 		offset = 0;
 	}
When mTHP is enabled, IO can contain larger folios instead of pages. In
such cases add a larger size to the bio instead of looping through
pages. This reduces the overhead of iterating through pages for larger
block sizes. perf diff before and after this change:

Perf diff for write I/O with 128K block size:

    1.22%    -0.97%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

Perf diff for read I/O with 128K block size:

    4.13%    -3.26%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)
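To illustrate the folio branch added by the patch, here is a worked
example with hypothetical but representative values (assuming 4K pages,
i.e. PAGE_SHIFT == 12), matching the fio workload quoted earlier where a
folio-aligned 128K buffer is backed by a single 128K folio:

/*
 * First (and only) loop iteration for a folio-aligned 128K I/O backed
 * by one order-5 (128K) folio:
 *
 *   offset       = 0
 *   folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) + offset = 0
 *   size_folio   = folio_size(folio) = 131072
 *   len          = min(size_folio - folio_offset, left) = 131072
 *   num_pages    = DIV_ROUND_UP(offset + len, PAGE_SIZE) = 32
 *
 * A single bio_iov_add_page() call covers the whole folio, i is
 * advanced by num_pages - 1 = 31, and the loop body runs once instead
 * of 32 times.
 */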