[0/4] iopoll support for io_uring/nvme passthrough

Message ID 20220805154226.155008-1-joshi.k@samsung.com (mailing list archive)

Message

Kanchan Joshi Aug. 5, 2022, 3:42 p.m. UTC
Hi,

This series enables async polling on io_uring commands, and nvme passthrough
(for I/O commands) is wired up to leverage that.

512b randread performance (KIOPS) below:

QD_batch    block    passthru    passthru-poll   block-poll
1_1          80        81          158            157
8_2         406       470          680            700
16_4        620       656          931            920
128_32      879       1056        1120            1132

Polling gives a clear win here.
Upstream fio was used for testing.

passthru command line:
fio -iodepth=64 -rw=randread -ioengine=io_uring_cmd -bs=512 -numjobs=1
-runtime=60 -group_reporting -iodepth_batch_submit=16
-iodepth_batch_complete_min=1 -iodepth_batch_complete_max=16
-cmd_type=nvme -hipri=0 -filename=/dev/ng1n1 -name=io_uring_cmd_64

block command line:
fio -direct=1 -iodepth=64 -rw=randread -ioengine=io_uring -bs=512
-numjobs=1 -runtime=60 -group_reporting -iodepth_batch_submit=16
-iodepth_batch_complete_min=1 -iodepth_batch_complete_max=16
-hipri=0 -filename=/dev/nvme1n1 -name=io_uring_64

A bit of code went into the non-passthrough path for io_uring (patch 2), but I
do not see that causing any performance regression.
A peak-perf test showed 2.3M IOPS with or without this series for
block I/O.

io_uring: Running taskset -c 0,12 t/io_uring -b512 -d128 -c32 -s32 -p1
-F1 -B1 -n2  /dev/nvme0n1
submitter=0, tid=3089, file=/dev/nvme0n1, node=-1
submitter=1, tid=3090, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=2.31M, BW=1126MiB/s, IOS/call=31/31
IOPS=2.30M, BW=1124MiB/s, IOS/call=32/31
IOPS=2.30M, BW=1123MiB/s, IOS/call=32/32 

Kanchan Joshi (4):
  fs: add file_operations->uring_cmd_iopoll
  io_uring: add iopoll infrastructure for io_uring_cmd
  block: export blk_rq_is_poll
  nvme: wire up async polling for io passthrough commands

 block/blk-mq.c                |  3 +-
 drivers/nvme/host/core.c      |  1 +
 drivers/nvme/host/ioctl.c     | 73 ++++++++++++++++++++++++++++++++---
 drivers/nvme/host/multipath.c |  1 +
 drivers/nvme/host/nvme.h      |  2 +
 include/linux/blk-mq.h        |  1 +
 include/linux/fs.h            |  1 +
 include/linux/io_uring.h      |  8 +++-
 io_uring/io_uring.c           |  6 +++
 io_uring/opdef.c              |  1 +
 io_uring/rw.c                 |  8 +++-
 io_uring/uring_cmd.c          | 11 +++++-
 12 files changed, 105 insertions(+), 11 deletions(-)
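
For orientation, here is a rough sketch of the shape of the plumbing the
patch titles above describe. This is an illustration only; the exact
signatures, and the ->cookie field used below to stash the bio, are
assumptions that may differ from the actual patches.

/*
 * Patch 1: a new file_operations hook, roughly
 *
 *	int (*uring_cmd_iopoll)(struct io_uring_cmd *ioucmd);
 *
 * which io_uring (patch 2) calls from its iopoll loop for uring_cmd
 * requests. Patch 4 would then implement it for the nvme char device,
 * along these lines:
 */
static int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd)
{
	/*
	 * Assumption: submission stashed the request's bio here, after
	 * checking blk_rq_is_poll() (exported by patch 3).
	 */
	struct bio *bio = READ_ONCE(ioucmd->cookie);

	if (bio && bio->bi_bdev)
		return bio_poll(bio, NULL, 0);
	return 0;
}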

Comments

Jens Axboe Aug. 5, 2022, 5:04 p.m. UTC | #1
On 8/5/22 9:42 AM, Kanchan Joshi wrote:
> Hi,
> 
> Series enables async polling on io_uring command, and nvme passthrough
> (for io-commands) is wired up to leverage that.
> 
> 512b randread performance (KIOP) below:
> 
> QD_batch    block    passthru    passthru-poll   block-poll
> 1_1          80        81          158            157
> 8_2         406       470          680            700
> 16_4        620       656          931            920
> 128_32      879       1056        1120            1132

Curious on why passthru is slower than block-poll? Are we missing
something here?
Kanchan Joshi Aug. 5, 2022, 5:13 p.m. UTC | #2
On Fri, Aug 05, 2022 at 11:04:22AM -0600, Jens Axboe wrote:
>On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>> Hi,
>>
>> Series enables async polling on io_uring command, and nvme passthrough
>> (for io-commands) is wired up to leverage that.
>>
>> 512b randread performance (KIOP) below:
>>
>> QD_batch    block    passthru    passthru-poll   block-poll
>> 1_1          80        81          158            157
>> 8_2         406       470          680            700
>> 16_4        620       656          931            920
>> 128_32      879       1056        1120            1132
>
>Curious on why passthru is slower than block-poll? Are we missing
>something here?
passthru-poll vs block-poll, you mean?
Passthru does not have the bio cache, while the block path is running with that.
Maybe completion batching is also playing some role; I'm not too sure about that
at the moment.
Jens Axboe Aug. 5, 2022, 5:18 p.m. UTC | #3
On 8/5/22 11:04 AM, Jens Axboe wrote:
> On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>> Hi,
>>
>> Series enables async polling on io_uring command, and nvme passthrough
>> (for io-commands) is wired up to leverage that.
>>
>> 512b randread performance (KIOP) below:
>>
>> QD_batch    block    passthru    passthru-poll   block-poll
>> 1_1          80        81          158            157
>> 8_2         406       470          680            700
>> 16_4        620       656          931            920
>> 128_32      879       1056        1120            1132
> 
> Curious on why passthru is slower than block-poll? Are we missing
> something here?

I took a quick peek, running it here. List of items making it slower:

- No fixedbufs support for passthru, each request will go through
  get_user_pages() and put_pages() on completion. This is about a 10%
  change for me, by itself.

- nvme_uring_cmd_io() -> nvme_alloc_user_request() -> blk_rq_map_user()
  -> blk_rq_map_user_iov() -> memset() is another ~4% for me.

- The kmalloc+kfree per command is roughly 9% extra slowdown.

There are other little things, but the above are the main ones. Even if
I disable fixedbufs for non-passthru, passthru is about 24% slower
here using a single device and a single core, which is mostly the
above-mentioned items.

This isn't specific to the iopoll support, that's obviously faster than
IRQ driven for this test case. This is just comparing passthru with
the regular block path for doing random 512b reads.
Jens Axboe Aug. 5, 2022, 5:27 p.m. UTC | #4
On 8/5/22 11:13 AM, Kanchan Joshi wrote:
> On Fri, Aug 05, 2022 at 11:04:22AM -0600, Jens Axboe wrote:
>> On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>>> Hi,
>>>
>>> Series enables async polling on io_uring command, and nvme passthrough
>>> (for io-commands) is wired up to leverage that.
>>>
>>> 512b randread performance (KIOP) below:
>>>
>>> QD_batch    block    passthru    passthru-poll   block-poll
>>> 1_1          80        81          158            157
>>> 8_2         406       470          680            700
>>> 16_4        620       656          931            920
>>> 128_32      879       1056        1120            1132
>>
>> Curious on why passthru is slower than block-poll? Are we missing
>> something here?
> passthru-poll vs block-poll you mean?
> passthru does not have bio-cache, while block path is running with that.
> Maybe completion-batching is also playing some role, not too sure about that
> at the moment.

Yeah, see the other email for a quick rundown. We should make
bio_map_user_iov() use the bio caching; that'd make a big difference.
It won't fully close the gap, but it will be close if we exclude the lack
of fixedbufs.
Keith Busch Aug. 5, 2022, 6:11 p.m. UTC | #5
On Fri, Aug 05, 2022 at 11:18:38AM -0600, Jens Axboe wrote:
> On 8/5/22 11:04 AM, Jens Axboe wrote:
> > On 8/5/22 9:42 AM, Kanchan Joshi wrote:
> >> Hi,
> >>
> >> Series enables async polling on io_uring command, and nvme passthrough
> >> (for io-commands) is wired up to leverage that.
> >>
> >> 512b randread performance (KIOP) below:
> >>
> >> QD_batch    block    passthru    passthru-poll   block-poll
> >> 1_1          80        81          158            157
> >> 8_2         406       470          680            700
> >> 16_4        620       656          931            920
> >> 128_32      879       1056        1120            1132
> > 
> > Curious on why passthru is slower than block-poll? Are we missing
> > something here?
> 
> I took a quick peek, running it here. List of items making it slower:
> 
> - No fixedbufs support for passthru, each each request will go through
>   get_user_pages() and put_pages() on completion. This is about a 10%
>   change for me, by itself.

Enabling fixed buffer support through here looks like it will take a little bit
of work. The driver needs an opcode or flag to tell it the user address is a
fixed buffer, and io_uring needs to export its registered buffer for a driver
like nvme to get to.
 
> - nvme_uring_cmd_io() -> nvme_alloc_user_request() -> blk_rq_map_user()
>   -> blk_rq_map_user_iov() -> memset() is another ~4% for me.

Where's the memset() coming from? That should only happen if we need to bounce,
right? This type of request shouldn't need that unless you're using odd user
address alignment.
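
To make the fixed-buffer point above concrete, a purely hypothetical
sketch follows; the flag, the import helper, and the mapping function are
made-up names for illustration and are not part of this series:

/* made-up flag: the user address refers to a registered (fixed) buffer */
#define NVME_URING_CMD_FIXEDBUF	(1 << 0)

static int nvme_map_user_buf(struct io_uring_cmd *ioucmd, struct request *req,
			     u64 ubuf, unsigned int len, unsigned int flags)
{
	if (flags & NVME_URING_CMD_FIXEDBUF) {
		struct iov_iter iter;
		int ret;

		/*
		 * made-up helper: io_uring resolves the registered buffer
		 * backing ubuf/len into an iov_iter the driver can map,
		 * avoiding get_user_pages() on every request.
		 */
		ret = io_uring_cmd_import_fixed(ubuf, len, rq_data_dir(req),
						&iter, ioucmd);
		if (ret)
			return ret;
		return blk_rq_map_user_iov(req->q, req, NULL, &iter,
					   GFP_KERNEL);
	}
	/* existing path: pins pages per request via get_user_pages() */
	return blk_rq_map_user(req->q, req, NULL, u64_to_user_ptr(ubuf), len,
			       GFP_KERNEL);
}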
Jens Axboe Aug. 5, 2022, 6:15 p.m. UTC | #6
On 8/5/22 12:11 PM, Keith Busch wrote:
> On Fri, Aug 05, 2022 at 11:18:38AM -0600, Jens Axboe wrote:
>> On 8/5/22 11:04 AM, Jens Axboe wrote:
>>> On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>>>> Hi,
>>>>
>>>> Series enables async polling on io_uring command, and nvme passthrough
>>>> (for io-commands) is wired up to leverage that.
>>>>
>>>> 512b randread performance (KIOP) below:
>>>>
>>>> QD_batch    block    passthru    passthru-poll   block-poll
>>>> 1_1          80        81          158            157
>>>> 8_2         406       470          680            700
>>>> 16_4        620       656          931            920
>>>> 128_32      879       1056        1120            1132
>>>
>>> Curious on why passthru is slower than block-poll? Are we missing
>>> something here?
>>
>> I took a quick peek, running it here. List of items making it slower:
>>
>> - No fixedbufs support for passthru, each each request will go through
>>   get_user_pages() and put_pages() on completion. This is about a 10%
>>   change for me, by itself.
> 
> Enabling fixed buffer support through here looks like it will take a
> little bit of work. The driver needs an opcode or flag to tell it the
> user address is a fixed buffer, and io_uring needs to export its
> registered buffer for a driver like nvme to get to.

Yeah, it's not a straightforward thing. But if this will be used with
recycled buffers, then it'll definitely be worthwhile to look into.

>> - nvme_uring_cmd_io() -> nvme_alloc_user_request() -> blk_rq_map_user()
>>   -> blk_rq_map_user_iov() -> memset() is another ~4% for me.
> 
> Where's the memset() coming from? That should only happen if we need
> to bounce, right? This type of request shouldn't need that unless
> you're using odd user address alignment.

Not sure, need to double check! I'm hacking up a patch to get rid of the
frivolous alloc+free; we'll see how it stands after that and I'll find
it.
Jens Axboe Aug. 5, 2022, 6:21 p.m. UTC | #7
On 8/5/22 11:18 AM, Jens Axboe wrote:
> On 8/5/22 11:04 AM, Jens Axboe wrote:
>> On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>>> Hi,
>>>
>>> Series enables async polling on io_uring command, and nvme passthrough
>>> (for io-commands) is wired up to leverage that.
>>>
>>> 512b randread performance (KIOP) below:
>>>
>>> QD_batch    block    passthru    passthru-poll   block-poll
>>> 1_1          80        81          158            157
>>> 8_2         406       470          680            700
>>> 16_4        620       656          931            920
>>> 128_32      879       1056        1120            1132
>>
>> Curious on why passthru is slower than block-poll? Are we missing
>> something here?
> 
> I took a quick peek, running it here. List of items making it slower:
> 
> - No fixedbufs support for passthru, each each request will go through
>   get_user_pages() and put_pages() on completion. This is about a 10%
>   change for me, by itself.
> 
> - nvme_uring_cmd_io() -> nvme_alloc_user_request() -> blk_rq_map_user()
>   -> blk_rq_map_user_iov() -> memset() is another ~4% for me.
> 
> - The kmalloc+kfree per command is roughly 9% extra slowdown.
> 
> There are other little things, but the above are the main ones. Even if
> I disable fixedbufs for non-passthru, passthru is about ~24% slower
> here using a single device and a single core, which is mostly the above
> mentioned items.
> 
> This isn't specific to the iopoll support, that's obviously faster than
> IRQ driven for this test case. This is just comparing passthru with
> the regular block path for doing random 512b reads.

Here's a hack that gets rid of the page array alloc+free for smaller vec
ranges, and uses the bio cache for polled IO too.

This reclaims about 14% of the 24% compared to block-iopoll; in
particular, for this run it brings IOPS from ~1815K to ~2110K for
passthru-polled.

We also don't seem to be taking advantage of request completion
batching or tag batch allocation. Outside of that, it looks like just
some generic block bits that need fixing up (and the attached patch,
which needs some cleanup, gets us most of the way there), so nothing we
can't get sorted out.

Keith, the memset() seems to be tied to the allocations fixed in this
patch; it's gone now as well.

diff --git a/block/blk-map.c b/block/blk-map.c
index df8b066cd548..8861b89e15a8 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -157,10 +157,8 @@ static int bio_copy_user_iov(struct request *rq, struct rq_map_data *map_data,
 		goto out_bmd;
 	bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, req_op(rq));
 
-	if (map_data) {
-		nr_pages = 1 << map_data->page_order;
+	if (map_data)
 		i = map_data->offset / PAGE_SIZE;
-	}
 	while (len) {
 		unsigned int bytes = PAGE_SIZE;
 
@@ -232,7 +230,7 @@ static int bio_copy_user_iov(struct request *rq, struct rq_map_data *map_data,
 }
 
 static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
-		gfp_t gfp_mask)
+			    gfp_t gfp_mask)
 {
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
@@ -243,18 +241,34 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	if (!iov_iter_count(iter))
 		return -EINVAL;
 
-	bio = bio_kmalloc(nr_vecs, gfp_mask);
-	if (!bio)
-		return -ENOMEM;
-	bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
+	if (rq->cmd_flags & REQ_POLLED) {
+		blk_opf_t opf = rq->cmd_flags | REQ_ALLOC_CACHE;
+
+		bio = bio_alloc_bioset(NULL, nr_vecs, opf, gfp_mask,
+					&fs_bio_set);
+		if (!bio)
+			return -ENOMEM;
+	} else {
+		bio = bio_kmalloc(nr_vecs, gfp_mask);
+		if (!bio)
+			return -ENOMEM;
+		bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
+	}
 
 	while (iov_iter_count(iter)) {
-		struct page **pages;
+		struct page **pages, *stack_pages[8];
 		ssize_t bytes;
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		if (nr_vecs < ARRAY_SIZE(stack_pages)) {
+			pages = stack_pages;
+			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
+							nr_vecs, &offs);
+		} else {
+			bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX,
+							&offs);
+		}
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -291,7 +305,8 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 */
 		while (j < npages)
 			put_page(pages[j++]);
-		kvfree(pages);
+		if (pages != stack_pages)
+			kvfree(pages);
 		/* couldn't stuff something into bio? */
 		if (bytes)
 			break;
@@ -304,8 +319,12 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 
  out_unmap:
 	bio_release_pages(bio, false);
-	bio_uninit(bio);
-	kfree(bio);
+	if (bio->bi_opf & REQ_ALLOC_CACHE) {
+		bio_put(bio);
+	} else {
+		bio_uninit(bio);
+		kfree(bio);
+	}
 	return ret;
 }
 
@@ -325,8 +344,12 @@ static void bio_invalidate_vmalloc_pages(struct bio *bio)
 static void bio_map_kern_endio(struct bio *bio)
 {
 	bio_invalidate_vmalloc_pages(bio);
-	bio_uninit(bio);
-	kfree(bio);
+	if (bio->bi_opf & REQ_ALLOC_CACHE) {
+		bio_put(bio);
+	} else {
+		bio_uninit(bio);
+		kfree(bio);
+	}
 }
 
 /**
@@ -610,8 +633,12 @@ int blk_rq_unmap_user(struct bio *bio)
 
 		next_bio = bio;
 		bio = bio->bi_next;
-		bio_uninit(next_bio);
-		kfree(next_bio);
+		if (next_bio->bi_opf & REQ_ALLOC_CACHE) {
+			bio_put(next_bio);
+		} else {
+			bio_uninit(next_bio);
+			kfree(next_bio);
+		}
 	}
 
 	return ret;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8f841caaa4cb..ce2f4b8dc0d9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -964,11 +964,10 @@ blk_status_t blk_insert_cloned_request(struct request *rq);
 
 struct rq_map_data {
 	struct page **pages;
-	int page_order;
 	int nr_entries;
 	unsigned long offset;
-	int null_mapped;
-	int from_user;
+	bool null_mapped;
+	bool from_user;
 };
 
 int blk_rq_map_user(struct request_queue *, struct request *,
Kanchan Joshi Aug. 7, 2022, 5:58 p.m. UTC | #8
On Fri, Aug 05, 2022 at 12:15:24PM -0600, Jens Axboe wrote:
>On 8/5/22 12:11 PM, Keith Busch wrote:
>> On Fri, Aug 05, 2022 at 11:18:38AM -0600, Jens Axboe wrote:
>>> On 8/5/22 11:04 AM, Jens Axboe wrote:
>>>> On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>>>>> Hi,
>>>>>
>>>>> Series enables async polling on io_uring command, and nvme passthrough
>>>>> (for io-commands) is wired up to leverage that.
>>>>>
>>>>> 512b randread performance (KIOP) below:
>>>>>
>>>>> QD_batch    block    passthru    passthru-poll   block-poll
>>>>> 1_1          80        81          158            157
>>>>> 8_2         406       470          680            700
>>>>> 16_4        620       656          931            920
>>>>> 128_32      879       1056        1120            1132
>>>>
>>>> Curious on why passthru is slower than block-poll? Are we missing
>>>> something here?
>>>
>>> I took a quick peek, running it here. List of items making it slower:
>>>
>>> - No fixedbufs support for passthru, each each request will go through
>>>   get_user_pages() and put_pages() on completion. This is about a 10%
>>>   change for me, by itself.
>>
>> Enabling fixed buffer support through here looks like it will take a
>> little bit of work. The driver needs an opcode or flag to tell it the
>> user address is a fixed buffer, and io_uring needs to export its
>> registered buffer for a driver like nvme to get to.
>
>Yeah, it's not a straight forward thing. But if this will be used with
>recycled buffers, then it'll definitely be worthwhile to look into.

I had posted bio-cache and fixedbufs in the initial round but retracted
them to get the foundation settled first.
https://lore.kernel.org/linux-nvme/20220308152105.309618-1-joshi.k@samsung.com/

I see that you brought back bio-cache already. I can refresh fixedbufs.
Completion batching seems too tightly coupled to the block path.
Jens Axboe Aug. 7, 2022, 6:46 p.m. UTC | #9
On 8/7/22 11:58 AM, Kanchan Joshi wrote:
> On Fri, Aug 05, 2022 at 12:15:24PM -0600, Jens Axboe wrote:
>> On 8/5/22 12:11 PM, Keith Busch wrote:
>>> On Fri, Aug 05, 2022 at 11:18:38AM -0600, Jens Axboe wrote:
>>>> On 8/5/22 11:04 AM, Jens Axboe wrote:
>>>>> On 8/5/22 9:42 AM, Kanchan Joshi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Series enables async polling on io_uring command, and nvme passthrough
>>>>>> (for io-commands) is wired up to leverage that.
>>>>>>
>>>>>> 512b randread performance (KIOP) below:
>>>>>>
>>>>>> QD_batch    block    passthru    passthru-poll   block-poll
>>>>>> 1_1          80        81          158            157
>>>>>> 8_2         406       470          680            700
>>>>>> 16_4        620       656          931            920
>>>>>> 128_32      879       1056        1120            1132
>>>>>
>>>>> Curious on why passthru is slower than block-poll? Are we missing
>>>>> something here?
>>>>
>>>> I took a quick peek, running it here. List of items making it slower:
>>>>
>>>> - No fixedbufs support for passthru, each each request will go through
>>>>   get_user_pages() and put_pages() on completion. This is about a 10%
>>>>   change for me, by itself.
>>>
>>> Enabling fixed buffer support through here looks like it will take a
>>> little bit of work. The driver needs an opcode or flag to tell it the
>>> user address is a fixed buffer, and io_uring needs to export its
>>> registered buffer for a driver like nvme to get to.
>>
>> Yeah, it's not a straight forward thing. But if this will be used with
>> recycled buffers, then it'll definitely be worthwhile to look into.
> 
> Had posted bio-cache and fixedbufs in the initial round but retracted
> to get the foundation settled first.
> https://lore.kernel.org/linux-nvme/20220308152105.309618-1-joshi.k@samsung.com/
> 
> I see that you brought back bio-cache already. I can refresh fixedbufs.

Excellent, yes please bring back the fixedbufs. It's a 5-10% win,
nothing to sneeze at.

> Completion-batching seems too tightly coupled to block-path.

It's really not; in fact it'd be even simpler to do for passthru. The
rq->end_io handler just needs to know about it.
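
For illustration, a hedged sketch of that direction: a batch-aware
completion callback on the passthrough side, which a polling loop could
invoke once for a whole list of requests. The function name and its
wiring into the poll loop are assumptions, not code from this series:

static void nvme_uring_batched_complete(struct io_comp_batch *iob)
{
	struct request *req;

	/* post a CQE for every polled passthrough command in the batch */
	rq_list_for_each(&iob->req_list, req) {
		struct io_uring_cmd *ioucmd = req->end_io_data;

		io_uring_cmd_done(ioucmd, nvme_req(req)->status,
				  le64_to_cpu(nvme_req(req)->result.u64));
	}
	/*
	 * End and free all requests in one go; in practice, unmapping the
	 * user pages may still need task context and is ignored in this
	 * sketch.
	 */
	blk_mq_end_request_batch(iob);
}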