[for-next,v3,0/4] fixed-buffer for uring-cmd/passthrough

Message ID: 20220902151657.10766-1-joshi.k@samsung.com

Message

Kanchan Joshi Sept. 2, 2022, 3:16 p.m. UTC
Hi,

Currently uring-cmd lacks the ability to leverage the pre-registered
buffers. This series adds the support in uring-cmd, and plumbs
nvme passthrough to work with it.

Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
in my setup.

Without fixedbufs
*****************
# taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
submitter=0, tid=5256, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
^CExiting on signal
Maximum IOPS=1.85M

With fixedbufs
**************
# taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B1 -O0 -n1 -u1 /dev/ng0n1
submitter=0, tid=5260, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=2.17M, BW=1059MiB/s, IOS/call=32/31
IOPS=2.17M, BW=1057MiB/s, IOS/call=32/32
IOPS=2.16M, BW=1055MiB/s, IOS/call=32/32
^CExiting on signal
Maximum IOPS=2.17M

Patch 1, 3 = prep/infrastructure
Patch 2 = expand io_uring command to use registered-buffers
Patch 4 = expand nvme passthrough to use registered-buffers
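
As a rough sketch of the intended userspace usage (assumptions: the flag and
field names IORING_URING_CMD_FIXED / sqe->uring_cmd_flags follow the uapi
additions in patch 2, and the NVMe command setup below is illustrative only,
with no error handling):

/*
 * Sketch: one fixed-buffer NVMe passthrough read over io_uring.
 * Requires a kernel with this series applied; flag/field names are
 * assumed from the series description and may differ in the final version.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
        struct io_uring_params p = { .flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32 };
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct nvme_uring_cmd *cmd;
        struct iovec iov;
        void *buf;
        int fd = open("/dev/ng0n1", O_RDONLY);

        posix_memalign(&buf, 4096, 4096);
        io_uring_queue_init_params(8, &ring, &p);

        /* register the buffer once; the kernel pins it up front */
        iov.iov_base = buf;
        iov.iov_len = 4096;
        io_uring_register_buffers(&ring, &iov, 1);

        sqe = io_uring_get_sqe(&ring);
        memset(sqe, 0, 2 * sizeof(*sqe));               /* 128B SQE */
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = fd;
        sqe->cmd_op = NVME_URING_CMD_IO;
        sqe->uring_cmd_flags = IORING_URING_CMD_FIXED;  /* use a registered buffer (assumed name) */
        sqe->buf_index = 0;                             /* index into the registered-buffer table */

        /* the passthrough command lives in the SQE's command area */
        cmd = (struct nvme_uring_cmd *)sqe->cmd;
        cmd->opcode = 0x02;                             /* NVMe read */
        cmd->nsid = 1;
        cmd->addr = (__u64)(uintptr_t)buf;              /* must fall inside the registered buffer */
        cmd->data_len = 4096;
        cmd->cdw10 = 0;                                 /* starting LBA, low */
        cmd->cdw11 = 0;                                 /* starting LBA, high */
        cmd->cdw12 = (4096 / 512) - 1;                  /* 0-based LBA count, assuming 512B LBA format */

        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);                 /* cqe->res is the NVMe status, 0 on success */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
}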

This series is prepared on top of:
for-next + iopoll-passthru series [1].
A unified branch is present here:
https://github.com/OpenMPDK/linux/commits/feat/pt_fixedbufs_v3

t/io_uring util with fixedbuf support is here:
https://github.com/joshkan/fio/tree/priv/fb-v3

Changes since v2:
- Kill the new opcode, add a flag instead (Pavel)
- Fix standalone build issue with patch 1 (Pavel)

Changes since v1:
- Fix a naming issue for an exported helper

[1] https://lore.kernel.org/io-uring/20220823161443.49436-1-joshi.k@samsung.com/ 



Anuj Gupta (2):
  io_uring: introduce io_uring_cmd_import_fixed
  io_uring: introduce fixed buffer support for io_uring_cmd

Kanchan Joshi (2):
  block: add helper to map bvec iterator for passthrough
  nvme: wire up fixed buffer support for nvme passthrough

 block/blk-map.c               | 71 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/ioctl.c     | 38 +++++++++++++------
 include/linux/blk-mq.h        |  1 +
 include/linux/io_uring.h      | 11 +++++-
 include/uapi/linux/io_uring.h |  9 +++++
 io_uring/uring_cmd.c          | 29 +++++++++++++-
 6 files changed, 145 insertions(+), 14 deletions(-)

Comments

Jens Axboe Sept. 2, 2022, 4:06 p.m. UTC | #1
On 9/2/22 9:16 AM, Kanchan Joshi wrote:
> Hi,
> 
> Currently uring-cmd lacks the ability to leverage the pre-registered
> buffers. This series adds the support in uring-cmd, and plumbs
> nvme passthrough to work with it.
> 
> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
> in my setup.
> 
> Without fixedbufs
> *****************
> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
> ^CExiting on signal
> Maximum IOPS=1.85M

With the poll support queued up, I ran this one as well. tldr is:

bdev (non pt)	122M IOPS
irq driven	51-52M IOPS
polled		71M IOPS
polled+fixed	78M IOPS

Looking at profiles, it looks like the bio is still being allocated
and freed and not dipping into the alloc cache, which is using a
substantial amount of CPU. I'll poke a bit and see what's going on...
Jens Axboe Sept. 2, 2022, 4:32 p.m. UTC | #2
On 9/2/22 10:06 AM, Jens Axboe wrote:
> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>> Hi,
>>
>> Currently uring-cmd lacks the ability to leverage the pre-registered
>> buffers. This series adds the support in uring-cmd, and plumbs
>> nvme passthrough to work with it.
>>
>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>> in my setup.
>>
>> Without fixedbufs
>> *****************
>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>> ^CExiting on signal
>> Maximum IOPS=1.85M
> 
> With the poll support queued up, I ran this one as well. tldr is:
> 
> bdev (non pt)	122M IOPS
> irq driven	51-52M IOPS
> polled		71M IOPS
> polled+fixed	78M IOPS
> 
> Looking at profiles, it looks like the bio is still being allocated
> and freed and not dipping into the alloc cache, which is using a
> substantial amount of CPU. I'll poke a bit and see what's going on...

It's using the fs_bio_set, and that doesn't have the PERCPU alloc cache
enabled. With the below, we then do:

polled+fixed	82M

I suspect the remainder is due to the lack of batching on the request
freeing side, at least some of it. Haven't really looked deeper yet.

One issue I saw - try and use passthrough polling without having any
poll queues defined and it'll stall just spinning on completions. You
need to ensure that these are processed as well - look at how the
non-passthrough io_uring poll path handles it.


diff --git a/block/bio.c b/block/bio.c
index 3d3a2678fea2..cba6b1c02eb8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1754,7 +1754,7 @@ static int __init init_bio(void)
 	cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
 					bio_cpu_dead);
 
-	if (bioset_init(&fs_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS))
+	if (bioset_init(&fs_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE))
 		panic("bio: can't allocate bios\n");
 
 	if (bioset_integrity_create(&fs_bio_set, BIO_POOL_SIZE))
Kanchan Joshi Sept. 2, 2022, 6:46 p.m. UTC | #3
On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>On 9/2/22 10:06 AM, Jens Axboe wrote:
>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>> Hi,
>>>
>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>> buffers. This series adds the support in uring-cmd, and plumbs
>>> nvme passthrough to work with it.
>>>
>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>> in my setup.
>>>
>>> Without fixedbufs
>>> *****************
>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>> ^CExiting on signal
>>> Maximum IOPS=1.85M
>>
>> With the poll support queued up, I ran this one as well. tldr is:
>>
>> bdev (non pt)	122M IOPS
>> irq driven	51-52M IOPS
>> polled		71M IOPS
>> polled+fixed	78M IOPS

Except the first one, the remaining three entries are for passthru? Somehow I didn't
see that big of a gap. I will try to align my setup in the coming days.

>> Looking at profiles, it looks like the bio is still being allocated
>> and freed and not dipping into the alloc cache, which is using a
>> substantial amount of CPU. I'll poke a bit and see what's going on...
>
>It's using the fs_bio_set, and that doesn't have the PERCPU alloc cache
>enabled. With the below, we then do:

Thanks for the find.

>polled+fixed	82M
>
>I suspect the remainder is due to the lack of batching on the request
>freeing side, at least some of it. Haven't really looked deeper yet.
>
>One issue I saw - try and use passthrough polling without having any
>poll queues defined and it'll stall just spinning on completions. You
>need to ensure that these are processed as well - look at how the
>non-passthrough io_uring poll path handles it.

I had tested this earlier, and it used to run fine. It does not now.
I see that IOs are getting completed: the irq completion arrives in nvme
and triggers task-work based completion (by calling
io_uring_cmd_complete_in_task). But the task work never got called and
therefore no completion happened.

io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add

Seems task work did not get added. Something about newly added
IORING_SETUP_DEFER_TASKRUN changes the scenario.

static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
{
        struct io_uring_task *tctx = req->task->io_uring;
        struct io_ring_ctx *ctx = req->ctx;
        struct llist_node *node;

        if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
                io_req_local_work_add(req);
                return;
        }
	....

To confirm, I commented that out in t/io_uring and it runs fine.
Please see if that changes anything for you? I will try to find the
actual fix tomorrow.

diff --git a/t/io_uring.c b/t/io_uring.c
index d893b7b2..ac5f60e0 100644
--- a/t/io_uring.c
+++ b/t/io_uring.c
@@ -460,7 +460,6 @@ static int io_uring_setup(unsigned entries, struct io_uring_params *p)

        p->flags |= IORING_SETUP_COOP_TASKRUN;
        p->flags |= IORING_SETUP_SINGLE_ISSUER;
-       p->flags |= IORING_SETUP_DEFER_TASKRUN;
 retry:
        ret = syscall(__NR_io_uring_setup, entries, p);
        if (!ret)
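
For reference, the constraint IORING_SETUP_DEFER_TASKRUN adds is that this
deferred completion work only runs when the submitting task itself enters the
kernel to reap events; a minimal liburing sketch of that contract (the NOP is
just a stand-in request, paths and sizes are illustrative):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring_params p = {
                /* DEFER_TASKRUN requires SINGLE_ISSUER */
                .flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN,
        };
        struct io_uring ring;
        struct io_uring_cqe *cqe;

        if (io_uring_queue_init_params(8, &ring, &p) < 0) {
                perror("io_uring_queue_init_params");
                return 1;
        }

        io_uring_prep_nop(io_uring_get_sqe(&ring));

        /*
         * For requests that complete from interrupt context (like the
         * passthrough commands discussed above), the CQE is only posted
         * once the submitting task enters the kernel with a getevents
         * call, e.g. submit_and_wait()/wait_cqe(); merely peeking the CQ
         * ring, or spinning in the kernel without running this deferred
         * work, makes no progress.
         */
        io_uring_submit_and_wait(&ring, 1);
        if (!io_uring_peek_cqe(&ring, &cqe))
                io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
}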
Jens Axboe Sept. 2, 2022, 7:32 p.m. UTC | #4
On 9/2/22 12:46 PM, Kanchan Joshi wrote:
> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>> Hi,
>>>>
>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>> nvme passthrough to work with it.
>>>>
>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>> in my setup.
>>>>
>>>> Without fixedbufs
>>>> *****************
>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>> ^CExiting on signal
>>>> Maximum IOPS=1.85M
>>>
>>> With the poll support queued up, I ran this one as well. tldr is:
>>>
>>> bdev (non pt)	122M IOPS
>>> irq driven	51-52M IOPS
>>> polled		71M IOPS
>>> polled+fixed	78M IOPS
> 
> Except the first one, the remaining three entries are for passthru? Somehow I didn't
> see that big of a gap. I will try to align my setup in the coming days.

Right, sorry it was badly labeled. First one is bdev with polling,
registered buffers, etc. The others are all the passthrough mode. polled
goes to 74M with the caching fix, so it's about a 74M -> 82M bump using
registered buffers with passthrough and polling.

>> polled+fixed	82M
>>
>> I suspect the remainder is due to the lack of batching on the request
>> freeing side, at least some of it. Haven't really looked deeper yet.
>>
>> One issue I saw - try and use passthrough polling without having any
>> poll queues defined and it'll stall just spinning on completions. You
>> need to ensure that these are processed as well - look at how the
>> non-passthrough io_uring poll path handles it.
> 
> I had tested this earlier, and it used to run fine. It does not now.
> I see that IOs are getting completed: the irq completion arrives in nvme
> and triggers task-work based completion (by calling
> io_uring_cmd_complete_in_task). But the task work never got called and
> therefore no completion happened.
> 
> io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add
> 
> Seems task work did not get added. Something about newly added
> IORING_SETUP_DEFER_TASKRUN changes the scenario.
> 
> static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
> {
>         struct io_uring_task *tctx = req->task->io_uring;
>         struct io_ring_ctx *ctx = req->ctx;
>         struct llist_node *node;
> 
>         if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>                 io_req_local_work_add(req);
>                 return;
>         }
> 	....
> 
> To confirm, I commented that out in t/io_uring and it runs fine.
> Please see if that changes anything for you? I will try to find the
> actual fix tomorrow.

Ah gotcha, yes that actually makes a lot of sense. I wonder if regular
polling is then also broken without poll queues if
IORING_SETUP_DEFER_TASKRUN is set. It should be, I'll check into
io_iopoll_check().
Jens Axboe Sept. 2, 2022, 9:25 p.m. UTC | #5
On 9/2/22 1:32 PM, Jens Axboe wrote:
> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>> Hi,
>>>>>
>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>> nvme passthrough to work with it.
>>>>>
>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>> in my setup.
>>>>>
>>>>> Without fixedbufs
>>>>> *****************
>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>> ^CExiting on signal
>>>>> Maximum IOPS=1.85M
>>>>
>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>
>>>> bdev (non pt)	122M IOPS
>>>> irq driven	51-52M IOPS
>>>> polled		71M IOPS
>>>> polled+fixed	78M IOPS
>>
>> Except the first one, the remaining three entries are for passthru? Somehow I didn't
>> see that big of a gap. I will try to align my setup in the coming days.
> 
> Right, sorry it was badly labeled. First one is bdev with polling,
> registered buffers, etc. The others are all the passthrough mode. polled
> goes to 74M with the caching fix, so it's about a 74M -> 82M bump using
> registered buffers with passthrough and polling.
> 
>>> polled+fixed	82M
>>>
>>> I suspect the remainder is due to the lack of batching on the request
>>> freeing side, at least some of it. Haven't really looked deeper yet.
>>>
>>> One issue I saw - try and use passthrough polling without having any
>>> poll queues defined and it'll stall just spinning on completions. You
>>> need to ensure that these are processed as well - look at how the
>>> non-passthrough io_uring poll path handles it.
>>
>> I had tested this earlier, and it used to run fine. It does not now.
>> I see that IOs are getting completed: the irq completion arrives in nvme
>> and triggers task-work based completion (by calling
>> io_uring_cmd_complete_in_task). But the task work never got called and
>> therefore no completion happened.
>>
>> io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add
>>
>> Seems task work did not get added. Something about newly added
>> IORING_SETUP_DEFER_TASKRUN changes the scenario.
>>
>> static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
>> {
>>         struct io_uring_task *tctx = req->task->io_uring;
>>         struct io_ring_ctx *ctx = req->ctx;
>>         struct llist_node *node;
>>
>>         if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>>                 io_req_local_work_add(req);
>>                 return;
>>         }
>> 	....
>>
>> To confirm, I commented that out in t/io_uring and it runs fine.
>> Please see if that changes anything for you? I will try to find the
>> actual fix tomorrow.
> 
> Ah gotcha, yes that actually makes a lot of sense. I wonder if regular
> polling is then also broken without poll queues if
> IORING_SETUP_DEFER_TASKRUN is set. It should be, I'll check into
> io_iopoll_check().

A mix of fixes and just cleanups, here's what I got.
Kanchan Joshi Sept. 3, 2022, 9:34 a.m. UTC | #6
On Fri, Sep 02, 2022 at 03:25:33PM -0600, Jens Axboe wrote:
>On 9/2/22 1:32 PM, Jens Axboe wrote:
>> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>>> nvme passthrough to work with it.
>>>>>>
>>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>>> in my setup.
>>>>>>
>>>>>> Without fixedbufs
>>>>>> *****************
>>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>>> ^CExiting on signal
>>>>>> Maximum IOPS=1.85M
>>>>>
>>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>>
>>>>> bdev (non pt)	122M IOPS
>>>>> irq driven	51-52M IOPS
>>>>> polled		71M IOPS
>>>>> polled+fixed	78M IOPS
>>>
>>> Except the first one, the remaining three entries are for passthru? Somehow I didn't
>>> see that big of a gap. I will try to align my setup in the coming days.
>>
>> Right, sorry it was badly labeled. First one is bdev with polling,
>> registered buffers, etc. The others are all the passthrough mode. polled
>> goes to 74M with the caching fix, so it's about a 74M -> 82M bump using
>> registered buffers with passthrough and polling.
>>
>>>> polled+fixed	82M
>>>>
>>>> I suspect the remainder is due to the lack of batching on the request
>>>> freeing side, at least some of it. Haven't really looked deeper yet.
>>>>
>>>> One issue I saw - try and use passthrough polling without having any
>>>> poll queues defined and it'll stall just spinning on completions. You
>>>> need to ensure that these are processed as well - look at how the
>>>> non-passthrough io_uring poll path handles it.
>>>
>>> I had tested this earlier, and it used to run fine. It does not now.
>>> I see that IOs are getting completed: the irq completion arrives in nvme
>>> and triggers task-work based completion (by calling
>>> io_uring_cmd_complete_in_task). But the task work never got called and
>>> therefore no completion happened.
>>>
>>> io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add
>>>
>>> Seems task work did not get added. Something about newly added
>>> IORING_SETUP_DEFER_TASKRUN changes the scenario.
>>>
>>> static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
>>> {
>>>         struct io_uring_task *tctx = req->task->io_uring;
>>>         struct io_ring_ctx *ctx = req->ctx;
>>>         struct llist_node *node;
>>>
>>>         if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>>>                 io_req_local_work_add(req);
>>>                 return;
>>>         }
>>> 	....
>>>
>>> To confirm, I commented that out in t/io_uring and it runs fine.
>>> Please see if that changes anything for you? I will try to find the
>>> actual fix tomorrow.
>>
>> Ah gotcha, yes that actually makes a lot of sense. I wonder if regular
>> polling is then also broken without poll queues if
>> IORING_SETUP_DEFER_TASKRUN is set. It should be, I'll check into
>> io_iopoll_check().
>
>A mix of fixes and just cleanups, here's what I got.

Thanks, this looks much better. Just something to discuss on the fix
though. Will use other thread for that.
Jens Axboe Sept. 3, 2022, 5 p.m. UTC | #7
On 9/2/22 3:25 PM, Jens Axboe wrote:
> On 9/2/22 1:32 PM, Jens Axboe wrote:
>> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>>> nvme passthrough to work with it.
>>>>>>
>>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>>> in my setup.
>>>>>>
>>>>>> Without fixedbufs
>>>>>> *****************
>>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>>> ^CExiting on signal
>>>>>> Maximum IOPS=1.85M
>>>>>
>>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>>
>>>>> bdev (non pt)	122M IOPS
>>>>> irq driven	51-52M IOPS
>>>>> polled		71M IOPS
>>>>> polled+fixed	78M IOPS

Followup on this, since t/io_uring didn't correctly detect NUMA nodes
for passthrough.

With the current tree and the patchset I just sent for iopoll and the
caching fix that's in the block tree, here's the final score:

polled+fixed passthrough	105M IOPS

which is getting pretty close to the bdev polled fixed path as well.
I think that is starting to look pretty good!

[...]
submitter=22, tid=4768, file=/dev/ng22n1, node=8
submitter=23, tid=4769, file=/dev/ng23n1, node=8
polled=1, fixedbufs=1/0, register_files=1, buffered=1, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=102.51M, BW=50.05GiB/s, IOS/call=32/31
IOPS=105.29M, BW=51.41GiB/s, IOS/call=31/32
IOPS=105.34M, BW=51.43GiB/s, IOS/call=32/31
IOPS=105.37M, BW=51.45GiB/s, IOS/call=32/32
IOPS=105.37M, BW=51.45GiB/s, IOS/call=31/31
IOPS=105.38M, BW=51.45GiB/s, IOS/call=31/31
IOPS=105.35M, BW=51.44GiB/s, IOS/call=32/32
IOPS=105.49M, BW=51.51GiB/s, IOS/call=32/31
^CExiting on signal
Maximum IOPS=105.49M
Kanchan Joshi Sept. 4, 2022, 5:01 p.m. UTC | #8
On Sat, Sep 03, 2022 at 11:00:43AM -0600, Jens Axboe wrote:
>On 9/2/22 3:25 PM, Jens Axboe wrote:
>> On 9/2/22 1:32 PM, Jens Axboe wrote:
>>> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>>>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>>>> nvme passthrough to work with it.
>>>>>>>
>>>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>>>> in my setup.
>>>>>>>
>>>>>>> Without fixedbufs
>>>>>>> *****************
>>>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>>>> ^CExiting on signal
>>>>>>> Maximum IOPS=1.85M
>>>>>>
>>>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>>>
>>>>>> bdev (non pt)	122M IOPS
>>>>>> irq driven	51-52M IOPS
>>>>>> polled		71M IOPS
>>>>>> polled+fixed	78M IOPS
>
>Followup on this, since t/io_uring didn't correctly detect NUMA nodes
>for passthrough.
>
>With the current tree and the patchset I just sent for iopoll and the
>caching fix that's in the block tree, here's the final score:
>
>polled+fixed passthrough	105M IOPS
>
>which is getting pretty close to the bdev polled fixed path as well.
>I think that is starting to look pretty good!
Great! In my setup (single disk/numa-node), current kernel shows-

Block MIOPS
***********
command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -P1 -n1 /dev/nvme0n1
plain: 1.52
plain+fb: 1.77
plain+poll: 2.23
plain+fb+poll: 2.61

Passthru MIOPS
**************
command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -P1 -u1 -n1 /dev/ng0n1
plain: 1.78
plain+fb: 2.08
plain+poll: 2.21
plain+fb+poll: 2.69
Jens Axboe Sept. 4, 2022, 8:17 p.m. UTC | #9
On 9/4/22 11:01 AM, Kanchan Joshi wrote:
> On Sat, Sep 03, 2022 at 11:00:43AM -0600, Jens Axboe wrote:
>> On 9/2/22 3:25 PM, Jens Axboe wrote:
>>> On 9/2/22 1:32 PM, Jens Axboe wrote:
>>>> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>>>>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>>>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>>>>> nvme passthrough to work with it.
>>>>>>>>
>>>>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>>>>> in my setup.
>>>>>>>>
>>>>>>>> Without fixedbufs
>>>>>>>> *****************
>>>>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>>>>> ^CExiting on signal
>>>>>>>> Maximum IOPS=1.85M
>>>>>>>
>>>>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>>>>
>>>>>>> bdev (non pt)	122M IOPS
>>>>>>> irq driven	51-52M IOPS
>>>>>>> polled		71M IOPS
>>>>>>> polled+fixed	78M IOPS
>>
>> Followup on this, since t/io_uring didn't correctly detect NUMA nodes
>> for passthrough.
>>
>> With the current tree and the patchset I just sent for iopoll and the
>> caching fix that's in the block tree, here's the final score:
>>
>> polled+fixed passthrough	105M IOPS
>>
>> which is getting pretty close to the bdev polled fixed path as well.
>> I think that is starting to look pretty good!
> Great! In my setup (single disk/numa-node), current kernel shows-
> 
> Block MIOPS
> ***********
> command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -P1 -n1 /dev/nvme0n1
> plain: 1.52
> plain+fb: 1.77
> plain+poll: 2.23
> plain+fb+poll: 2.61
> 
> Passthru MIOPS
> **************
> command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -P1 -u1 -n1 /dev/ng0n1
> plain: 1.78
> plain+fb: 2.08
> plain+poll: 2.21
> plain+fb+poll: 2.69

Interesting, here's what I have:

Block MIOPS
============
plain: 2.90
plain+fb: 3.0
plain+poll: 4.04
plain+fb+poll: 5.09	

Passthru MIOPS
=============
plain: 2.37
plain+fb: 2.84
plain+poll: 3.65
plain+fb+poll: 4.93

This is a gen2 optane, it maxes out at right around 5.1M IOPS. Note that
I have disabled iostats and merges generally in my runs:

echo 0 > /sys/block/nvme0n1/queue/iostats
echo 2 > /sys/block/nvme0n1/queue/nomerges

which will impact block more than passthru obviously, particularly
the nomerges. iostats should have a similar impact on both of them (but
I haven't tested either of those without those disabled).
Kanchan Joshi Sept. 5, 2022, 5:52 a.m. UTC | #10
On Sun, Sep 04, 2022 at 02:17:33PM -0600, Jens Axboe wrote:
>On 9/4/22 11:01 AM, Kanchan Joshi wrote:
>> On Sat, Sep 03, 2022 at 11:00:43AM -0600, Jens Axboe wrote:
>>> On 9/2/22 3:25 PM, Jens Axboe wrote:
>>>> On 9/2/22 1:32 PM, Jens Axboe wrote:
>>>>> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>>>>>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>>>>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>>>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>>>>>> nvme passthrough to work with it.
>>>>>>>>>
>>>>>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>>>>>> in my setup.
>>>>>>>>>
>>>>>>>>> Without fixedbufs
>>>>>>>>> *****************
>>>>>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>>>>>> ^CExiting on signal
>>>>>>>>> Maximum IOPS=1.85M
>>>>>>>>
>>>>>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>>>>>
>>>>>>>> bdev (non pt)	122M IOPS
>>>>>>>> irq driven	51-52M IOPS
>>>>>>>> polled		71M IOPS
>>>>>>>> polled+fixed	78M IOPS
>>>
>>> Followup on this, since t/io_uring didn't correctly detect NUMA nodes
>>> for passthrough.
>>>
>>> With the current tree and the patchset I just sent for iopoll and the
>>> caching fix that's in the block tree, here's the final score:
>>>
>>> polled+fixed passthrough	105M IOPS
>>>
>>> which is getting pretty close to the bdev polled fixed path as well.
>>> I think that is starting to look pretty good!
>> Great! In my setup (single disk/numa-node), current kernel shows-
>>
>> Block MIOPS
>> ***********
>> command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -P1 -n1 /dev/nvme0n1
>> plain: 1.52
>> plain+fb: 1.77
>> plain+poll: 2.23
>> plain+fb+poll: 2.61
>>
>> Passthru MIOPS
>> **************
>> command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -P1 -u1 -n1 /dev/ng0n1
>> plain: 1.78
>> plain+fb: 2.08
>> plain+poll: 2.21
>> plain+fb+poll: 2.69
>
>Interesting, here's what I have:
>
>Block MIOPS
>============
>plain: 2.90
>plain+fb: 3.0
>plain+poll: 4.04
>plain+fb+poll: 5.09	
>
>Passthru MIPS
>=============
>plain: 2.37
>plain+fb: 2.84
>plain+poll: 3.65
>plain+fb+poll: 4.93
>
>This is a gen2 optane
Same. Do you see the same 'FW Rev' as below?

# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          PHAL11730018400AGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB    512   B +  0 B   L0310200


>, it maxes out at right around 5.1M IOPS. Note that
>I have disabled iostats and merges generally in my runs:
>
>echo 0 > /sys/block/nvme0n1/queue/iostats
>echo 2 > /sys/block/nvme0n1/queue/nomerges
>
>which will impact block more than passthru obviously, particularly
>the nomerges. iostats should have a similar impact on both of them (but
>I haven't tested either of those without those disabled).

A bit of improvement after disabling, but for all entries.

block
=====
plain: 1.6
plain+FB: 1.91
plain+poll: 2.36
plain+FB+poll: 2.85

passthru
========
plain: 1.9
plain+FB: 2.2
plain+poll: 2.4
plain+FB+poll: 2.9

Maybe there is something about my kernel config that prevents it from
reaching the expected peak (i.e. 5.1M). Will check more.
Jens Axboe Sept. 5, 2022, 5:48 p.m. UTC | #11
On 9/4/22 11:52 PM, Kanchan Joshi wrote:
> On Sun, Sep 04, 2022 at 02:17:33PM -0600, Jens Axboe wrote:
>> On 9/4/22 11:01 AM, Kanchan Joshi wrote:
>>> On Sat, Sep 03, 2022 at 11:00:43AM -0600, Jens Axboe wrote:
>>>> On 9/2/22 3:25 PM, Jens Axboe wrote:
>>>>> On 9/2/22 1:32 PM, Jens Axboe wrote:
>>>>>> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>>>>>>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>>>>>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>>>>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>>>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>>>>>>> nvme passthrough to work with it.
>>>>>>>>>>
>>>>>>>>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>>>>>>>>> in my setup.
>>>>>>>>>>
>>>>>>>>>> Without fixedbufs
>>>>>>>>>> *****************
>>>>>>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>>>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>>>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>>>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>>>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>>>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>>>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>>>>>>> ^CExiting on signal
>>>>>>>>>> Maximum IOPS=1.85M
>>>>>>>>>
>>>>>>>>> With the poll support queued up, I ran this one as well. tldr is:
>>>>>>>>>
>>>>>>>>> bdev (non pt)	122M IOPS
>>>>>>>>> irq driven	51-52M IOPS
>>>>>>>>> polled		71M IOPS
>>>>>>>>> polled+fixed	78M IOPS
>>>>
>>>> Followup on this, since t/io_uring didn't correctly detect NUMA nodes
>>>> for passthrough.
>>>>
>>>> With the current tree and the patchset I just sent for iopoll and the
>>>> caching fix that's in the block tree, here's the final score:
>>>>
>>>> polled+fixed passthrough	105M IOPS
>>>>
>>>> which is getting pretty close to the bdev polled fixed path as well.
>>>> I think that is starting to look pretty good!
>>> Great! In my setup (single disk/numa-node), current kernel shows-
>>>
>>> Block MIOPS
>>> ***********
>>> command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -P1 -n1 /dev/nvme0n1
>>> plain: 1.52
>>> plain+fb: 1.77
>>> plain+poll: 2.23
>>> plain+fb+poll: 2.61
>>>
>>> Passthru MIOPS
>>> **************
>>> command:t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -P1 -u1 -n1 /dev/ng0n1
>>> plain: 1.78
>>> plain+fb: 2.08
>>> plain+poll: 2.21
>>> plain+fb+poll: 2.69
>>
>> Interesting, here's what I have:
>>
>> Block MIOPS
>> ============
>> plain: 2.90
>> plain+fb: 3.0
>> plain+poll: 4.04
>> plain+fb+poll: 5.09   
>>
>> Passthru MIOPS
>> =============
>> plain: 2.37
>> plain+fb: 2.84
>> plain+poll: 3.65
>> plain+fb+poll: 4.93
>>
>> This is a gen2 optane
> Same. Do you see the same 'FW Rev' as below?
> 
> # nvme list
> Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev
> --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
> /dev/nvme0n1          PHAL11730018400AGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB    512   B +  0 B   L0310200
> 
> 
>> , it maxes out at right around 5.1M IOPS. Note that
>> I have disabled iostats and merges generally in my runs:
>>
>> echo 0 > /sys/block/nvme0n1/queue/iostats
>> echo 2 > /sys/block/nvme0n1/queue/nomerges
>>
>> which will impact block more than passthru obviously, particularly
>> the nomerges. iostats should have a similar impact on both of them (but
>> I haven't tested either of those without those disabled).
> 
> A bit of improvement after disabling, but for all entries.
> 
> block
> =====
> plain: 1.6
> plain+FB: 1.91
> plain+poll: 2.36
> plain+FB+poll: 2.85
> 
> passthru
> ========
> plain: 1.9
> plain+FB: 2.2
> plain+poll: 2.4
> plain+FB+poll: 2.9
> 
> Maybe there is something about my kernel config that prevents it from
> reaching the expected peak (i.e. 5.1M). Will check more.

Here's the config I use for this kind of testing, in case it's useful.