Message ID: 20250211005646.222452-1-kbusch@meta.com (mailing list archive)
Series: ublk zero-copy support
On Mon, Feb 10, 2025 at 04:56:40PM -0800, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Previous version was discussed here:
>
> https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
>
> The same ublksrv reference code in that link was used to test the kernel
> side changes.
>
> Before listing what has changed, I want to mention what is the same: the
> reliance on the ring ctx lock to serialize the register ahead of any
> use. I'm not ignoring the feedback; I just don't have a solid answer
> right now, and want to progress on the other fronts in the meantime.

It is explained in the following links:

https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/

- The node kbuffer is registered in the ublk uring_cmd's ->issue(), but
  looked up in the RW_FIXED op's ->prep(), and ->prep() is always called
  before ->issue() when the two are submitted in the same
  io_uring_enter(), so you need to move io_rsrc_node_lookup() and the
  buffer import from RW_FIXED's ->prep() to ->issue() first.

- Secondly, ->issue() order is only respected by IO_LINK, and io_uring
  can't provide such a guarantee without using IO_LINK; Pavel explained
  it in the following link:

  https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/

There are also other examples: if the register-buffer op sits in one
link chain and the consumer op isn't in that chain, the consumer op can
still be issued before the register_buffer is issued.

Thanks,
Ming
On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> It is explained in the following links:
>
> https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
>
> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>   ->issue() when the two are submitted in same io_uring_enter(), so you
>   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's
>   ->prep() to ->issue() first.

I don't think that's accurate, at least in practice. In a normal flow,
we'll have this sequence:

  io_submit_sqes
    io_submit_sqe (uring_cmd ublk register)
      io_init_req
        ->prep()
      io_queue_sqe
        ->issue()
    io_submit_sqe (read/write_fixed)
      io_init_req
        ->prep()
      io_queue_sqe
        ->issue()

The first SQE is handled in its entirety before even looking at the
subsequent SQE. Since the register is first, the read/write_fixed's prep
will have a valid index. Testing this patch series appears to show this
reliably works.

> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>   can't provide such guarantee without using IO_LINK:
>
>   Pavel explained it in the following link:
>
>   https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
>
> There are also other examples, such as, register buffer stays in one
> link chain, and the consumer OP isn't in this chain, the consumer OP
> can still be issued before issuing register_buffer.

Yep, I got that. Linking is just something I was hoping to avoid. I
understand there are conditions that can break the normal flow I'm
relying on regarding the ordering. This hasn't appeared to be a problem
in practice, but I agree this needs to be handled.
On 2/12/25 15:28, Keith Busch wrote:
> On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
>> It is explained in the following links:
>>
>> https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
>>
>> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>>   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>>   ->issue() when the two are submitted in same io_uring_enter(), so you
>>   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's
>>   ->prep() to ->issue() first.
>
> I don't think that's accurate, at least in practice. In a normal flow,
> we'll have this sequence:
>
>   io_submit_sqes
>     io_submit_sqe (uring_cmd ublk register)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>         ->issue()
>     io_submit_sqe (read/write_fixed)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>         ->issue()
>
> The first SQE is handled in its entirety before even looking at the
> subsequent SQE. Since the register is first, then the read/write_fixed's
> prep will have a valid index. Testing this patch series appears to show
> this reliably works.

Ming describes how it works for links. This one is indeed how non-links
are normally executed. Though I'd repeat, it's an implementation detail
and not a part of the uapi. Interestingly, Keith, you sent some patches
changing the ordering here quite a while ago, just as an example of how
it can change.

>> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>>   can't provide such guarantee without using IO_LINK:
>>
>>   Pavel explained it in the following link:
>>
>>   https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
>>
>> There are also other examples, such as, register buffer stays in one
>> link chain, and the consumer OP isn't in this chain, the consumer OP
>> can still be issued before issuing register_buffer.
>
> Yep, I got that. Linking is just something I was hoping to avoid. I
> understand there are conditions that can break the normal flow I'm
> relying on regarding the ordering. This hasn't appeared to be a problem
> in practice, but I agree this needs to be handled.
On Wed, Feb 12, 2025 at 04:06:58PM +0000, Pavel Begunkov wrote:
> On 2/12/25 15:28, Keith Busch wrote:
> > On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> > > It is explained in the following links:
> > >
> > > https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
> > >
> > > - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
> > >   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
> > >   ->issue() when the two are submitted in same io_uring_enter(), so you
> > >   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's
> > >   ->prep() to ->issue() first.
> >
> > I don't think that's accurate, at least in practice. In a normal flow,
> > we'll have this sequence:
> >
> >   io_submit_sqes
> >     io_submit_sqe (uring_cmd ublk register)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >         ->issue()
> >     io_submit_sqe (read/write_fixed)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >         ->issue()
> >
> > The first SQE is handled in its entirety before even looking at the
> > subsequent SQE. Since the register is first, then the read/write_fixed's
> > prep will have a valid index. Testing this patch series appears to show
> > this reliably works.
>
> Ming describes how it works for links. This one is indeed how
> non links are normally executed. Though I'd repeat it's an
> implementation detail and not a part of the uapi. Interestingly,
> Keith, you sent some patches changing the ordering here quite a
> while ago, just as an example of how it can change.

My fault, I should have provided the link or the async background.

> > > - secondly, ->issue() order is only respected by IO_LINK, and io_uring
> > >   can't provide such guarantee without using IO_LINK:
> > >
> > >   Pavel explained it in the following link:
> > >
> > >   https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
> > >
> > > There are also other examples, such as, register buffer stays in one
> > > link chain, and the consumer OP isn't in this chain, the consumer OP
> > > can still be issued before issuing register_buffer.
> >
> > Yep, I got that. Linking is just something I was hoping to avoid. I
> > understand there are conditions that can break the normal flow I'm
> > relying on regarding the ordering. This hasn't appeared to be a problem
> > in practice, but I agree this needs to be handled.

LINK/ASYNC needs to be supported, and sometimes they are useful:

- IO_LINK is the only way of respecting IO order; io_uring only supports
  non-linked, or all-linked, submissions in one batch.

- ASYNC can sometimes avoid calling ->issue() twice unnecessarily, if
  you know in advance that the OP can't be handled in an async way;
  maybe not a problem for ublk uring_cmd, but it is helpful for some FS
  writes (un-allocated writes).

Thanks,
Ming
Hi,

> -----Original Message-----
> From: Keith Busch <kbusch@meta.com>
> Sent: Tuesday, February 11, 2025 8:57 AM
> To: ming.lei@redhat.com; asml.silence@gmail.com; axboe@kernel.dk;
> linux-block@vger.kernel.org; io-uring@vger.kernel.org
> Cc: bernd@bsbernd.com; Keith Busch <kbusch@kernel.org>
> Subject: [PATCHv2 0/6] ublk zero-copy support
>
> From: Keith Busch <kbusch@kernel.org>
>
> Previous version was discussed here:
>
> https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
>
> The same ublksrv reference code in that link was used to test the
> kernel side changes.
>
> Before listing what has changed, I want to mention what is the same:
> the reliance on the ring ctx lock to serialize the register ahead of
> any use. I'm not ignoring the feedback; I just don't have a solid
> answer right now, and want to progress on the other fronts in the
> meantime.
>
> Here's what's different from the previous:
>
> - Introduced an optional 'release' callback when the resource node is
>   no longer referenced. The callback addresses any buggy applications
>   that may complete their request and unregister their index while IO
>   is in flight. This obviates any need to take extra page references
>   since it prevents the request from completing.
>
> - Removed peeking into the io_cache element size and instead use a
>   more intuitive bvec segment count limit to decide if we're caching
>   the imu (suggested by Pavel).
>
> - Dropped the const request changes; it's not needed.

I tested this patch set. When I use null as the device, the test
results are like your v1: when the bs is 4k, there is a slight
improvement; when the bs is 64k, there is a significant improvement.
However, when I used loop as the device, I found that there was no
improvement, whether using 4k or 64k. As follows:

ublk add -t loop -f ./ublk-loop.img
ublk add -t loop -f ./ublk-loop-zerocopy.img

fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
  read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
  read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)

So, is this patch set optimized only for null-type devices? Or if I've
missed any key information, please let me know.

---
Li Zetao
On Thu, Feb 13, 2025 at 03:12:43PM +0000, lizetao wrote:
> I tested this patch set. When I use null as the device, the test
> results are like your v1. When the bs is 4k, there is a slight
> improvement; when the bs is 64k, there is a significant improvement.
> However, when I used loop as the device, I found that there was no
> improvement, whether using 4k or 64k. As follow:
>
> ublk add -t loop -f ./ublk-loop.img
> ublk add -t loop -f ./ublk-loop-zerocopy.img
>
> fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
>
> fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
>
> So, this patch set is optimized for null type devices? Or if I've
> missed any key information, please let me know.

What do you get if you run your fio job directly on your ublk-loop.img
file? Throughput should improve until you've saturated the backend
device. Once you hit that point, the primary benefits of zero-copy come
from decreased memory and CPU utilization.
On Thu, Feb 13, 2025 at 11:24 PM lizetao <lizetao1@huawei.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Keith Busch <kbusch@meta.com>
> > Sent: Tuesday, February 11, 2025 8:57 AM
> > To: ming.lei@redhat.com; asml.silence@gmail.com; axboe@kernel.dk;
> > linux-block@vger.kernel.org; io-uring@vger.kernel.org
> > Cc: bernd@bsbernd.com; Keith Busch <kbusch@kernel.org>
> > Subject: [PATCHv2 0/6] ublk zero-copy support
> >
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Previous version was discussed here:
> >
> > https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
> >
> > The same ublksrv reference code in that link was used to test the
> > kernel side changes.
> >
> > Before listing what has changed, I want to mention what is the same:
> > the reliance on the ring ctx lock to serialize the register ahead of
> > any use. I'm not ignoring the feedback; I just don't have a solid
> > answer right now, and want to progress on the other fronts in the
> > meantime.
> >
> > Here's what's different from the previous:
> >
> > - Introduced an optional 'release' callback when the resource node is
> >   no longer referenced. The callback addresses any buggy applications
> >   that may complete their request and unregister their index while IO
> >   is in flight. This obviates any need to take extra page references
> >   since it prevents the request from completing.
> >
> > - Removed peeking into the io_cache element size and instead use a
> >   more intuitive bvec segment count limit to decide if we're caching
> >   the imu (suggested by Pavel).
> >
> > - Dropped the const request changes; it's not needed.
>
> I tested this patch set. When I use null as the device, the test results
> are like your v1. When the bs is 4k, there is a slight improvement; when
> the bs is 64k, there is a significant improvement.

Yes, the improvement is usually more obvious with a big IO size (>= 64K).

> However, when I used loop as the device, I found that there was no
> improvement, whether using 4k or 64k. As follow:
>
> ublk add -t loop -f ./ublk-loop.img
> ublk add -t loop -f ./ublk-loop-zerocopy.img
>
> fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
>
> fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
>
> So, this patch set is optimized for null type devices? Or if I've missed
> any key information, please let me know.

Latency may have decreased a bit.

System resources can't be saturated at a queue depth of one; please run
the same test with a high queue depth, per Keith's suggestion:

  --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16

Also, if you set up the backing file as a ramfs image, the improvement
should be pretty obvious; I observed IOPS doubled this way.

Thanks,
Ming
Hi,

> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Friday, February 14, 2025 12:07 AM
> To: lizetao <lizetao1@huawei.com>
> Cc: Keith Busch <kbusch@meta.com>; io-uring@vger.kernel.org;
> axboe@kernel.dk
> Subject: Re: [PATCHv2 0/6] ublk zero-copy support
>
> On Thu, Feb 13, 2025 at 03:12:43PM +0000, lizetao wrote:
> > I tested this patch set. When I use null as the device, the test
> > results are like your v1. When the bs is 4k, there is a slight
> > improvement; when the bs is 64k, there is a significant improvement.
> > However, when I used loop as the device, I found that there was no
> > improvement, whether using 4k or 64k. As follow:
> >
> > ublk add -t loop -f ./ublk-loop.img
> > ublk add -t loop -f ./ublk-loop-zerocopy.img
> >
> > fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> >
> > fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> >
> > So, this patch set is optimized for null type devices? Or if I've
> > missed any key information, please let me know.
>
> What do you get if you run your fio job directly on your ublk-loop.img file?

I tested it directly on ublk-loop.img, and the result is as follows:

fio -filename=./ublk-loop.img -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
  read: IOPS=1005, BW=126MiB/s (132MB/s)(1258MiB/10009msec)

From the test results, this seems to be limited by the performance of
the file system where ./ublk-loop.img is located.

> Throughput should improve until you've saturated the backend device.
> Once you hit that point, the primary benefits of zero-copy come from
> decreased memory and CPU utilization.

---
Li Zetao
Hi,

> -----Original Message-----
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Friday, February 14, 2025 10:41 AM
> To: lizetao <lizetao1@huawei.com>
> Cc: Keith Busch <kbusch@meta.com>; io-uring@vger.kernel.org;
> axboe@kernel.dk; Ming Lei <ming.lei@redhat.com>
> Subject: Re: [PATCHv2 0/6] ublk zero-copy support
>
> On Thu, Feb 13, 2025 at 11:24 PM lizetao <lizetao1@huawei.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Keith Busch <kbusch@meta.com>
> > > Sent: Tuesday, February 11, 2025 8:57 AM
> > > To: ming.lei@redhat.com; asml.silence@gmail.com; axboe@kernel.dk;
> > > linux-block@vger.kernel.org; io-uring@vger.kernel.org
> > > Cc: bernd@bsbernd.com; Keith Busch <kbusch@kernel.org>
> > > Subject: [PATCHv2 0/6] ublk zero-copy support
> > >
> > > From: Keith Busch <kbusch@kernel.org>
> > >
> > > Previous version was discussed here:
> > >
> > > https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
> > >
> > > The same ublksrv reference code in that link was used to test the
> > > kernel side changes.
> > >
> > > Before listing what has changed, I want to mention what is the same:
> > > the reliance on the ring ctx lock to serialize the register ahead of
> > > any use. I'm not ignoring the feedback; I just don't have a solid
> > > answer right now, and want to progress on the other fronts in the
> > > meantime.
> > >
> > > Here's what's different from the previous:
> > >
> > > - Introduced an optional 'release' callback when the resource node is
> > >   no longer referenced. The callback addresses any buggy applications
> > >   that may complete their request and unregister their index while IO
> > >   is in flight. This obviates any need to take extra page references
> > >   since it prevents the request from completing.
> > >
> > > - Removed peeking into the io_cache element size and instead use a
> > >   more intuitive bvec segment count limit to decide if we're caching
> > >   the imu (suggested by Pavel).
> > >
> > > - Dropped the const request changes; it's not needed.
> >
> > I tested this patch set. When I use null as the device, the test results
> > are like your v1. When the bs is 4k, there is a slight improvement; when
> > the bs is 64k, there is a significant improvement.
>
> Yes, the improvement is usually more obvious with a big IO size (>= 64K).
>
> > However, when I used loop as the device, I found that there was no
> > improvement, whether using 4k or 64k. As follow:
> >
> > ublk add -t loop -f ./ublk-loop.img
> > ublk add -t loop -f ./ublk-loop-zerocopy.img
> >
> > fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> >
> > fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
> > read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> >
> > So, this patch set is optimized for null type devices? Or if I've missed
> > any key information, please let me know.
>
> Latency may have decreased a bit.
>
> System resources can't be saturated at a queue depth of one; please run
> the same test with a high queue depth, per Keith's suggestion:
>
> --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16

I tested it with these settings, but the result is similar to iodepth=1:

fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=2182, BW=136MiB/s (143MB/s)(1440MiB/10558msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=2174, BW=136MiB/s (143MB/s)(1438MiB/10580msec)

So I believe this is limited by the performance of the file system where
./ublk-loop.img is located.

> Also if you set up the backing file as a ramfs image, the improvement
> should be pretty obvious; I observed IOPS doubled this way.

This is true; I tested it in /tmp/ and saw a large improvement. The
results are as follows:

fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=95.8k, BW=5985MiB/s (6276MB/s)(58.5GiB/10014msec)

fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
  read: IOPS=170k, BW=10.4GiB/s (11.1GB/s)(80.0GiB/7721msec)

So this test result is in line with expectations.

---
Li Zetao
From: Keith Busch <kbusch@kernel.org>

Previous version was discussed here:

https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/

The same ublksrv reference code in that link was used to test the kernel
side changes.

Before listing what has changed, I want to mention what is the same: the
reliance on the ring ctx lock to serialize the register ahead of any
use. I'm not ignoring the feedback; I just don't have a solid answer
right now, and want to progress on the other fronts in the meantime.

Here's what's different from the previous:

- Introduced an optional 'release' callback when the resource node is
  no longer referenced. The callback addresses any buggy applications
  that may complete their request and unregister their index while IO
  is in flight. This obviates any need to take extra page references
  since it prevents the request from completing.

- Removed peeking into the io_cache element size and instead use a
  more intuitive bvec segment count limit to decide if we're caching
  the imu (suggested by Pavel).

- Dropped the const request changes; it's not needed.

- Merged up to latest block uring branch

Jens Axboe (1):
  io_uring: use node for import

Keith Busch (5):
  io_uring: create resource release callback
  io_uring: add support for kernel registered bvecs
  ublk: zc register/unregister bvec
  io_uring: add abstraction for buf_table rsrc data
  io_uring: cache nodes and mapped buffers

 drivers/block/ublk_drv.c       | 145 ++++++++++++++-----
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |  28 ++--
 include/uapi/linux/ublk_cmd.h  |   4 +
 io_uring/fdinfo.c              |   8 +-
 io_uring/filetable.c           |   2 +-
 io_uring/net.c                 |   5 +-
 io_uring/nop.c                 |   2 +-
 io_uring/register.c            |   2 +-
 io_uring/rsrc.c                | 252 ++++++++++++++++++++++++++-------
 io_uring/rsrc.h                |  11 +-
 io_uring/rw.c                  |   4 +-
 io_uring/uring_cmd.c           |   4 +-
 13 files changed, 355 insertions(+), 113 deletions(-)