mbox series

[PATCHv2,0/6] ublk zero-copy support

Message ID 20250211005646.222452-1-kbusch@meta.com (mailing list archive)
Headers show
Series ublk zero-copy support | expand

Message

Keith Busch Feb. 11, 2025, 12:56 a.m. UTC
From: Keith Busch <kbusch@kernel.org>

Previous version was discussed here:

  https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/

The same ublksrv reference code in that link was used to test the kernel
side changes.

Before listing what has changed, I want to mention what is the same: the
reliance on the ring ctx lock to serialize the register ahead of any
use. I'm not ignoring the feedback; I just don't have a solid answer
right now, and want to progress on the other fronts in the meantime.

Here's what's different from the previous:

 - Introduced an optional 'release' callback when the resource node is
   no longer referenced. The callback addresses any buggy applications
   that may complete their request and unregister their index while IO
   is in flight. This obviates any need to take extra page references
   since it prevents the request from completing.

 - Removed peeking into the io_cache element size and instead use a
   more intuitive bvec segment count limit to decide if we're caching
   the imu (suggested by Pavel).

 - Dropped the const request changes; it's not needed.

 - Merged up to latest block uring branch

Jens Axboe (1):
  io_uring: use node for import

Keith Busch (5):
  io_uring: create resource release callback
  io_uring: add support for kernel registered bvecs
  ublk: zc register/unregister bvec
  io_uring: add abstraction for buf_table rsrc data
  io_uring: cache nodes and mapped buffers

 drivers/block/ublk_drv.c       | 145 ++++++++++++++-----
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |  28 ++--
 include/uapi/linux/ublk_cmd.h  |   4 +
 io_uring/fdinfo.c              |   8 +-
 io_uring/filetable.c           |   2 +-
 io_uring/net.c                 |   5 +-
 io_uring/nop.c                 |   2 +-
 io_uring/register.c            |   2 +-
 io_uring/rsrc.c                | 252 ++++++++++++++++++++++++++-------
 io_uring/rsrc.h                |  11 +-
 io_uring/rw.c                  |   4 +-
 io_uring/uring_cmd.c           |   4 +-
 13 files changed, 355 insertions(+), 113 deletions(-)

Comments

Ming Lei Feb. 12, 2025, 2:29 a.m. UTC | #1
On Mon, Feb 10, 2025 at 04:56:40PM -0800, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Previous version was discussed here:
> 
>   https://lore.kernel.org/linux-block/20250203154517.937623-1-kbusch@meta.com/
> 
> The same ublksrv reference code in that link was used to test the kernel
> side changes.
> 
> Before listing what has changed, I want to mention what is the same: the
> reliance on the ring ctx lock to serialize the register ahead of any
> use. I'm not ignoring the feedback; I just don't have a solid answer
> right now, and want to progress on the other fronts in the meantime.

It is explained in the following links:

https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/

- node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
  in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
  ->issue() when the two are submitted in same io_uring_enter(), so you
  need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
  to ->issue() first.

- secondly, ->issue() order is only respected by IO_LINK, and io_uring
  can't provide such guarantee without using IO_LINK:

  Pavel explained it in the following link:

  https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/

  There are also other examples, such as, register buffer stays in one
  link chain, and the consumer OP isn't in this chain, the consumer OP
  can still be issued before issuing register_buffer.


Thanks, 
Ming
Keith Busch Feb. 12, 2025, 3:28 p.m. UTC | #2
On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> It is explained in the following links:
> 
> https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
> 
> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>   ->issue() when the two are submitted in same io_uring_enter(), so you
>   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
>   to ->issue() first.

I don't think that's accurate, at least in practice. In a normal flow,
we'll have this sequence:

 io_submit_sqes
   io_submit_sqe (uring_cmd ublk register)
     io_init_req
       ->prep()
     io_queue_sqe
       ->issue()
   io_submit_sqe (read/write_fixed)
     io_init_req
       ->prep()
     io_queue_sqe
      ->issue()

The first SQE is handled in its entirety before even looking at the
subsequent SQE. Since the register is first, then the read/write_fixed's
prep will have a valid index. Testing this patch series appears to show
this reliably works.
 
> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>   can't provide such guarantee without using IO_LINK:
> 
>   Pavel explained it in the following link:
> 
>   https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
> 
>   There are also other examples, such as, register buffer stays in one
>   link chain, and the consumer OP isn't in this chain, the consumer OP
>   can still be issued before issuing register_buffer.

Yep, I got that. Linking is just something I was hoping to avoid. I
understand there are conditions that can break the normal flow I'm
relying on regarding  the ordering. This hasn't appeared to be a problem
in practice, but I agree this needs to be handled.
Pavel Begunkov Feb. 12, 2025, 4:06 p.m. UTC | #3
On 2/12/25 15:28, Keith Busch wrote:
> On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
>> It is explained in the following links:
>>
>> https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
>>
>> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>>    in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>>    ->issue() when the two are submitted in same io_uring_enter(), so you
>>    need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
>>    to ->issue() first.
> 
> I don't think that's accurate, at least in practice. In a normal flow,
> we'll have this sequence:
> 
>   io_submit_sqes
>     io_submit_sqe (uring_cmd ublk register)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>         ->issue()
>     io_submit_sqe (read/write_fixed)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>        ->issue()
> 
> The first SQE is handled in its entirety before even looking at the
> subsequent SQE. Since the register is first, then the read/write_fixed's
> prep will have a valid index. Testing this patch series appears to show
> this reliably works.

Ming describes how it works for links. This one is indeed how
non links are normally executed. Though I'd repeat it's an
implementation detail and not a part of the uapi. Interestingly,
Keith, you sent some patches changing the ordering here quite a
while ago, just as an example of how it can change.


>> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>>    can't provide such guarantee without using IO_LINK:
>>
>>    Pavel explained it in the following link:
>>
>>    https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
>>
>>    There are also other examples, such as, register buffer stays in one
>>    link chain, and the consumer OP isn't in this chain, the consumer OP
>>    can still be issued before issuing register_buffer.
> 
> Yep, I got that. Linking is just something I was hoping to avoid. I
> understand there are conditions that can break the normal flow I'm
> relying on regarding  the ordering. This hasn't appeared to be a problem
> in practice, but I agree this needs to be handled.
Ming Lei Feb. 13, 2025, 1:52 a.m. UTC | #4
On Wed, Feb 12, 2025 at 04:06:58PM +0000, Pavel Begunkov wrote:
> On 2/12/25 15:28, Keith Busch wrote:
> > On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> > > It is explained in the following links:
> > > 
> > > https://lore.kernel.org/linux-block/b6211101-3f74-4dea-a880-81bb75575dbd@gmail.com/
> > > 
> > > - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
> > >    in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
> > >    ->issue() when the two are submitted in same io_uring_enter(), so you
> > >    need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
> > >    to ->issue() first.
> > 
> > I don't think that's accurate, at least in practice. In a normal flow,
> > we'll have this sequence:
> > 
> >   io_submit_sqes
> >     io_submit_sqe (uring_cmd ublk register)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >         ->issue()
> >     io_submit_sqe (read/write_fixed)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >        ->issue()
> > 
> > The first SQE is handled in its entirety before even looking at the
> > subsequent SQE. Since the register is first, then the read/write_fixed's
> > prep will have a valid index. Testing this patch series appears to show
> > this reliably works.
> 
> Ming describes how it works for links. This one is indeed how
> non links are normally executed. Though I'd repeat it's an
> implementation detail and not a part of the uapi. Interestingly,
> Keith, you sent some patches changing the ordering here quite a
> while ago, just as an example of how it can change.

My fault, I should have provided the link or async background.

> 
> 
> > > - secondly, ->issue() order is only respected by IO_LINK, and io_uring
> > >    can't provide such guarantee without using IO_LINK:
> > > 
> > >    Pavel explained it in the following link:
> > > 
> > >    https://lore.kernel.org/linux-block/68256da6-bb13-4498-a0e0-dce88bb32242@gmail.com/
> > > 
> > >    There are also other examples, such as, register buffer stays in one
> > >    link chain, and the consumer OP isn't in this chain, the consumer OP
> > >    can still be issued before issuing register_buffer.
> > 
> > Yep, I got that. Linking is just something I was hoping to avoid. I
> > understand there are conditions that can break the normal flow I'm
> > relying on regarding  the ordering. This hasn't appeared to be a problem
> > in practice, but I agree this needs to be handled.

LINK/ASYNC needs to be supported, and sometimes they are useful.

- IO_LINK is the only way for respecting IO order

  io_uring only supports non-link or link all in one batch

- ASYNC sometimes can avoid to call two ->issue() unnecessarily if you
  know that the OP can't be dealt with async way in advance, maybe not
  one problem for ublk uring_cmd, but it is helpful for some FS write
  (un-allocated write)


Thanks,
Ming