mbox series

[V10,0/12] io_uring: support group buffer & ublk zc

Message ID 20241107110149.890530-1-ming.lei@redhat.com (mailing list archive)
Headers show
Series io_uring: support group buffer & ublk zc | expand

Message

Ming Lei Nov. 7, 2024, 11:01 a.m. UTC
Hello,

Patch 1~3 cleans rsrc code.

Patch 4~9 prepares for supporting kernel buffer. 

The 10th patch supports group buffer, so far only kernel buffer is
supported, but it is pretty easy to extend for userspace group buffer.

The 11th patch adds uring_cmd interface for driver.

The last patch applies group kernel buffer for supporting ublk zc.

Tests:

1) pass liburing test
- make runtests

2) add ublk loop-zc test code
https://github.com/ming1/liburing/tree/uring_group

V10:
	- use io_rsrc_node to deal with group buffer(Jens)
	- rebase on the latest for-6.13/io_uring
	- cleanup rsrc code
	- add request flag for marking if buffer is imported(Jens)
	- misc cleanup

V9:
	- reuse io_mapped_ubuf for group kernel buffer(Jens)
	- rename as REQ_F_GROUP_BUF which can be extended for userspace
	group buffer easily
	- rebase on the latest for-6.13/io_uring
	- make sure that group buffer is imported once
	- use group buffer exclusively with buffer node & buffer select
	- misc cleanup

V8:
	- simplify & clean up group request completion, don't reuse 
	SQE_GROUP as state; meantime improve document; now group
	implementation is quite clean
	- handle short read/recv correctly by zeroing out the remained
	  part(Pavel)
	- fix one group leader reference(Uday Shankar)
	- only allow ublk provide buffer command in case of zc(Uday Shankar)

V7:
	- remove dead code in sqe group support(Pavel)
	- fail single group request(Pavel)
	- remove IORING_PROVIDE_GROUP_KBUF(Pavel)
	- remove REQ_F_SQE_GROUP_DEP(Pavel)
	- rename as leasing buffer
	- improve commit log
	- map group member's IOSQE_IO_DRAIN to GROUP_KBUF, which
	aligns with buffer select use, and it means that io_uring starts
	to support leased kbuf from other subsystem for group member
	requests only

V6:
	- follow Pavel's suggestion to disallow IOSQE_CQE_SKIP_SUCCESS &
	  LINK_TIMEOUT
	- kill __io_complete_group_member() (Pavel)
	- simplify link failure handling (Pavel)
	- move members' queuing out of completion lock (Pavel)
	- cleanup group io complete handler
	- add more comment
	- add ublk zc into liburing test for covering
	  IOSQE_SQE_GROUP & IORING_PROVIDE_GROUP_KBUF 

V5:
	- follow Pavel's suggestion to minimize change on io_uring fast code
	  path: sqe group code is called in by single 'if (unlikely())' from
	  both issue & completion code path

	- simplify & re-write group request completion
		avoid to touch io-wq code by completing group leader via tw
		directly, just like ->task_complete

		re-write group member & leader completion handling, one
		simplification is always to free leader via the last member

		simplify queueing group members, not support issuing leader
		and members in parallel

	- fail the whole group if IO_*LINK & IO_DRAIN is set on group
	  members, and test code to cover this change

	- misc cleanup

V4:
	- address most comments from Pavel
	- fix request double free
	- don't use io_req_commit_cqe() in io_req_complete_defer()
	- make members' REQ_F_INFLIGHT discoverable
	- use common assembling check in submission code path
	- drop patch 3 and don't move REQ_F_CQE_SKIP out of io_free_req()
	- don't set .accept_group_kbuf for net send zc, in which members
	  need to be queued after buffer notification is got, and can be
	  enabled in future
	- add .grp_leader field via union, and share storage with .grp_link
	- move .grp_refs into one hole of io_kiocb, so that one extra
	cacheline isn't needed for io_kiocb
	- cleanup & document improvement

V3:
	- add IORING_FEAT_SQE_GROUP
	- simplify group completion, and minimize change on io_req_complete_defer()
	- simplify & cleanup io_queue_group_members()
	- fix many failure handling issues
	- cover failure handling code in added liburing tests
	- remove RFC

V2:
	- add generic sqe group, suggested by Kevin Wolf
	- add REQ_F_SQE_GROUP_DEP which is based on IOSQE_SQE_GROUP, for sharing
	  kernel resource in group wide, suggested by Kevin Wolf
	- remove sqe ext flag, and use the last bit for IOSQE_SQE_GROUP(Pavel),
	in future we still can extend sqe flags with one uring context flag
	- initialize group requests via submit state pattern, suggested by Pavel
	- all kinds of cleanup & bug fixes

Ming Lei (12):
  io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers
  io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'
  io_uring/rsrc: add & apply io_req_assign_buf_node()
  io_uring/rsrc: prepare for supporting external 'io_rsrc_node'
  io_uring: rename io_mapped_ubuf as io_mapped_buf
  io_uring: rename io_mapped_buf->ubuf as io_mapped_buf->addr
  io_uring: shrink io_mapped_buf
  io_uring: reuse io_mapped_buf for kernel buffer
  io_uring: add callback to 'io_mapped_buffer' for giving back kernel
    buffer
  io_uring: support leased group buffer with REQ_F_GROUP_BUF
  io_uring/uring_cmd: support leasing device kernel buffer to io_uring
  ublk: support leasing io buffer to io_uring

 drivers/block/ublk_drv.c       | 168 +++++++++++++++++++++++++++++++--
 include/linux/io_uring/cmd.h   |   7 ++
 include/linux/io_uring_types.h |  43 +++++++++
 include/uapi/linux/ublk_cmd.h  |  11 ++-
 io_uring/fdinfo.c              |   4 +-
 io_uring/filetable.c           |  13 +--
 io_uring/filetable.h           |   4 +-
 io_uring/io_uring.c            |  24 ++++-
 io_uring/io_uring.h            |   5 +
 io_uring/kbuf.c                |  82 ++++++++++++++++
 io_uring/kbuf.h                |  27 ++++++
 io_uring/net.c                 |  30 +++++-
 io_uring/nop.c                 |   3 +-
 io_uring/opdef.c               |   4 +
 io_uring/opdef.h               |   2 +
 io_uring/rsrc.c                |  69 ++++++++------
 io_uring/rsrc.h                |  77 +++++++--------
 io_uring/rw.c                  |  40 ++++++--
 io_uring/splice.c              |   2 +-
 io_uring/uring_cmd.c           |  12 ++-
 20 files changed, 521 insertions(+), 106 deletions(-)

Comments

Jens Axboe Nov. 7, 2024, 10:25 p.m. UTC | #1
On Thu, 07 Nov 2024 19:01:33 +0800, Ming Lei wrote:
> Patch 1~3 cleans rsrc code.
> 
> Patch 4~9 prepares for supporting kernel buffer.
> 
> The 10th patch supports group buffer, so far only kernel buffer is
> supported, but it is pretty easy to extend for userspace group buffer.
> 
> [...]

Applied, thanks!

[01/12] io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers
        commit: 0d98c509086837a8cf5a32f82f2a58f39a539192
[02/12] io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'
        commit: 4f219fcce5e4366cc121fc98270beb1fbbb3df2b
[03/12] io_uring/rsrc: add & apply io_req_assign_buf_node()
        commit: 039c878db7add23c1c9ea18424c442cce76670f9

Best regards,
Jens Axboe Nov. 7, 2024, 10:25 p.m. UTC | #2
On 11/7/24 3:25 PM, Jens Axboe wrote:
> 
> On Thu, 07 Nov 2024 19:01:33 +0800, Ming Lei wrote:
>> Patch 1~3 cleans rsrc code.
>>
>> Patch 4~9 prepares for supporting kernel buffer.
>>
>> The 10th patch supports group buffer, so far only kernel buffer is
>> supported, but it is pretty easy to extend for userspace group buffer.
>>
>> [...]
> 
> Applied, thanks!
> 
> [01/12] io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers
>         commit: 0d98c509086837a8cf5a32f82f2a58f39a539192
> [02/12] io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'
>         commit: 4f219fcce5e4366cc121fc98270beb1fbbb3df2b
> [03/12] io_uring/rsrc: add & apply io_req_assign_buf_node()
>         commit: 039c878db7add23c1c9ea18424c442cce76670f9

Applied the first three as they stand alone quite nicely. I did ponder
on patch 1 to skip the make eg io_alloc_file_tables() not take both
the ctx and &ctx->file_table, but we may as well keep it symmetric.

I'll take a look at the rest of the series tomorrow.
Ming Lei Nov. 12, 2024, 12:53 a.m. UTC | #3
On Thu, Nov 07, 2024 at 03:25:59PM -0700, Jens Axboe wrote:
> On 11/7/24 3:25 PM, Jens Axboe wrote:
> > 
> > On Thu, 07 Nov 2024 19:01:33 +0800, Ming Lei wrote:
> >> Patch 1~3 cleans rsrc code.
> >>
> >> Patch 4~9 prepares for supporting kernel buffer.
> >>
> >> The 10th patch supports group buffer, so far only kernel buffer is
> >> supported, but it is pretty easy to extend for userspace group buffer.
> >>
> >> [...]
> > 
> > Applied, thanks!
> > 
> > [01/12] io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers
> >         commit: 0d98c509086837a8cf5a32f82f2a58f39a539192
> > [02/12] io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'
> >         commit: 4f219fcce5e4366cc121fc98270beb1fbbb3df2b
> > [03/12] io_uring/rsrc: add & apply io_req_assign_buf_node()
> >         commit: 039c878db7add23c1c9ea18424c442cce76670f9
> 
> Applied the first three as they stand alone quite nicely. I did ponder
> on patch 1 to skip the make eg io_alloc_file_tables() not take both
> the ctx and &ctx->file_table, but we may as well keep it symmetric.
> 
> I'll take a look at the rest of the series tomorrow.

Hi Jens,

Any comment on the rest of the series?


thanks,
Ming
Pavel Begunkov Nov. 13, 2024, 1:43 p.m. UTC | #4
On 11/12/24 00:53, Ming Lei wrote:
> On Thu, Nov 07, 2024 at 03:25:59PM -0700, Jens Axboe wrote:
>> On 11/7/24 3:25 PM, Jens Axboe wrote:
...
> Hi Jens,
> 
> Any comment on the rest of the series?

Ming, it's dragging on because it's over complicated. I very much want
it to get to some conclusion, get it merged and move on, and I strongly
believe Jens shares the sentiment on getting the thing done.

Please, take the patches attached, adjust them to your needs and put
ublk on top. Or tell if there is a strong reason why it doesn't work.
The implementation is very simple and doesn't need almost anything
from io_uring, it's low risk and we can merge in no time.

If you can't cache the allocation in ublk, io_uring can add a cache.
If ublk needs more space and cannot embed the structure, we can add
a "private" pointer into io_mapped_ubuf. If it needs to check the IO
direction, we can add that as well (though I have doubts you really need
it, read-only might makes sense, write-only not so much). We'll also
merge Jens' patch allowing to remove a buffer with a request.
Jens Axboe Nov. 13, 2024, 2:56 p.m. UTC | #5
On 11/13/24 6:43 AM, Pavel Begunkov wrote:
> On 11/12/24 00:53, Ming Lei wrote:
>> On Thu, Nov 07, 2024 at 03:25:59PM -0700, Jens Axboe wrote:
>>> On 11/7/24 3:25 PM, Jens Axboe wrote:
> ...
>> Hi Jens,
>>
>> Any comment on the rest of the series?
> 
> Ming, it's dragging on because it's over complicated. I very much want
> it to get to some conclusion, get it merged and move on, and I strongly
> believe Jens shares the sentiment on getting the thing done.
> 
> Please, take the patches attached, adjust them to your needs and put
> ublk on top. Or tell if there is a strong reason why it doesn't work.
> The implementation is very simple and doesn't need almost anything
> from io_uring, it's low risk and we can merge in no time.

Indeed, nobody would love to get this moving forward more than me!
Pavel, can you setup a branch with the required patches? Should be your
two and then the buf update and mapping bits I did earlier. I can do it
too. Ming, would you mind if we try and setup a base that we can do this
on more trivially? I had merged the sqe grouping, but the more I think
about it, the less I like the added complexity, and the limitations we
had to put in there because relationships weren't fully understandable.

With the #1 goal of getting leased/borrowed buffers working asap, here's
what I suggest:

1) Pavel/I provide a base for doing the bare minimum, which is having an
   ephemeral buffer that zc can use.
2) You do the ublk bits on top

Yes this won't have grouping, so buf update will have to be done
separately. The downside here is that it'll add a (tiny) bit of overhead
as there's an extra sqe involved, but I don't really think that's an
issue, not even a minor one. The main objective here is NOT copying the
data, which will dwarf any other tiny extra overhead added.

This avoids introducing sqe grouping as a concept as a requirement for
zero copy ublk, which I did mention earlier is part of the complication
here. I'm a BIG fan of keeping things simple initially, particularly
when it adds major dependency complexity to the core code.

The goal here is doing the simple buffer leasing in such a way that it's
trivially understandable, and doesn't depend on grouping. With that, we
can easily get this done for the 6.14 kernel and finally ship it. I wish
6.13 was a week more away because then I think we could get it in for
6.13, but we only really have a few days at this point, so it's a bit
late. Unfortunately!

Ming, what do you think? Let's get this sorted so we can move on to
actually being able to use zc ublk, which is the goal we all share here.

> If you can't cache the allocation in ublk, io_uring can add a cache.
> If ublk needs more space and cannot embed the structure, we can add
> a "private" pointer into io_mapped_ubuf. If it needs to check the IO
> direction, we can add that as well (though I have doubts you really need
> it, read-only might makes sense, write-only not so much). We'll also
> merge Jens' patch allowing to remove a buffer with a request.

Right, we can always improve the little things as we go forward and make
it more efficient. I do think it's diminishing returns at that point,
but that doesn't mean we should not do it and make it better. But first,
let's get the concept working.

That's just violent agreement btw, I think we all share that objective
:-)