[V6,00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD

Message ID: 20230330113630.1388860-1-ming.lei@redhat.com
Return-Path: <linux-block-owner@vger.kernel.org>
From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>, io-uring@vger.kernel.org, linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Miklos Szeredi <mszeredi@redhat.com>, ZiyangZhang <ZiyangZhang@linux.alibaba.com>, Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>, Bernd Schubert <bschubert@ddn.com>, Pavel Begunkov <asml.silence@gmail.com>, Stefan Hajnoczi <stefanha@redhat.com>, Dan Williams <dan.j.williams@intel.com>, Ming Lei <ming.lei@redhat.com>
Subject: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
Date: Thu, 30 Mar 2023 19:36:13 +0800
Message-Id: <20230330113630.1388860-1-ming.lei@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Ming Lei March 30, 2023, 11:36 a.m. UTC

Hello Jens and Guys,

Add a generic fused command, which consists of one primary command and multiple
secondary requests. The fused command provides a safe way to share a resource
between the primary command and its secondary requests: the primary command is
always completed after all secondary requests are done, and the resource's
lifetime is bound to the primary command.

This makes it easy to support zero copy for ublk/fuse devices, and
there could be more potential use cases, such as offloading complicated logic
into userspace, or decoupling kernel subsystems.

The following ublksrv code implements zero copy for the loop, nbd and
qcow2 targets with fused command:

https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6

All three ublk targets (loop, nbd and qcow2) support zero copy when -z is passed:

	ublk add -t [loop|nbd|qcow2] -z .... 

A liburing test case covering fused command, based on miniublk from
blktests, is also added:

https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6

Performance improvement is obvious on memory bandwidth related workloads: for
example, 1~2X improvement on the 64K/512K BS IO test on loop with a ramfs backing file.
ublk-null shows 5X IOPS improvement on the big BS test when the copy is avoided.

Please review and consider for v6.4.

V6:
	- re-design fused command to make it more generic, moving buffer sharing
	  into one plugin of fused command, so that more plugins can be implemented in future
	- document other potential use cases of fused command
	- drop support for builtin secondary SQE in SQE128, so all secondary
	  requests have standalone SQEs
	- make fused command one feature
	- cleanup & improve naming

V5:
	- rebase on for-6.4/io_uring
	- rename to primary/secondary as suggested by Jens
	- reserve interface for extending to support multiple secondary OPs in future,
	which isn't a must, because it can be done by submitting multiple fused
	commands with same primary request
	- rename to primary/secondary in ublksrv and liburing test code

V4:
	- improve APIs naming(patch 1 ~ 4)
	- improve documents and commit log(patch 2)
	- add buffer direction bit to opdef, suggested by Jens(patch 2)
	- add ublk zero copy document covering the technical requirements (mostly related to
	  buffer lifetime), and explaining why splice isn't good and how fused command solves it (patch 17)
	- fix sparse warning (patch 7)
	- support 64byte SQE fused command (patch 3)

V3:
	- fix build warning reported by kernel test robot
	- drop patch for checking fused flags on existing drivers with
	  ->uring_command(), which isn't necessary, since we do not do that
	  when adding new ioctl or uring command
	- inline io_init_rq() for core code, so just export io_init_secondary_req
	- return result of failed secondary request unconditionally since REQ_F_CQE_SKIP
	  will be cleared
	- pass xfstest over ublk-loop

V2:
	- don't reuse io_mapped_ubuf (io_uring)
	- remove REQ_F_FUSED_MASTER_BIT (io_uring)
	- fix compile warning (io_uring)
	- rebase on v6.3-rc1 (io_uring)
	- grabbing io request reference when handling fused command 
	- simplify ublk_copy_user_pages() by iov iterator
	- add read()/write() for userspace to read/write ublk io buffer, so
	that some corner cases(read zero, passthrough request(report zones)) can
	be handled easily in case of zero copy; this way also helps to switch to
	zero copy completely
	- misc cleanup


Ming Lei (17):
  io_uring: increase io_kiocb->flags into 64bit
  io_uring: use ctx->cached_sq_head to calculate left sqes
  io_uring: add generic IORING_OP_FUSED_CMD
  io_uring: support providing buffer by IORING_OP_FUSED_CMD
  io_uring: support OP_READ/OP_WRITE for fused secondary request
  io_uring: support OP_SEND_ZC/OP_RECV for fused secondary request
  block: ublk_drv: add common exit handling
  block: ublk_drv: don't consider flush request in map/unmap io
  block: ublk_drv: add two helpers to clean up map/unmap request
  block: ublk_drv: clean up several helpers
  block: ublk_drv: cleanup 'struct ublk_map_data'
  block: ublk_drv: cleanup ublk_copy_user_pages
  block: ublk_drv: grab request reference when the request is handled by
    userspace
  block: ublk_drv: support to copy any part of request pages
  block: ublk_drv: add read()/write() support for ublk char device
  block: ublk_drv: don't check buffer in case of zero copy
  block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy

 Documentation/block/ublk.rst   | 126 ++++++-
 drivers/block/ublk_drv.c       | 603 ++++++++++++++++++++++++++-------
 include/linux/io_uring.h       |  41 ++-
 include/linux/io_uring_types.h |  76 +++--
 include/uapi/linux/io_uring.h  |  22 +-
 include/uapi/linux/ublk_cmd.h  |  37 +-
 io_uring/Makefile              |   2 +-
 io_uring/fused_cmd.c           | 239 +++++++++++++
 io_uring/fused_cmd.h           |  16 +
 io_uring/io_uring.c            |  57 +++-
 io_uring/io_uring.h            |   5 +
 io_uring/net.c                 |  30 +-
 io_uring/opdef.c               |  22 ++
 io_uring/opdef.h               |   7 +
 io_uring/rw.c                  |  21 ++
 15 files changed, 1124 insertions(+), 180 deletions(-)
 create mode 100644 io_uring/fused_cmd.c
 create mode 100644 io_uring/fused_cmd.h

Ming Lei April 3, 2023, 1:11 a.m. UTC | #1

On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
> Hello Jens and Guys,
> 
> Add generic fused command, which can include one primary command and multiple
> secondary requests. This command provides one safe way to share resource between
> primary command and secondary requests, and primary command is always
> completed after all secondary requests are done, and resource lifetime
> is bound with primary command.
> 
> With this way, it is easy to support zero copy for ublk/fuse device, and
> there could be more potential use cases, such as offloading complicated logic
> into userspace, or decouple kernel subsystems.
> 
> Follows ublksrv code, which implements zero copy for loop, nbd and
> qcow2 targets with fused command:
> 
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> 
> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
> 
> 	ublk add -t [loop|nbd|qcow2] -z .... 
> 
> Also add liburing test case for covering fused command based on miniublk
> of blktest.
> 
> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> 
> Performance improvement is obvious on memory bandwidth related workloads,
> such as, 1~2X improvement on 64K/512K BS IO test on loop with ramfs backing file.
> ublk-null shows 5X IOPS improvement on big BS test when the copy is avoided.
> 
> Please review and consider for v6.4.
> 
> V6:
> 	- re-design fused command, and make it more generic, moving sharing buffer
> 	as one plugin of fused command, so in future we can implement more plugins
> 	- document potential other use cases of fused command
> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> 	  requests has standalone SQE
> 	- make fused command as one feature
> 	- cleanup & improve naming

Hi Jens,

Can you apply ublk cleanup patches 7~11 on for-6.4? For the others, we may
delay to 6.5, and I am looking at another approach too.


Thanks,
Ming

Jens Axboe April 3, 2023, 1:23 a.m. UTC | #2

On Thu, 30 Mar 2023 19:36:13 +0800, Ming Lei wrote:
> Add generic fused command, which can include one primary command and multiple
> secondary requests. This command provides one safe way to share resource between
> primary command and secondary requests, and primary command is always
> completed after all secondary requests are done, and resource lifetime
> is bound with primary command.
> 
> With this way, it is easy to support zero copy for ublk/fuse device, and
> there could be more potential use cases, such as offloading complicated logic
> into userspace, or decouple kernel subsystems.
> 
> [...]

Applied, thanks!

[07/17] block: ublk_drv: add common exit handling
        commit: 903f8aeea9fd1b97fba4ab805ddd639f57f117f8
[08/17] block: ublk_drv: don't consider flush request in map/unmap io
        commit: 23ef8220f287abe5bf741ddfc278e7359742d3b1
[09/17] block: ublk_drv: add two helpers to clean up map/unmap request
        commit: 2f3af723447c35c16f3c6a1b4b317c61dc41d6c3
[10/17] block: ublk_drv: clean up several helpers
        commit: 96cf2f5404c8bc979628a2b495852d735a56c5b5
[11/17] block: ublk_drv: cleanup 'struct ublk_map_data'
        commit: ae9f5ccea4c268a96763e51239b32d6b5172c18c

Best regards,

Jens Axboe April 3, 2023, 1:24 a.m. UTC | #3

On 4/2/23 7:11 PM, Ming Lei wrote:
> On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
>> Hello Jens and Guys,
>>
>> Add generic fused command, which can include one primary command and multiple
>> secondary requests. This command provides one safe way to share resource between
>> primary command and secondary requests, and primary command is always
>> completed after all secondary requests are done, and resource lifetime
>> is bound with primary command.
>>
>> With this way, it is easy to support zero copy for ublk/fuse device, and
>> there could be more potential use cases, such as offloading complicated logic
>> into userspace, or decouple kernel subsystems.
>>
>> Follows ublksrv code, which implements zero copy for loop, nbd and
>> qcow2 targets with fused command:
>>
>> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
>>
>> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
>>
>> 	ublk add -t [loop|nbd|qcow2] -z .... 
>>
>> Also add liburing test case for covering fused command based on miniublk
>> of blktest.
>>
>> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
>>
>> Performance improvement is obvious on memory bandwidth related workloads,
>> such as, 1~2X improvement on 64K/512K BS IO test on loop with ramfs backing file.
>> ublk-null shows 5X IOPS improvement on big BS test when the copy is avoided.
>>
>> Please review and consider for v6.4.
>>
>> V6:
>> 	- re-design fused command, and make it more generic, moving sharing buffer
>> 	as one plugin of fused command, so in future we can implement more plugins
>> 	- document potential other use cases of fused command
>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>> 	  requests has standalone SQE
>> 	- make fused command as one feature
>> 	- cleanup & improve naming
> 
> Hi Jens,
> 
> Can you apply ublk cleanup patches 7~11 on for-6.4? For others, we may
> delay to 6.5, and I am looking at other approach too.

Done - and yes, we're probably looking at 6.5 for the rest. But that's
fine, I'd rather end up with the right interface than try and rush one.

Ming Lei April 4, 2023, 7:48 a.m. UTC | #4

Hello Jens and Everyone,

On Sun, Apr 02, 2023 at 07:24:17PM -0600, Jens Axboe wrote:
> On 4/2/23 7:11 PM, Ming Lei wrote:
> > On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
> >> Hello Jens and Guys,
> >>
> >> Add generic fused command, which can include one primary command and multiple
> >> secondary requests. This command provides one safe way to share resource between
> >> primary command and secondary requests, and primary command is always
> >> completed after all secondary requests are done, and resource lifetime
> >> is bound with primary command.
> >>
> >> With this way, it is easy to support zero copy for ublk/fuse device, and
> >> there could be more potential use cases, such as offloading complicated logic
> >> into userspace, or decouple kernel subsystems.
> >>
> >> Follows ublksrv code, which implements zero copy for loop, nbd and
> >> qcow2 targets with fused command:
> >>
> >> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> >>
> >> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
> >>
> >> 	ublk add -t [loop|nbd|qcow2] -z .... 
> >>
> >> Also add liburing test case for covering fused command based on miniublk
> >> of blktest.
> >>
> >> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> >>
> >> Performance improvement is obvious on memory bandwidth related workloads,
> >> such as, 1~2X improvement on 64K/512K BS IO test on loop with ramfs backing file.
> >> ublk-null shows 5X IOPS improvement on big BS test when the copy is avoided.
> >>
> >> Please review and consider for v6.4.
> >>
> >> V6:
> >> 	- re-design fused command, and make it more generic, moving sharing buffer
> >> 	as one plugin of fused command, so in future we can implement more plugins
> >> 	- document potential other use cases of fused command
> >> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> >> 	  requests has standalone SQE
> >> 	- make fused command as one feature
> >> 	- cleanup & improve naming
> > 
> > Hi Jens,
> > 
> > Can you apply ublk cleanup patches 7~11 on for-6.4? For others, we may
> > delay to 6.5, and I am looking at other approach too.
> 
> Done - and yes, we're probably looking at 6.5 for the rest. But that's

Thanks!

> fine, I'd rather end up with the right interface than try and rush one.

Also, here is a summary of this work, which may help anyone interested
in it. These are the three approaches we have tried or proposed:

1) splice can't do this job[1][2]

2) fused command in this patchset
- it is more like sendfile() or copy_file_range(), because the internal
  buffer isn't exposed outside

- v6 becomes a bit more generic. The idea is that one SQE list is submitted
logically as a whole request: the 1st SQE is the primary command, which
provides the buffer for the others and is responsible for submitting the other
(secondary) SQEs in this list; the primary command isn't completed until all
secondary requests are done

- this approach solves two problems efficiently in one simple way:

	a) buffer lifetime issue: the buffer lifetime is the same as the primary command's, so
	all secondary OPs can be submitted & completed safely

	b) request dependency issue: all secondary requests depend on the primary command,
	while the secondary requests themselves can be independent of each other, so
	secondary requests can be submitted in non-async style, and all of them can be issued
	concurrently

- this approach is simple, because the buffer isn't exposed outside and is
  just shared among these secondary requests; meanwhile the internal
  buffer saves us from complicated dependency issues between OPs, and avoids
  the contention that registering the buffer somewhere between the submission
  and completion code paths would add

- the drawback is that we add one new SQE usage model (a primary SQE plus
  secondary SQEs forming one logical request), which is
  like sendfile() or copy_file_range()

3) register transient buffers for OPs[3]
- it is more like splice(), which is flexible and could be more generic, but
the internal buffer is added to a pipe which is visible outside, so the
implementation becomes complicated; and it has to do more than splice(),
because the io buffer needs to be shared among multiple OPs

- inefficient & complicated

	a) the buffer has to be added to one global container (suppose it is
	an io_uring context pipe) by an ADD_BUF OP; then either the buffer is removed after
	the consumer OPs are completed, or a DEL_BUF OP is run to remove the buffer explicitly, so
	either contention on the io_uring pipe is added, or another new dependency is
	added (DEL_BUF depends on all normal OPs)

	b) the ADD_BUF OP is needed, and normal OPs have to depend on this new
	OP via IOSQE_IO_LINK, so all normal OPs will be submitted in async way;
	even worse, the normal OPs have to be issued one by one, because io_uring
	isn't capable of handling 1:N dependency[5]

	c) if a DEL_BUF OP is needed, then it is basically not possible
	to solve the 1:N dependency any more, given DEL_BUF then depends on the previous
	N OPs; otherwise, contention on the pipe is inevitable.

	d) the 1:N dependency issue would have to be solved generically

- advantage

Follows current io_uring SQE usage, and looks more generic/flexible,
like splice().

4) other approaches or suggestions?

Any idea is welcome as usual.


Finally, from the problem's viewpoint: if the problem domain is just ublk/fuse zero copy
or other similar problems[6], fused command might be the simpler & more efficient
approach compared with approach 3). However, are there any other problems we
want to cover with one more generic/flexible interface? If not, would we
like to pay the complexity & inefficiency of approach 3) for this narrower
kind of problem?


[1] https://lore.kernel.org/linux-block/ZCQnHwrXvSOQHfAC@ovpn-8-26.pek2.redhat.com/T/#m1bfa358524b6af94731bcd5be28056f9f4408ecf
[2] https://github.com/ming1/linux/blob/my_v6.3-io_uring_fuse_cmd_v6/Documentation/block/ublk.rst#zero-copy
[3] https://lore.kernel.org/linux-block/ZCQnHwrXvSOQHfAC@ovpn-8-26.pek2.redhat.com/T/#mbe428dfeb0417487cd1db7e6dabca7399a3c265b
[4] https://lore.kernel.org/linux-block/ZCQnHwrXvSOQHfAC@ovpn-8-26.pek2.redhat.com/T/#md035ffa4c6b69e85de2ab145418a9849a3b33741
[5] https://lore.kernel.org/linux-block/20230330113630.1388860-5-ming.lei@redhat.com/T/#m5e0c282ad26d9f3d8e519645168aeb3a19b5740b
[6] https://lore.kernel.org/linux-block/20230330113630.1388860-5-ming.lei@redhat.com/T/#me5cca4db606541fae452d625780635fcedcd5c6c

Thanks,
Ming

Bernd Schubert April 18, 2023, 7:38 p.m. UTC | #5

On 3/30/23 13:36, Ming Lei wrote:
[...]
> V6:
> 	- re-design fused command, and make it more generic, moving sharing buffer
> 	as one plugin of fused command, so in future we can implement more plugins
> 	- document potential other use cases of fused command
> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> 	  requests has standalone SQE
> 	- make fused command as one feature
> 	- cleanup & improve naming

Hi Ming, et al.,

I started to wonder if fused SQE could be extended to combine multiple 
syscalls, for example open/read/close.  Which would be another solution 
for the readfile syscall Miklos had proposed some time ago.

https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/

If fused SQEs could be extended, I think it would be quite helpful for 
many other patterns. Another similar examples would open/write/close, 
but ideal would be also to allow to have it more complex like 
"open/write/sync_file_range/close" - open/write/close might be the 
fastest and could possibly return before sync_file_range. Use case for 
the latter would be a file server that wants to give notifications to 
client when pages have been written out.

Thanks,
Bernd

Ming Lei April 19, 2023, 1:51 a.m. UTC | #6

On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
> [...]
> > V6:
> > 	- re-design fused command, and make it more generic, moving sharing buffer
> > 	as one plugin of fused command, so in future we can implement more plugins
> > 	- document potential other use cases of fused command
> > 	- drop support for builtin secondary sqe in SQE128, so all secondary
> > 	  requests has standalone SQE
> > 	- make fused command as one feature
> > 	- cleanup & improve naming
> 
> Hi Ming, et al.,
> 
> I started to wonder if fused SQE could be extended to combine multiple 
> syscalls, for example open/read/close.  Which would be another solution 
> for the readfile syscall Miklos had proposed some time ago.
> 
> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
> 
> If fused SQEs could be extended, I think it would be quite helpful for 
> many other patterns. Another similar examples would open/write/close, 
> but ideal would be also to allow to have it more complex like 
> "open/write/sync_file_range/close" - open/write/close might be the 
> fastest and could possibly return before sync_file_range. Use case for 
> the latter would be a file server that wants to give notifications to 
> client when pages have been written out.

The above pattern doesn't need fused command; it can be done with a plain
SQE chain, used as follows:

1) suppose you get one command from /dev/fuse, then FUSE daemon
needs to handle the command as open/write/sync/close
2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
5) get sqe4, prepare it for close syscall
6) io_uring_enter();	//for submit and get events

Then all the four OPs are done one by one by io_uring internal
machinery, and you can choose to get successful CQE for each OP.

Is the above what you want to do?

The fused command proposal is actually for zero copy(but not limited to zc).

If the above write OP needs to write to the file directly from the in-kernel
buffer of /dev/fuse, you can get one sqe0 and prepare it as the primary command
before 1), and set sqe2->addr to the offset into that buffer in 3).

However, fused command is usually used in the following way: suppose the FUSE
daemon gets one READ request from /dev/fuse; FUSE userspace can handle the READ
request as an io_uring fused command:

1) get sqe0 and prepare it for primary command, in which you need to
provide info for retrieving kernel buffer/pages of this READ request

2) suppose this READ request needs to be handled by translating it to
READs to two files/devices, considering it as one mirror:

- get sqe1, prepare it for read from file1, and set sqe->addr to offset
  of the buffer in 1), set sqe->len as length for read; this READ OP
  uses the kernel buffer in 1) directly 

- get sqe2, prepare it for read from file2, and set sqe->addr to offset
  of buffer in 1), set sqe->len as length for read;  this READ OP
  uses the kernel buffer in 1) directly 

3) submit the three sqe by io_uring_enter()

sqe1 and sqe2 can be submitted concurrently or be issued one by one
in order; fused command supports both, depending on user requirement.
But io_uring linked OPs are usually slower.

Also file1/file2 need to be opened beforehand in this example, and the FD is
passed to sqe1/sqe2; another choice is to use a fixed file. You can also
add the open/close() OPs into the above steps, which needs these open/close/READ
OPs to be linked in order, and is usually slower than non-linked OPs.


Thanks, 
Ming

Bernd Schubert April 19, 2023, 9:56 a.m. UTC | #7

On 4/19/23 03:51, Ming Lei wrote:
> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
>> On 3/30/23 13:36, Ming Lei wrote:
>> [...]
>>> V6:
>>> 	- re-design fused command, and make it more generic, moving sharing buffer
>>> 	as one plugin of fused command, so in future we can implement more plugins
>>> 	- document potential other use cases of fused command
>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>>> 	  requests has standalone SQE
>>> 	- make fused command as one feature
>>> 	- cleanup & improve naming
>>
>> Hi Ming, et al.,
>>
>> I started to wonder if fused SQE could be extended to combine multiple
>> syscalls, for example open/read/close.  Which would be another solution
>> for the readfile syscall Miklos had proposed some time ago.
>>
>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
>>
>> If fused SQEs could be extended, I think it would be quite helpful for
>> many other patterns. Another similar examples would open/write/close,
>> but ideal would be also to allow to have it more complex like
>> "open/write/sync_file_range/close" - open/write/close might be the
>> fastest and could possibly return before sync_file_range. Use case for
>> the latter would be a file server that wants to give notifications to
>> client when pages have been written out.
> 
> The above pattern needn't fused command, and it can be done by plain
> SQEs chain, follows the usage:
> 
> 1) suppose you get one command from /dev/fuse, then FUSE daemon
> needs to handle the command as open/write/sync/close
> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
> 5) get sqe4, prepare it for close syscall
> 6) io_uring_enter();	//for submit and get events

Oh, I was not aware that IOSQE_IO_LINK could pass the result of open 
down to the others. Hmm, the example I find for open is 
io_uring_prep_openat_direct in test_open_fixed(). It probably gets off 
topic here, but one needs to have ring prepared with 
io_uring_register_files_sparse, then manually manages available indexes 
and can then link commands? Interesting!

> 
> Then all the four OPs are done one by one by io_uring internal
> machinery, and you can choose to get successful CQE for each OP.
> 
> Is the above what you want to do?
> 
> The fused command proposal is actually for zero copy(but not limited to zc).

Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to 
support generic passing, as it kind of hands data (buffers) from one sqe 
to the other. I.e. instead of buffers it would have passed the fd, but 
if this is already possible - no need to make IORING_OP_FUSED_CMD more 
complex.

> 
> If the above write OP need to write to file with in-kernel buffer
> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
> before 1), and set sqe2->addr to offet of the buffer in 3).
> 
> However, fused command is usually used in the following way, such as FUSE daemon
> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
> as io_uring fused command:
> 
> 1) get sqe0 and prepare it for primary command, in which you need to
> provide info for retrieving kernel buffer/pages of this READ request
> 
> 2) suppose this READ request needs to be handled by translating it to
> READs to two files/devices, considering it as one mirror:
> 
> - get sqe1, prepare it for read from file1, and set sqe->addr to offset
>    of the buffer in 1), set sqe->len as length for read; this READ OP
>    uses the kernel buffer in 1) directly
> 
> - get sqe2, prepare it for read from file2, and set sqe->addr to offset
>    of buffer in 1), set sqe->len as length for read;  this READ OP
>    uses the kernel buffer in 1) directly
> 
> 3) submit the three sqe by io_uring_enter()
> 
> sqe1 and sqe2 can be submitted concurrently or be issued one by one
> in order, fused command supports both, and depends on user requirement.
> But io_uring linked OPs is usually slower.
> 
> Also file1/file2 needs to be opened beforehand in this example, and FD is
> passed to sqe1/sqe2, another choice is to use fixed File; Also you can
> add the open/close() OPs into above steps, which need these open/close/READ
> to be linked in order, usually slower tnan non-linked OPs.


Yes thanks, I'm going to prepare this in a branch, otherwise current
fuse-uring would have a ZC regression (although my target DDN projects
cannot make use of it, as we need access to the buffer for checksums, etc).


Thanks,
Bernd

Ming Lei April 19, 2023, 11:19 a.m. UTC | #8

On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
> On 4/19/23 03:51, Ming Lei wrote:
> > On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> >> On 3/30/23 13:36, Ming Lei wrote:
> >> [...]
> >>> V6:
> >>> 	- re-design fused command, and make it more generic, moving sharing buffer
> >>> 	as one plugin of fused command, so in future we can implement more plugins
> >>> 	- document potential other use cases of fused command
> >>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> >>> 	  requests has standalone SQE
> >>> 	- make fused command as one feature
> >>> 	- cleanup & improve naming
> >>
> >> Hi Ming, et al.,
> >>
> >> I started to wonder if fused SQE could be extended to combine multiple
> >> syscalls, for example open/read/close.  Which would be another solution
> >> for the readfile syscall Miklos had proposed some time ago.
> >>
> >> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
> >>
> >> If fused SQEs could be extended, I think it would be quite helpful for
> >> many other patterns. Another similar examples would open/write/close,
> >> but ideal would be also to allow to have it more complex like
> >> "open/write/sync_file_range/close" - open/write/close might be the
> >> fastest and could possibly return before sync_file_range. Use case for
> >> the latter would be a file server that wants to give notifications to
> >> client when pages have been written out.
> > 
> > The above pattern needn't fused command, and it can be done by plain
> > SQEs chain, follows the usage:
> > 
> > 1) suppose you get one command from /dev/fuse, then FUSE daemon
> > needs to handle the command as open/write/sync/close
> > 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
> > 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
> > 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
> > 5) get sqe4, prepare it for close syscall
> > 6) io_uring_enter();	//for submit and get events
> 
> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open 
> down to the others. Hmm, the example I find for open is 
> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off 
> topic here, but one needs to have ring prepared with 
> io_uring_register_files_sparse, then manually manages available indexes 
> and can then link commands? Interesting!

Yeah,  see test/fixed-reuse.c of liburing

> 
> > 
> > Then all the four OPs are done one by one by io_uring internal
> > machinery, and you can choose to get successful CQE for each OP.
> > 
> > Is the above what you want to do?
> > 
> > The fused command proposal is actually for zero copy(but not limited to zc).
> 
> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to 
> support generic passing, as it kind of hands data (buffers) from one sqe 
> to the other. I.e. instead of buffers it would have passed the fd, but 
> if this is already possible - no need to make IORING_OP_FUSED_CMD more 
> complex.

Passing the FD that way has its own costs: the linked read OP ends up
running async, and the FD has to be added to the global table, both of
which add runtime overhead.

That is why fused command is designed in the following way:

- links can be avoided, so OPs don't need to run async
- there is no need to add the buffer to the global table

This matters because it is really the fast I/O path.

> 
> > 
> > If the above write OP need to write to file with in-kernel buffer
> > of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
> > before 1), and set sqe2->addr to offset of the buffer in 3).
> > 
> > However, fused command is usually used in the following way, such as FUSE daemon
> > gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
> > as io_uring fused command:
> > 
> > 1) get sqe0 and prepare it for primary command, in which you need to
> > provide info for retrieving kernel buffer/pages of this READ request
> > 
> > 2) suppose this READ request needs to be handled by translating it to
> > READs to two files/devices, considering it as one mirror:
> > 
> > - get sqe1, prepare it for read from file1, and set sqe->addr to offset
> >    of the buffer in 1), set sqe->len as length for read; this READ OP
> >    uses the kernel buffer in 1) directly
> > 
> > - get sqe2, prepare it for read from file2, and set sqe->addr to offset
> >    of buffer in 1), set sqe->len as length for read;  this READ OP
> >    uses the kernel buffer in 1) directly
> > 
> > 3) submit the three sqe by io_uring_enter()
> > 
> > sqe1 and sqe2 can be submitted concurrently or be issued one by one
> > in order, fused command supports both, and depends on user requirement.
> > But io_uring linked OPs is usually slower.
> > 
> > Also file1/file2 needs to be opened beforehand in this example, and FD is
> > passed to sqe1/sqe2, another choice is to use fixed File; Also you can
> > add the open/close() OPs into above steps, which need these open/close/READ
> > to be linked in order, usually slower than non-linked OPs.
> 
> 
> Yes thanks, I'm going to prepare this in a branch, otherwise current 
> fuse-uring would have a ZC regression (although my target ddn projects 
> cannot make use of it, as we need access to the buffer for checksums, etc).

Storage has similar use cases too, such as encryption and the NVMe/TCP
data digest. If the checksum/encryption approach is standard, maybe a new
OP or syscall can be added to do that on the kernel buffer directly.


Thanks
Ming

Bernd Schubert April 19, 2023, 3:42 p.m. UTC | #9

On 4/19/23 13:19, Ming Lei wrote:
> On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
>> On 4/19/23 03:51, Ming Lei wrote:
>>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
>>>> On 3/30/23 13:36, Ming Lei wrote:
>>>> [...]
>>>>> V6:
>>>>> 	- re-design fused command, and make it more generic, moving sharing buffer
>>>>> 	as one plugin of fused command, so in future we can implement more plugins
>>>>> 	- document potential other use cases of fused command
>>>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>>>>> 	  requests has standalone SQE
>>>>> 	- make fused command as one feature
>>>>> 	- cleanup & improve naming
>>>>
>>>> Hi Ming, et al.,
>>>>
>>>> I started to wonder if fused SQE could be extended to combine multiple
>>>> syscalls, for example open/read/close.  Which would be another solution
>>>> for the readfile syscall Miklos had proposed some time ago.
>>>>
>>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
>>>>
>>>> If fused SQEs could be extended, I think it would be quite helpful for
>>>> many other patterns. Another similar examples would open/write/close,
>>>> but ideal would be also to allow to have it more complex like
>>>> "open/write/sync_file_range/close" - open/write/close might be the
>>>> fastest and could possibly return before sync_file_range. Use case for
>>>> the latter would be a file server that wants to give notifications to
>>>> client when pages have been written out.
>>>
>>> The above pattern needn't fused command, and it can be done by plain
>>> SQEs chain, follows the usage:
>>>
>>> 1) suppose you get one command from /dev/fuse, then FUSE daemon
>>> needs to handle the command as open/write/sync/close
>>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
>>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
>>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
>>> 5) get sqe4, prepare it for close syscall
>>> 6) io_uring_enter();	//for submit and get events
>>
>> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
>> down to the others. Hmm, the example I find for open is
>> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off
>> topic here, but one needs to have ring prepared with
>> io_uring_register_files_sparse, then manually manages available indexes
>> and can then link commands? Interesting!
> 
> Yeah,  see test/fixed-reuse.c of liburing
> 
>>
>>>
>>> Then all the four OPs are done one by one by io_uring internal
>>> machinery, and you can choose to get successful CQE for each OP.
>>>
>>> Is the above what you want to do?
>>>
>>> The fused command proposal is actually for zero copy(but not limited to zc).
>>
>> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
>> support generic passing, as it kind of hands data (buffers) from one sqe
>> to the other. I.e. instead of buffers it would have passed the fd, but
>> if this is already possible - no need to make IORING_OP_FUSED_CMD more
>> complex.
> 
> The way of passing FD introduces other cost, read op running into async,
> and adding it into global table, which introduces runtime cost.

Hmm, my question is why it needs to be in the global table when it could
just be passed to the linked or fused sqe?

> 
> That is the reason why fused command is designed in the following way:
> 
> - link can be avoided, so OPs needn't to be run in async
> - no need to add buffer into global table
> 
> Cause it is really in fast io path.
> 
>>
>>>
>>> If the above write OP need to write to file with in-kernel buffer
>>> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
>>> before 1), and set sqe2->addr to offset of the buffer in 3).
>>>
>>> However, fused command is usually used in the following way, such as FUSE daemon
>>> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
>>> as io_uring fused command:
>>>
>>> 1) get sqe0 and prepare it for primary command, in which you need to
>>> provide info for retrieving kernel buffer/pages of this READ request
>>>
>>> 2) suppose this READ request needs to be handled by translating it to
>>> READs to two files/devices, considering it as one mirror:
>>>
>>> - get sqe1, prepare it for read from file1, and set sqe->addr to offset
>>>     of the buffer in 1), set sqe->len as length for read; this READ OP
>>>     uses the kernel buffer in 1) directly
>>>
>>> - get sqe2, prepare it for read from file2, and set sqe->addr to offset
>>>     of buffer in 1), set sqe->len as length for read;  this READ OP
>>>     uses the kernel buffer in 1) directly
>>>
>>> 3) submit the three sqe by io_uring_enter()
>>>
>>> sqe1 and sqe2 can be submitted concurrently or be issued one by one
>>> in order, fused command supports both, and depends on user requirement.
>>> But io_uring linked OPs is usually slower.
>>>
>>> Also file1/file2 needs to be opened beforehand in this example, and FD is
>>> passed to sqe1/sqe2, another choice is to use fixed File; Also you can
>>> add the open/close() OPs into above steps, which need these open/close/READ
>>> to be linked in order, usually slower than non-linked OPs.
>>
>>
>> Yes thanks, I'm going to prepare this in a branch, otherwise current
>> fuse-uring would have a ZC regression (although my target ddn projects
>> cannot make use of it, as we need access to the buffer for checksums, etc).
> 
> storage has similar use case too, such as encrypt, nvme tcp data digest,
> ..., if the checksum/encrypt approach is standard, maybe one new OP or
> syscall can be added for doing that on kernel buffer directly.

I very much see the use case for FUSED_CMD with overlay or simple network
sockets. In the HPC world, however, one typically uses IB RDMA, with TCP or
other interfaces as a fallback if that fails for some reason (such as a
connection going down). On top of that, the right part of the buffer has to
be sent to the right server, and erasure coding is involved. It gets
complex, and I don't think there is a way for us to avoid a buffer copy.

Thanks,
Bernd

Pavel Begunkov April 20, 2023, 1:18 a.m. UTC | #10

On 4/19/23 16:42, Bernd Schubert wrote:
> On 4/19/23 13:19, Ming Lei wrote:
>> On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
>>> On 4/19/23 03:51, Ming Lei wrote:
>>>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
>>>>> On 3/30/23 13:36, Ming Lei wrote:
>>>>> [...]
>>>>>> V6:
>>>>>> 	- re-design fused command, and make it more generic, moving sharing buffer
>>>>>> 	as one plugin of fused command, so in future we can implement more plugins
>>>>>> 	- document potential other use cases of fused command
>>>>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>>>>>> 	  requests has standalone SQE
>>>>>> 	- make fused command as one feature
>>>>>> 	- cleanup & improve naming
>>>>>
>>>>> Hi Ming, et al.,
>>>>>
>>>>> I started to wonder if fused SQE could be extended to combine multiple
>>>>> syscalls, for example open/read/close.  Which would be another solution
>>>>> for the readfile syscall Miklos had proposed some time ago.
>>>>>
>>>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
>>>>>
>>>>> If fused SQEs could be extended, I think it would be quite helpful for
>>>>> many other patterns. Another similar examples would open/write/close,
>>>>> but ideal would be also to allow to have it more complex like
>>>>> "open/write/sync_file_range/close" - open/write/close might be the
>>>>> fastest and could possibly return before sync_file_range. Use case for
>>>>> the latter would be a file server that wants to give notifications to
>>>>> client when pages have been written out.
>>>>
>>>> The above pattern needn't fused command, and it can be done by plain
>>>> SQEs chain, follows the usage:
>>>>
>>>> 1) suppose you get one command from /dev/fuse, then FUSE daemon
>>>> needs to handle the command as open/write/sync/close
>>>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
>>>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
>>>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
>>>> 5) get sqe4, prepare it for close syscall
>>>> 6) io_uring_enter();	//for submit and get events
>>>
>>> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
>>> down to the others. Hmm, the example I find for open is
>>> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off
>>> topic here, but one needs to have ring prepared with
>>> io_uring_register_files_sparse, then manually manages available indexes
>>> and can then link commands? Interesting!
>>
>> Yeah,  see test/fixed-reuse.c of liburing
>>
>>>
>>>>
>>>> Then all the four OPs are done one by one by io_uring internal
>>>> machinery, and you can choose to get successful CQE for each OP.
>>>>
>>>> Is the above what you want to do?
>>>>
>>>> The fused command proposal is actually for zero copy(but not limited to zc).
>>>
>>> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
>>> support generic passing, as it kind of hands data (buffers) from one sqe
>>> to the other. I.e. instead of buffers it would have passed the fd, but
>>> if this is already possible - no need to make IORING_OP_FUSED_CMD more
>>> complex.
>>
>> The way of passing FD introduces other cost, read op running into async,
>> and adding it into global table, which introduces runtime cost.
> 
> Hmm, question from my side is why it needs to be in the global table,
> when it could be just passed to the linked or fused sqe?

Because every such type of state needs custom code, which doesn't scale,
not to mention that it usually can't be confined to a specific operation
and leaks into generic paths / other requests.

Some may want to pass a file or a buffer, there might be a need
to pass a result in some specific way (e.g. nr = recv(); send(nr)),
and the list continues...
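
To make the "nr = recv(); send(nr)" case concrete, here is a minimal
plain-POSIX sketch (no io_uring involved; forward_once() is a hypothetical
helper name used only for illustration). The length handed to send() exists
only after recv() has returned, so a request prepared ahead of time cannot
carry it unless some mechanism passes that state along:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: receive once from 'in' and forward exactly the
 * received bytes to 'out'. Returns the send() result, or the recv()
 * result if that was 0 (EOF) or negative (error).
 */
static ssize_t forward_once(int in, int out, char *buf, size_t cap)
{
    assert(buf != NULL);

    ssize_t nr = recv(in, buf, cap, 0);
    if (nr <= 0)
        return nr;

    /* 'nr' is the state that has to cross from one operation to the next. */
    return send(out, buf, (size_t)nr, 0);
}
```

In userspace the dependency is resolved by the return value; doing the same
between two pre-built requests is what needs the custom state passing
described above.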

I tried adding BPF in the middle about two years ago, but performance was
no better than returning to userspace, and it got worse with higher
submission batching. Maybe I need to test it again.

>> That is the reason why fused command is designed in the following way:
>>
>> - link can be avoided, so OPs needn't to be run in async
>> - no need to add buffer into global table
>>
>> Cause it is really in fast io path.
>>

Ming Lei April 20, 2023, 1:38 a.m. UTC | #11

On Wed, Apr 19, 2023 at 03:42:40PM +0000, Bernd Schubert wrote:
> On 4/19/23 13:19, Ming Lei wrote:
> > On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
> >> On 4/19/23 03:51, Ming Lei wrote:
> >>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> >>>> On 3/30/23 13:36, Ming Lei wrote:
> >>>> [...]
> >>>>> V6:
> >>>>> 	- re-design fused command, and make it more generic, moving sharing buffer
> >>>>> 	as one plugin of fused command, so in future we can implement more plugins
> >>>>> 	- document potential other use cases of fused command
> >>>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> >>>>> 	  requests has standalone SQE
> >>>>> 	- make fused command as one feature
> >>>>> 	- cleanup & improve naming
> >>>>
> >>>> Hi Ming, et al.,
> >>>>
> >>>> I started to wonder if fused SQE could be extended to combine multiple
> >>>> syscalls, for example open/read/close.  Which would be another solution
> >>>> for the readfile syscall Miklos had proposed some time ago.
> >>>>
> >>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
> >>>>
> >>>> If fused SQEs could be extended, I think it would be quite helpful for
> >>>> many other patterns. Another similar examples would open/write/close,
> >>>> but ideal would be also to allow to have it more complex like
> >>>> "open/write/sync_file_range/close" - open/write/close might be the
> >>>> fastest and could possibly return before sync_file_range. Use case for
> >>>> the latter would be a file server that wants to give notifications to
> >>>> client when pages have been written out.
> >>>
> >>> The above pattern needn't fused command, and it can be done by plain
> >>> SQEs chain, follows the usage:
> >>>
> >>> 1) suppose you get one command from /dev/fuse, then FUSE daemon
> >>> needs to handle the command as open/write/sync/close
> >>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
> >>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
> >>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
> >>> 5) get sqe4, prepare it for close syscall
> >>> 6) io_uring_enter();	//for submit and get events
> >>
> >> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
> >> down to the others. Hmm, the example I find for open is
> >> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off
> >> topic here, but one needs to have ring prepared with
> >> io_uring_register_files_sparse, then manually manages available indexes
> >> and can then link commands? Interesting!
> > 
> > Yeah,  see test/fixed-reuse.c of liburing
> > 
> >>
> >>>
> >>> Then all the four OPs are done one by one by io_uring internal
> >>> machinery, and you can choose to get successful CQE for each OP.
> >>>
> >>> Is the above what you want to do?
> >>>
> >>> The fused command proposal is actually for zero copy(but not limited to zc).
> >>
> >> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
> >> support generic passing, as it kind of hands data (buffers) from one sqe
> >> to the other. I.e. instead of buffers it would have passed the fd, but
> >> if this is already possible - no need to make IORING_OP_FUSED_CMD more
> >> complex.
> > 
> > The way of passing FD introduces other cost, read op running into async,
> > and adding it into global table, which introduces runtime cost.
> 
> Hmm, question from my side is why it needs to be in the global table, 
> when it could be just passed to the linked or fused sqe?

Any data that crosses OPs needs to be registered somewhere, such as a fixed
buffer or a fixed FD. Here "global" means context-wide, as seen from the
OP/SQE viewpoint.

A fused command is logically one whole command, even though it may include
multiple SQEs. Context-wide registration therefore isn't needed (the buffer
is known to be shared only among several IOs, not context-wide). Meanwhile
the dependency is avoided, so no link is needed either.

This helps performance a lot. For example, in a test of ublk/loop over
tmpfs with 4k random IO, IOPS drops to 1/2 with registration, while fused
command actually improves IOPS a bit. The baseline is the current in-tree
ublk driver/ublksrv.
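
As a rough user-space illustration of that scoping (a toy model only, not
the kernel implementation; struct fused_buf, struct secondary_req and both
helpers are hypothetical names), the primary command can be thought of as
owning one buffer, with each secondary request naming a window into it by
offset and length, the way sqe->addr and sqe->len are described earlier in
the thread. Nothing is entered into a context-wide table:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the buffer provided by the primary command. */
struct fused_buf {
    char  *base;
    size_t len;
};

/* Hypothetical stand-in for the two fields a secondary SQE would set. */
struct secondary_req {
    size_t addr; /* offset into the primary buffer, not a user pointer */
    size_t len;  /* number of bytes to transfer */
};

/* Map a secondary request onto the primary buffer; NULL if out of range. */
static char *secondary_resolve(const struct fused_buf *buf,
                               const struct secondary_req *req)
{
    assert(buf != NULL && req != NULL);

    if (req->addr > buf->len || req->len > buf->len - req->addr)
        return NULL;
    return buf->base + req->addr;
}

/* Model a READ: copy req->len bytes from 'src' into the shared buffer. */
static long secondary_read(const struct fused_buf *buf,
                           const struct secondary_req *req,
                           const char *src)
{
    char *dst = secondary_resolve(buf, req);

    if (dst == NULL)
        return -1;
    memcpy(dst, src, req->len);
    return (long)req->len;
}
```

The buffer is visible only to requests that are handed the same
struct fused_buf, which is the point being made about sharing among several
IOs rather than context-wide.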

> 
> > 
> > That is the reason why fused command is designed in the following way:
> > 
> > - link can be avoided, so OPs needn't to be run in async
> > - no need to add buffer into global table
> > 
> > Cause it is really in fast io path.
> > 
> >>
> >>>
> >>> If the above write OP need to write to file with in-kernel buffer
> >>> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
> >>> before 1), and set sqe2->addr to offset of the buffer in 3).
> >>>
> >>> However, fused command is usually used in the following way, such as FUSE daemon
> >>> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
> >>> as io_uring fused command:
> >>>
> >>> 1) get sqe0 and prepare it for primary command, in which you need to
> >>> provide info for retrieving kernel buffer/pages of this READ request
> >>>
> >>> 2) suppose this READ request needs to be handled by translating it to
> >>> READs to two files/devices, considering it as one mirror:
> >>>
> >>> - get sqe1, prepare it for read from file1, and set sqe->addr to offset
> >>>     of the buffer in 1), set sqe->len as length for read; this READ OP
> >>>     uses the kernel buffer in 1) directly
> >>>
> >>> - get sqe2, prepare it for read from file2, and set sqe->addr to offset
> >>>     of buffer in 1), set sqe->len as length for read;  this READ OP
> >>>     uses the kernel buffer in 1) directly
> >>>
> >>> 3) submit the three sqe by io_uring_enter()
> >>>
> >>> sqe1 and sqe2 can be submitted concurrently or be issued one by one
> >>> in order, fused command supports both, and depends on user requirement.
> >>> But io_uring linked OPs is usually slower.
> >>>
> >>> Also file1/file2 needs to be opened beforehand in this example, and FD is
> >>> passed to sqe1/sqe2, another choice is to use fixed File; Also you can
> >>> add the open/close() OPs into above steps, which need these open/close/READ
> >>> to be linked in order, usually slower than non-linked OPs.
> >>
> >>
> >> Yes thanks, I'm going to prepare this in a branch, otherwise current
> >> fuse-uring would have a ZC regression (although my target ddn projects
> >> cannot make use of it, as we need access to the buffer for checksums, etc).
> > 
> > storage has similar use case too, such as encrypt, nvme tcp data digest,
> > ..., if the checksum/encrypt approach is standard, maybe one new OP or
> > syscall can be added for doing that on kernel buffer directly.
> 
> I very much see the use case for FUSED_CMD for overlay or simple network 
> sockets. Now in the HPC world one typically uses IB  RDMA and if that 
> fails for some reasons (like connection down), tcp or other interfaces 
> as fallback. And there is sending the right part of the buffer to the 
> right server and erasure coding involved - it gets complex and I don't 
> think there is a way for us without a buffer copy.

As I mentioned, this (checksum, encryption, ...) becomes a generic issue if
the zero copy approach is accepted. The problem itself is well-defined, so
I am not worried that no solution can be found.

Meanwhile, big memory copies do consume a lot of both CPU and memory
bandwidth, and 64k/512k ublk IO has shown a big difference between copy and
zero copy.

Thanks,
Ming

Bernd Schubert April 21, 2023, 10:38 p.m. UTC | #12

On 4/20/23 03:38, Ming Lei wrote:
> On Wed, Apr 19, 2023 at 03:42:40PM +0000, Bernd Schubert wrote:
>> I very much see the use case for FUSED_CMD for overlay or simple network
>> sockets. Now in the HPC world one typically uses IB  RDMA and if that
>> fails for some reasons (like connection down), tcp or other interfaces
>> as fallback. And there is sending the right part of the buffer to the
>> right server and erasure coding involved - it gets complex and I don't
>> think there is a way for us without a buffer copy.
> 
> As I mentioned, it(checksum, encrypt, ...) becomes one generic issue if
> the zero copy approach is accepted, meantime the problem itself is well-defined,
> so I don't worry no solution can be figured out.
> 
> Meantime big memory copy does consume both cpu and memory bandwidth a
> lot, and 64k/512k ublk io has shown this big difference wrt. copy vs.
> zero copy.

I don't have any doubt about that, but I believe there is currently no way
to support it in all use cases. As an example, suppose we would like to
extend nbd with verbs/RDMA instead of plain TCP: verbs/RDMA needs
registered memory and does not take a simple socket fd to send buffers to.


Thanks,
Bernd
