Message ID | 20230330113630.1388860-1-ming.lei@redhat.com (mailing list archive) |
---|---|
Series | io_uring/ublk: add generic IORING_OP_FUSED_CMD |
On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
> Hello Jens and Guys,
>
> Add a generic fused command, which can include one primary command and
> multiple secondary requests. This command provides a safe way to share
> a resource between the primary command and the secondary requests: the
> primary command is always completed after all secondary requests are
> done, and the resource lifetime is bound to the primary command.
>
> This makes it easy to support zero copy for ublk/fuse devices, and
> there could be more potential use cases, such as offloading complicated
> logic into userspace, or decoupling kernel subsystems.
>
> The ublksrv code below implements zero copy for the loop, nbd and
> qcow2 targets with the fused command:
>
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
>
> All three (loop, nbd and qcow2) ublk targets support zero copy by passing:
>
>     ublk add -t [loop|nbd|qcow2] -z ....
>
> Also add a liburing test case covering the fused command, based on the
> miniublk of blktest.
>
> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
>
> The performance improvement is obvious on memory bandwidth related
> workloads, such as a 1~2X improvement on the 64K/512K BS IO test on loop
> with a ramfs backing file. ublk-null shows a 5X IOPS improvement on the
> big BS test when the copy is avoided.
>
> Please review and consider for v6.4.
>
> V6:
> - re-design the fused command and make it more generic, moving buffer
>   sharing into one plugin of the fused command, so in future we can
>   implement more plugins
> - document potential other use cases of the fused command
> - drop support for the builtin secondary sqe in SQE128, so all secondary
>   requests have a standalone SQE
> - make the fused command one feature
> - cleanup & improve naming

Hi Jens,

Can you apply ublk cleanup patches 7~11 on for-6.4? For the others, we may
delay to 6.5, and I am looking at other approaches too.

Thanks,
Ming
On Thu, 30 Mar 2023 19:36:13 +0800, Ming Lei wrote:
> Add generic fused command, which can include one primary command and multiple
> secondary requests. This command provides one safe way to share resource between
> primary command and secondary requests, and primary command is always
> completed after all secondary requests are done, and resource lifetime
> is bound with primary command.
>
> With this way, it is easy to support zero copy for ublk/fuse device, and
> there could be more potential use cases, such as offloading complicated logic
> into userspace, or decouple kernel subsystems.
>
> [...]

Applied, thanks!

[07/17] block: ublk_drv: add common exit handling
        commit: 903f8aeea9fd1b97fba4ab805ddd639f57f117f8
[08/17] block: ublk_drv: don't consider flush request in map/unmap io
        commit: 23ef8220f287abe5bf741ddfc278e7359742d3b1
[09/17] block: ublk_drv: add two helpers to clean up map/unmap request
        commit: 2f3af723447c35c16f3c6a1b4b317c61dc41d6c3
[10/17] block: ublk_drv: clean up several helpers
        commit: 96cf2f5404c8bc979628a2b495852d735a56c5b5
[11/17] block: ublk_drv: cleanup 'struct ublk_map_data'
        commit: ae9f5ccea4c268a96763e51239b32d6b5172c18c

Best regards,
On 4/2/23 7:11 PM, Ming Lei wrote:
> On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
> [...]
>
> Hi Jens,
>
> Can you apply ublk cleanup patches 7~11 on for-6.4? For others, we may
> delay to 6.5, and I am looking at other approach too.

Done - and yes, we're probably looking at 6.5 for the rest. But that's
fine, I'd rather end up with the right interface than try and rush one.
Hello Jens and Everyone,

On Sun, Apr 02, 2023 at 07:24:17PM -0600, Jens Axboe wrote:
> On 4/2/23 7:11 PM, Ming Lei wrote:
>> [...]
>> Hi Jens,
>>
>> Can you apply ublk cleanup patches 7~11 on for-6.4? For others, we may
>> delay to 6.5, and I am looking at other approach too.
>
> Done - and yes, we're probably looking at 6.5 for the rest. But that's

Thanks!

> fine, I'd rather end up with the right interface than try and rush one.

I'd also like to provide a summary of this work here, in case it helps
anyone interested in it. These are the three approaches we have tried or
proposed:

1) splice can't do this job[1][2]

2) fused command in this patchset

- it is more like sendfile() or copy_file_range(), because the internal
  buffer isn't exposed outside

- v6 becomes a bit more generic. The theory is that one SQE list is
  submitted as a whole logical request: the 1st SQE is the primary command,
  which provides the buffer for the others and is responsible for
  submitting the other (secondary) SQEs in this list; the primary command
  isn't completed until all secondary requests are done

- this approach solves two problems efficiently in one simple way:

  a) buffer lifetime: the buffer lifetime is the same as the primary
     command's, so all secondary OPs can be submitted & completed safely

  b) request dependency: all secondary requests depend on the primary
     command, while the secondary requests themselves can be independent
     of each other, so we can start allowing secondary requests to be
     submitted in non-async style, and all secondary requests can be
     issued concurrently

- this approach is simple, because we don't expose the buffer outside, and
  the buffer is just shared among these secondary requests; meanwhile the
  internal buffer saves us from complicated OP dependency issues and
  avoids the contention that registering the buffer anywhere between the
  submission and completion code paths would cause

- the drawback is that we add one new SQE usage model of primary SQE and
  secondary SQEs, plus the concept of the whole logical request, which is
  like sendfile() or copy_file_range()

3) register transient buffers for OPs[3]

- it is more like splice(), which is flexible and could be more generic,
  but the internal pipe buffer is added to a pipe that is visible outside,
  so the implementation becomes complicated; and it has to do more than
  splice(), because the io buffer needs to be shared among multiple OPs

- inefficient & complicated

  a) the buffer has to be added to one global container (suppose it is the
     io_uring context pipe) by an ADD_BUF OP, and either the buffer needs
     to be removed after the consumer OPs are completed, or a DEL_BUF OP
     is run to remove the buffer explicitly; so either contention on the
     io_uring pipe is added, or another new dependency is added (DEL_BUF
     depends on all normal OPs)

  b) an ADD_BUF OP is needed, and normal OPs have to depend on this new OP
     via IOSQE_IO_LINK; then all normal OPs will be submitted in async way
     and, even worse, each normal OP has to be issued one by one, because
     io_uring isn't capable of handling the 1:N dependency issue[5]

  c) if a DEL_BUF OP is needed, then it is basically not possible to solve
     the 1:N dependency any more, given that DEL_BUF starts to depend on
     the previous N OPs; otherwise, contention on the pipe is inevitable

  d) solving the 1:N dependency issue generically

- advantage

  Follows the current io_uring SQE usage, and looks more generic/flexible,
  like splice().

4) other approaches or suggestions?

Any idea is welcome as usual.

Finally, from the problem viewpoint: if the problem domain is just ublk/fuse
zero copy or other similar problems[6], the fused command might be the
simpler & more efficient approach compared with approach 3). However, are
there any other problems we want to cover with one more generic/flexible
interface? If not, would we like to pay the complexity & inefficiency for
one kind of less generic problem?

[1] https://lore.kernel.org/linux-block/ZCQnHwrXvSOQHfAC@ovpn-8-26.pek2.redhat.com/T/#m1bfa358524b6af94731bcd5be28056f9f4408ecf
[2] https://github.com/ming1/linux/blob/my_v6.3-io_uring_fuse_cmd_v6/Documentation/block/ublk.rst#zero-copy
[3] https://lore.kernel.org/linux-block/ZCQnHwrXvSOQHfAC@ovpn-8-26.pek2.redhat.com/T/#mbe428dfeb0417487cd1db7e6dabca7399a3c265b
[4] https://lore.kernel.org/linux-block/ZCQnHwrXvSOQHfAC@ovpn-8-26.pek2.redhat.com/T/#md035ffa4c6b69e85de2ab145418a9849a3b33741
[5] https://lore.kernel.org/linux-block/20230330113630.1388860-5-ming.lei@redhat.com/T/#m5e0c282ad26d9f3d8e519645168aeb3a19b5740b
[6] https://lore.kernel.org/linux-block/20230330113630.1388860-5-ming.lei@redhat.com/T/#me5cca4db606541fae452d625780635fcedcd5c6c

Thanks,
Ming
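Editorial sketch: the ordering rule described for approach 2) (the primary command completes only after every secondary request, and the shared buffer lives exactly as long as the primary) can be modelled in a few lines of userspace Python. This is only a toy model of the semantics stated above, not the kernel implementation; the class `FusedCommand`, its method names and the request names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FusedCommand:
    """Toy model: a primary command that lends one buffer to secondary requests."""

    buffer: Optional[bytearray]
    pending: int = 0            # secondary requests still in flight
    primary_done: bool = False  # the primary's own work has finished
    completions: List[Tuple[str, int]] = field(default_factory=list)

    def submit_secondary(self, name: str) -> str:
        # Every secondary pins the primary (and thus the buffer) while in flight.
        if self.buffer is None:
            raise RuntimeError("primary already completed; buffer is gone")
        self.pending += 1
        return name

    def complete_secondary(self, name: str, result: int) -> None:
        self.completions.append((name, result))
        self.pending -= 1
        self._maybe_complete_primary()

    def finish_primary_work(self) -> None:
        self.primary_done = True
        self._maybe_complete_primary()

    def _maybe_complete_primary(self) -> None:
        # The primary's completion is posted last, and the buffer dies with it.
        if self.primary_done and self.pending == 0 and self.buffer is not None:
            self.completions.append(("primary", 0))
            self.buffer = None


def demo() -> FusedCommand:
    cmd = FusedCommand(buffer=bytearray(8))
    first = cmd.submit_secondary("secondary-1")
    second = cmd.submit_secondary("secondary-2")
    cmd.finish_primary_work()          # primary cannot complete yet: 2 pending
    cmd.complete_secondary(second, 4)  # secondaries may finish in any order
    cmd.complete_secondary(first, 4)   # last one done -> primary completes
    return cmd
```

In this model the secondaries never wait on each other, only the primary waits on all of them, which is the 1:N dependency that the summary says plain linked SQEs cannot express.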
On 3/30/23 13:36, Ming Lei wrote:
[...]
> V6:
> - re-design fused command, and make it more generic, moving sharing buffer
>   as one plugin of fused command, so in future we can implement more plugins
> - document potential other use cases of fused command
> - drop support for builtin secondary sqe in SQE128, so all secondary
>   requests has standalone SQE
> - make fused command as one feature
> - cleanup & improve naming

Hi Ming, et al.,

I started to wonder if fused SQE could be extended to combine multiple
syscalls, for example open/read/close, which would be another solution
for the readfile syscall Miklos proposed some time ago.

https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/

If fused SQEs could be extended, I think it would be quite helpful for
many other patterns. Another similar example would be open/write/close,
but ideally it would also allow more complex sequences like
"open/write/sync_file_range/close" - open/write/close might be the
fastest and could possibly return before sync_file_range. A use case for
the latter would be a file server that wants to notify clients when
pages have been written out.

Thanks,
Bernd
On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
> [...]
>
> If fused SQEs could be extended, I think it would be quite helpful for
> many other patterns. Another similar example would be open/write/close,
> but ideally it would also allow more complex sequences like
> "open/write/sync_file_range/close" - open/write/close might be the
> fastest and could possibly return before sync_file_range. A use case for
> the latter would be a file server that wants to notify clients when
> pages have been written out.

The above pattern doesn't need the fused command; it can be done with a
plain chain of linked SQEs, as follows:

1) suppose you get one command from /dev/fuse, and the FUSE daemon
   needs to handle the command as open/write/sync/close
2) get sqe1, prepare it for the open syscall, mark it as IOSQE_IO_LINK;
3) get sqe2, prepare it for the write syscall, mark it as IOSQE_IO_LINK;
4) get sqe3, prepare it for the sync file range syscall, mark it as IOSQE_IO_LINK;
5) get sqe4, prepare it for the close syscall
6) io_uring_enter(); //for submit and get events

Then all four OPs are done one by one by the io_uring internal
machinery, and you can choose to get a successful CQE for each OP.

Is the above what you want to do?

The fused command proposal is actually for zero copy (but not limited to zc).

If the above write OP needs to write to the file directly from the
in-kernel buffer of /dev/fuse, you can get one sqe0 and prepare it for the
primary command before 1), and set sqe2->addr to the offset of the buffer
in 3).

However, the fused command is usually used in the following way. Suppose
the FUSE daemon gets one READ request from /dev/fuse; FUSE userspace can
handle the READ request as an io_uring fused command:

1) get sqe0 and prepare it for the primary command, in which you need to
   provide info for retrieving the kernel buffer/pages of this READ request

2) suppose this READ request needs to be handled by translating it to
   READs to two files/devices, considering it as one mirror:

   - get sqe1, prepare it for read from file1, and set sqe->addr to the
     offset of the buffer in 1), set sqe->len as the length for read; this
     READ OP uses the kernel buffer in 1) directly

   - get sqe2, prepare it for read from file2, and set sqe->addr to the
     offset of the buffer in 1), set sqe->len as the length for read; this
     READ OP uses the kernel buffer in 1) directly

3) submit the three sqes by io_uring_enter()

sqe1 and sqe2 can be submitted concurrently or issued one by one in
order; the fused command supports both, depending on the user's
requirement. But io_uring linked OPs are usually slower.

Also, file1/file2 need to be opened beforehand in this example, and the FD
is passed to sqe1/sqe2; another choice is to use fixed files. You can also
add the open/close() OPs into the above steps, which requires the
open/close/READ OPs to be linked in order, and that is usually slower than
non-linked OPs.

Thanks,
Ming
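Editorial sketch: the open/write/sync/close chain in steps 1) to 6) relies on the ordering that IOSQE_IO_LINK provides. The snippet below is a plain-Python model of that ordering, not liburing code: each linked request starts only after the previous one has finished, and once one request fails, the remaining requests in the chain are completed with -ECANCELED. The function name `run_linked_chain` and the stand-in callables are hypothetical.

```python
import errno
from typing import Callable, List, Tuple


def run_linked_chain(ops: List[Tuple[str, Callable[[], int]]]) -> List[Tuple[str, int]]:
    """Run ops strictly in order; each callable returns a CQE-style result
    (>= 0 on success, negative errno on failure)."""
    cqes = []
    broken = False
    for name, op in ops:
        if broken:
            # A broken chain cancels everything that was still queued behind it.
            cqes.append((name, -errno.ECANCELED))
            continue
        res = op()
        cqes.append((name, res))
        if res < 0:
            broken = True
    return cqes


def demo_chain(write_result: int) -> List[Tuple[str, int]]:
    # Stand-ins for the four linked requests; only the write result varies.
    return run_linked_chain([
        ("open", lambda: 0),
        ("write", lambda: write_result),
        ("sync_file_range", lambda: 0),
        ("close", lambda: 0),
    ])
```

With `demo_chain(4096)` every request reports its own result; with `demo_chain(-errno.ENOSPC)` the sync and close entries come back as -ECANCELED. An application that links the close into the chain therefore has to handle a cancelled close itself.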
On 4/19/23 03:51, Ming Lei wrote:
> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> [...]
>
> The above pattern doesn't need the fused command; it can be done with a
> plain chain of linked SQEs, as follows:
>
> 1) suppose you get one command from /dev/fuse, and the FUSE daemon
>    needs to handle the command as open/write/sync/close
> 2) get sqe1, prepare it for the open syscall, mark it as IOSQE_IO_LINK;
> 3) get sqe2, prepare it for the write syscall, mark it as IOSQE_IO_LINK;
> 4) get sqe3, prepare it for the sync file range syscall, mark it as IOSQE_IO_LINK;
> 5) get sqe4, prepare it for the close syscall
> 6) io_uring_enter(); //for submit and get events

Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
down to the others. Hmm, the example I find for open is
io_uring_prep_openat_direct in test_open_fixed(). It is probably getting
off topic here, but one needs to have the ring prepared with
io_uring_register_files_sparse, then manually manage the available
indexes, and can then link commands? Interesting!

> Then all four OPs are done one by one by the io_uring internal
> machinery, and you can choose to get a successful CQE for each OP.
>
> Is the above what you want to do?
>
> The fused command proposal is actually for zero copy (but not limited to zc).

Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
support generic passing, as it kind of hands data (buffers) from one sqe
to the other. I.e. instead of buffers it would have passed the fd, but
if this is already possible - no need to make IORING_OP_FUSED_CMD more
complex.

> If the above write OP needs to write to the file directly from the
> in-kernel buffer of /dev/fuse, you can get one sqe0 and prepare it for the
> primary command before 1), and set sqe2->addr to the offset of the buffer
> in 3).
>
> [...]
>
> Also, file1/file2 need to be opened beforehand in this example, and the FD
> is passed to sqe1/sqe2; another choice is to use fixed files. You can also
> add the open/close() OPs into the above steps, which requires the
> open/close/READ OPs to be linked in order, and that is usually slower than
> non-linked OPs.

Yes thanks, I'm going to prepare this in a branch, otherwise current
fuse-uring would have a ZC regression (although my target ddn projects
cannot make use of it, as we need access to the buffer for checksums, etc).

Thanks,
Bernd
On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
> On 4/19/23 03:51, Ming Lei wrote:
> [...]
>
> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
> down to the others. Hmm, the example I find for open is
> io_uring_prep_openat_direct in test_open_fixed(). It is probably getting
> off topic here, but one needs to have the ring prepared with
> io_uring_register_files_sparse, then manually manage the available
> indexes, and can then link commands? Interesting!

Yeah, see test/fixed-reuse.c of liburing

> [...]
>
> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
> support generic passing, as it kind of hands data (buffers) from one sqe
> to the other. I.e. instead of buffers it would have passed the fd, but
> if this is already possible - no need to make IORING_OP_FUSED_CMD more
> complex.

Passing the FD this way introduces other costs: the read op runs in async
context, and the FD has to be added into a global table, which introduces
runtime cost.

That is the reason why the fused command is designed in the following way:

- link can be avoided, so OPs needn't be run in async
- no need to add the buffer into a global table

Because it is really in the fast io path.

> [...]
>
> Yes thanks, I'm going to prepare this in a branch, otherwise current
> fuse-uring would have a ZC regression (although my target ddn projects
> cannot make use of it, as we need access to the buffer for checksums, etc).

Storage has similar use cases too, such as encryption, nvme tcp data
digest, ...; if the checksum/encryption approach is standard, maybe one new
OP or syscall can be added for doing that on the kernel buffer directly.

Thanks
Ming
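Editorial sketch: the "manually manage the available indexes" part of the fixed-file approach discussed above comes down to a small slot allocator in the application. After a sparse table has been registered, the application picks a free index for each direct open and returns it after the matching direct close. The model below is plain Python and makes no io_uring calls; the class `FixedFileSlots` and its methods are hypothetical names, and handing out the lowest free index first is an assumption rather than something io_uring requires.

```python
class FixedFileSlots:
    """Tracks which indexes of a registered (sparse) fixed-file table are free."""

    def __init__(self, nr_slots: int) -> None:
        # Kept as a stack with the lowest index on top, so pop() returns
        # slot 0 first, then 1, and so on.
        self.free = list(range(nr_slots - 1, -1, -1))
        self.in_use = set()

    def alloc(self) -> int:
        """Pick the slot index to pass to a direct open."""
        if not self.free:
            raise RuntimeError("no free fixed-file slot")
        idx = self.free.pop()
        self.in_use.add(idx)
        return idx

    def release(self, idx: int) -> None:
        """Return the slot once the direct close for it has completed."""
        if idx not in self.in_use:
            raise ValueError(f"slot {idx} is not in use")
        self.in_use.remove(idx)
        self.free.append(idx)
```

This per-ring bookkeeping, plus the linking needed to order open before read, is the kind of runtime cost the fused command design tries to keep out of the fast io path.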
On 4/19/23 13:19, Ming Lei wrote:
> On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
> [...]
>
> Passing the FD this way introduces other costs: the read op runs in async
> context, and the FD has to be added into a global table, which introduces
> runtime cost.

Hmm, question from my side is why it needs to be in the global table,
when it could be just passed to the linked or fused sqe?

> That is the reason why the fused command is designed in the following way:
>
> - link can be avoided, so OPs needn't be run in async
> - no need to add the buffer into a global table
>
> Because it is really in the fast io path.
>
> [...]
>
> Storage has similar use cases too, such as encryption, nvme tcp data
> digest, ...; if the checksum/encryption approach is standard, maybe one new
> OP or syscall can be added for doing that on the kernel buffer directly.

I very much see the use case for FUSED_CMD for overlay or simple network
sockets. Now in the HPC world one typically uses IB RDMA and, if that
fails for some reason (like connection down), tcp or other interfaces as
fallback. And there is sending the right part of the buffer to the right
server, and erasure coding involved - it gets complex and I don't think
there is a way for us without a buffer copy.

Thanks,
Bernd
On 4/19/23 16:42, Bernd Schubert wrote: > On 4/19/23 13:19, Ming Lei wrote: >> On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote: >>> On 4/19/23 03:51, Ming Lei wrote: >>>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote: >>>>> On 3/30/23 13:36, Ming Lei wrote: >>>>> [...] >>>>>> V6: >>>>>> - re-design fused command, and make it more generic, moving sharing buffer >>>>>> as one plugin of fused command, so in future we can implement more plugins >>>>>> - document potential other use cases of fused command >>>>>> - drop support for builtin secondary sqe in SQE128, so all secondary >>>>>> requests has standalone SQE >>>>>> - make fused command as one feature >>>>>> - cleanup & improve naming >>>>> >>>>> Hi Ming, et al., >>>>> >>>>> I started to wonder if fused SQE could be extended to combine multiple >>>>> syscalls, for example open/read/close. Which would be another solution >>>>> for the readfile syscall Miklos had proposed some time ago. >>>>> >>>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/ >>>>> >>>>> If fused SQEs could be extended, I think it would be quite helpful for >>>>> many other patterns. Another similar examples would open/write/close, >>>>> but ideal would be also to allow to have it more complex like >>>>> "open/write/sync_file_range/close" - open/write/close might be the >>>>> fastest and could possibly return before sync_file_range. Use case for >>>>> the latter would be a file server that wants to give notifications to >>>>> client when pages have been written out. 
>>>> >>>> The above pattern needn't fused command, and it can be done by plain >>>> SQEs chain, follows the usage: >>>> >>>> 1) suppose you get one command from /dev/fuse, then FUSE daemon >>>> needs to handle the command as open/write/sync/close >>>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK; >>>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK; >>>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK; >>>> 5) get sqe4, prepare it for close syscall >>>> 6) io_uring_enter(); //for submit and get events >>> >>> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open >>> down to the others. Hmm, the example I find for open is >>> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off >>> topic here, but one needs to have ring prepared with >>> io_uring_register_files_sparse, then manually manages available indexes >>> and can then link commands? Interesting! >> >> Yeah, see test/fixed-reuse.c of liburing >> >>> >>>> >>>> Then all the four OPs are done one by one by io_uring internal >>>> machinery, and you can choose to get successful CQE for each OP. >>>> >>>> Is the above what you want to do? >>>> >>>> The fused command proposal is actually for zero copy(but not limited to zc). >>> >>> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to >>> support generic passing, as it kind of hands data (buffers) from one sqe >>> to the other. I.e. instead of buffers it would have passed the fd, but >>> if this is already possible - no need to make IORING_OP_FUSED_CMD more >>> complex.man >> >> The way of passing FD introduces other cost, read op running into async, >> and adding it into global table, which introduces runtime cost. > > Hmm, question from my side is why it needs to be in the global table, > when it could be just passed to the linked or fused sqe? 
Because every such type of state needs its own custom code, which doesn't scale, not to mention that it usually can't be confined to a specific operation and leaks into generic paths and other requests. Some may want to pass a file or a buffer, others may need to pass a result in some specific way (e.g. nr = recv(); send(nr)), and the list goes on. I tried adding BPF in the middle about two years ago, but performance was no different from returning to userspace, and it got worse with higher submission batching. Maybe I need to test it again. >> That is the reason why fused command is designed in the following way: >> >> - link can be avoided, so OPs needn't to be run in async >> - no need to add buffer into global table >> >> Cause it is really in fast io path. >>
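For reference, the plain linked-SQE chain quoted earlier in this thread (open/write/sync_file_range/close) could be sketched roughly as below. This only fills in four SQEs using the field layout from <linux/io_uring.h>; it does not set up a ring or submit anything. The helper name prep_open_write_sync_close() and the choice of open flags, mode and sync flags are illustrative assumptions, not something taken from the patches. It also assumes the fixed-file table was registered beforehand (e.g. with io_uring_register_files_sparse(), as mentioned above), kernel headers recent enough to have sqe->file_index, and a kernel on which a linked request can use the direct descriptor installed by the preceding open.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

#ifndef SYNC_FILE_RANGE_WRITE
#define SYNC_FILE_RANGE_WRITE 2   /* fallback if <fcntl.h> did not expose it */
#endif

/*
 * Hypothetical helper: prepare open -> write -> sync_file_range -> close
 * as one IOSQE_IO_LINK chain. 'slot' is an index into the ring's
 * fixed-file table, which the caller manages.
 */
static void prep_open_write_sync_close(struct io_uring_sqe sqe[4],
                                       const char *path, const void *buf,
                                       uint32_t len, uint32_t slot)
{
    assert(sqe && path && buf);
    memset(sqe, 0, 4 * sizeof(sqe[0]));

    /* sqe1: open directly into the fixed-file slot, link to the next SQE */
    sqe[0].opcode     = IORING_OP_OPENAT;
    sqe[0].fd         = AT_FDCWD;
    sqe[0].addr       = (uint64_t)(uintptr_t)path;
    sqe[0].len        = 0644;                 /* mode (assumed) */
    sqe[0].open_flags = O_WRONLY | O_CREAT;   /* flags (assumed) */
    sqe[0].file_index = slot + 1;             /* direct slots are encoded as index + 1 */
    sqe[0].flags      = IOSQE_IO_LINK;

    /* sqe2: write through the fixed slot, link to the next SQE */
    sqe[1].opcode = IORING_OP_WRITE;
    sqe[1].fd     = (int32_t)slot;
    sqe[1].addr   = (uint64_t)(uintptr_t)buf;
    sqe[1].len    = len;
    sqe[1].off    = 0;
    sqe[1].flags  = IOSQE_FIXED_FILE | IOSQE_IO_LINK;

    /* sqe3: sync_file_range over what was written, link to the next SQE */
    sqe[2].opcode           = IORING_OP_SYNC_FILE_RANGE;
    sqe[2].fd               = (int32_t)slot;
    sqe[2].off              = 0;
    sqe[2].len              = len;
    sqe[2].sync_range_flags = SYNC_FILE_RANGE_WRITE;
    sqe[2].flags            = IOSQE_FIXED_FILE | IOSQE_IO_LINK;

    /* sqe4: close the fixed slot; last in the chain, so no link flag */
    sqe[3].opcode     = IORING_OP_CLOSE;
    sqe[3].file_index = slot + 1;
}
```

With liburing the same fields would normally be set through io_uring_prep_openat_direct(), io_uring_prep_write(), io_uring_prep_sync_file_range() and io_uring_prep_close_direct() on SQEs obtained from the ring. Note that what travels along the chain here is only ordering; the file itself goes through the ring-wide fixed-file table, which is the context-wide state discussed above.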
On Wed, Apr 19, 2023 at 03:42:40PM +0000, Bernd Schubert wrote: > On 4/19/23 13:19, Ming Lei wrote: > > On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote: > >> On 4/19/23 03:51, Ming Lei wrote: > >>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote: > >>>> On 3/30/23 13:36, Ming Lei wrote: > >>>> [...] > >>>>> V6: > >>>>> - re-design fused command, and make it more generic, moving sharing buffer > >>>>> as one plugin of fused command, so in future we can implement more plugins > >>>>> - document potential other use cases of fused command > >>>>> - drop support for builtin secondary sqe in SQE128, so all secondary > >>>>> requests has standalone SQE > >>>>> - make fused command as one feature > >>>>> - cleanup & improve naming > >>>> > >>>> Hi Ming, et al., > >>>> > >>>> I started to wonder if fused SQE could be extended to combine multiple > >>>> syscalls, for example open/read/close. Which would be another solution > >>>> for the readfile syscall Miklos had proposed some time ago. > >>>> > >>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/ > >>>> > >>>> If fused SQEs could be extended, I think it would be quite helpful for > >>>> many other patterns. Another similar examples would open/write/close, > >>>> but ideal would be also to allow to have it more complex like > >>>> "open/write/sync_file_range/close" - open/write/close might be the > >>>> fastest and could possibly return before sync_file_range. Use case for > >>>> the latter would be a file server that wants to give notifications to > >>>> client when pages have been written out. 
> >>> > >>> The above pattern needn't fused command, and it can be done by plain > >>> SQEs chain, follows the usage: > >>> > >>> 1) suppose you get one command from /dev/fuse, then FUSE daemon > >>> needs to handle the command as open/write/sync/close > >>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK; > >>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK; > >>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK; > >>> 5) get sqe4, prepare it for close syscall > >>> 6) io_uring_enter(); //for submit and get events > >> > >> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open > >> down to the others. Hmm, the example I find for open is > >> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off > >> topic here, but one needs to have ring prepared with > >> io_uring_register_files_sparse, then manually manages available indexes > >> and can then link commands? Interesting! > > > > Yeah, see test/fixed-reuse.c of liburing > > > >> > >>> > >>> Then all the four OPs are done one by one by io_uring internal > >>> machinery, and you can choose to get successful CQE for each OP. > >>> > >>> Is the above what you want to do? > >>> > >>> The fused command proposal is actually for zero copy(but not limited to zc). > >> > >> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to > >> support generic passing, as it kind of hands data (buffers) from one sqe > >> to the other. I.e. instead of buffers it would have passed the fd, but > >> if this is already possible - no need to make IORING_OP_FUSED_CMD more > >> complex.man > > > > The way of passing FD introduces other cost, read op running into async, > > and adding it into global table, which introduces runtime cost. > > Hmm, question from my side is why it needs to be in the global table, > when it could be just passed to the linked or fused sqe? 
Any data that crosses OPs needs to be registered somewhere, such as a fixed buffer or a fixed FD. Here "global" means context-wide, as seen from the OP/SQE viewpoint. A fused command is logically one whole command, even though it may include multiple SQEs. Context-wide registration is therefore not needed (the buffer is known to be shared only among a few IOs, not across the whole context), and the dependency is avoided as well, so no link is needed. This helps performance a lot: in a test on ublk/loop over tmpfs, IOPS drops to 1/2 with registration for 4k random IO, whereas fused command actually improves IOPS a bit. The baseline is the current in-tree ublk driver/ublksrv. > > > > > That is the reason why fused command is designed in the following way: > > > > - link can be avoided, so OPs needn't to be run in async > > - no need to add buffer into global table > > > > Cause it is really in fast io path. > > > >> > >>> > >>> If the above write OP need to write to file with in-kernel buffer > >>> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command > >>> before 1), and set sqe2->addr to offset of the buffer in 3).
> >>> > >>> However, fused command is usually used in the following way, such as FUSE daemon > >>> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request > >>> as io_uring fused command: > >>> > >>> 1) get sqe0 and prepare it for primary command, in which you need to > >>> provide info for retrieving kernel buffer/pages of this READ request > >>> > >>> 2) suppose this READ request needs to be handled by translating it to > >>> READs to two files/devices, considering it as one mirror: > >>> > >>> - get sqe1, prepare it for read from file1, and set sqe->addr to offset > >>> of the buffer in 1), set sqe->len as length for read; this READ OP > >>> uses the kernel buffer in 1) directly > >>> > >>> - get sqe2, prepare it for read from file2, and set sqe->addr to offset > >>> of buffer in 1), set sqe->len as length for read; this READ OP > >>> uses the kernel buffer in 1) directly > >>> > >>> 3) submit the three sqe by io_uring_enter() > >>> > >>> sqe1 and sqe2 can be submitted concurrently or be issued one by one > >>> in order, fused command supports both, and depends on user requirement. > >>> But io_uring linked OPs is usually slower. > >>> > >>> Also file1/file2 needs to be opened beforehand in this example, and FD is > >>> passed to sqe1/sqe2, another choice is to use fixed File; Also you can > >>> add the open/close() OPs into above steps, which need these open/close/READ > >>> to be linked in order, usually slower than non-linked OPs. > >> > >> > >> Yes thanks, I'm going to prepare this in a branch, otherwise current > >> fuse-uring would have a ZC regression (although my target ddn projects > >> cannot make use of it, as we need access to the buffer for checksums, etc). > > > > storage has similar use case too, such as encrypt, nvme tcp data digest, > > ..., if the checksum/encrypt approach is standard, maybe one new OP or > > syscall can be added for doing that on kernel buffer directly.
> > I very much see the use case for FUSED_CMD for overlay or simple network > sockets. Now in the HPC world one typically uses IB RDMA and if that > fails for some reasons (like connection down), tcp or other interfaces > as fallback. And there is sending the right part of the buffer to the > right server and erasure coding involved - it gets complex and I don't > think there is a way for us without a buffer copy. As I mentioned, this (checksum, encrypt, ...) becomes a generic issue once the zero copy approach is accepted. The problem itself is well-defined, so I am not worried that no solution can be found. Meanwhile, big memory copies do consume a lot of both CPU and memory bandwidth, and 64k/512k ublk IO has shown a big difference between copy and zero copy. Thanks, Ming
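The fused READ flow quoted in this message (one primary sqe0 plus two secondary reads whose addr is an offset into the primary's kernel buffer) can be illustrated with a small stand-alone model. This is not the uapi from the series: IORING_OP_FUSED_CMD is only a proposal here, and struct model_sqe, prep_fused_mirror_read() and the half/half split of the request between the two mirror files are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few SQE fields the description mentions. */
struct model_sqe {
    int      is_primary;  /* 1 for sqe0, 0 for secondary requests */
    int      fd;          /* device fd for the primary, backing file otherwise */
    uint64_t file_off;    /* offset in the backing file (secondaries only) */
    uint64_t addr;        /* secondaries: OFFSET into the primary's kernel
                             buffer, not a userspace pointer */
    uint32_t len;         /* bytes covered by this request */
};

/*
 * Model of steps 1) and 2): one primary command describing the READ
 * request, and two secondary READs that fill disjoint parts of the same
 * kernel buffer. Splitting the request in half is an assumed policy.
 */
static void prep_fused_mirror_read(struct model_sqe sqe[3], int dev_fd,
                                   int fd1, int fd2,
                                   uint64_t req_off, uint32_t req_len)
{
    uint32_t first = req_len / 2;

    assert(sqe);

    /* sqe0: primary command, identifies the kernel buffer of the request */
    sqe[0] = (struct model_sqe){ .is_primary = 1, .fd = dev_fd,
                                 .len = req_len };

    /* sqe1: read the first part from file1 into buffer offset 0 */
    sqe[1] = (struct model_sqe){ .fd = fd1, .file_off = req_off,
                                 .addr = 0, .len = first };

    /* sqe2: read the rest from file2 into buffer offset 'first' */
    sqe[2] = (struct model_sqe){ .fd = fd2, .file_off = req_off + first,
                                 .addr = first, .len = req_len - first };
}
```

The point of the model is that nothing is registered ring-wide and no IOSQE_IO_LINK flag appears: the two secondary requests refer to the buffer only through their position relative to the primary, and step 3) submits all three SQEs together.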
On 4/20/23 03:38, Ming Lei wrote: > On Wed, Apr 19, 2023 at 03:42:40PM +0000, Bernd Schubert wrote: >> I very much see the use case for FUSED_CMD for overlay or simple network >> sockets. Now in the HPC world one typically uses IB RDMA and if that >> fails for some reasons (like connection down), tcp or other interfaces >> as fallback. And there is sending the right part of the buffer to the >> right server and erasure coding involved - it gets complex and I don't >> think there is a way for us without a buffer copy. > > As I mentioned, it(checksum, encrypt, ...) becomes one generic issue if > the zero copy approach is accepted, meantime the problem itself is well-defined, > so I don't worry no solution can be figured out. > > Meantime big memory copy does consume both cpu and memory bandwidth a > lot, and 64k/512k ublk io has shown this big difference wrt. copy vs. > zero copy. I don't have any doubt about that, but I believe there is currently no way to support it in all use cases. As an example, suppose we would like to extend nbd with verbs/rdma instead of plain TCP: verbs/rdma needs registered memory and does not take a simple socket fd to send buffers to. Thanks, Bernd