Message ID: 20240408010322.4104395-6-ming.lei@redhat.com (mailing list archive)
State: New, archived
Series: io_uring: support sqe group and provide group kbuf
On 4/7/24 7:03 PM, Ming Lei wrote:
> SQE group is defined as one chain of SQEs starting with the first sqe that has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that doesn't have it set; it is similar to a chain of linked sqes.
>
> The 1st SQE is the group leader, and the other SQEs are group members. The group leader is always freed after all members are completed. Group members aren't submitted until the group leader is completed, there isn't any dependency among group members, and IOSQE_IO_LINK can't be set for group members, same with IOSQE_IO_DRAIN.
>
> Typically the group leader provides or produces a resource and the other members consume it, such as in a multiple-backup scenario: the 1st SQE reads data from the source file into a fixed buffer, and the other SQEs write data from the same buffer into other destination files. SQE group provides a very efficient way to complete this task: 1) the fs write SQEs and the fs read SQE can be submitted in a single syscall, with no need to submit the read SQE first and wait until it completes; 2) there is no need to link all write SQEs together, so the writes can be submitted to the files concurrently. Meanwhile the application is simplified a lot.
>
> Another use case is supporting generic device zero copy:
>
> - the lead SQE provides a device buffer, which is owned by the device or the kernel and can't cross into userspace, otherwise a malicious application could easily cause a leak or a panic
>
> - member SQEs read or write concurrently against the buffer provided by the lead SQE

In concept, this looks very similar to "sqe bundles" that I played with in the past:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle

Didn't look too closely yet at the implementation, but in spirit it's about the same in that the first entry is processed first, and there's no ordering implied between the rest of the members of the bundle / group.

I do think that's a flexible thing to support, particularly if:

1) We can do it more efficiently than links, which are pretty horrible.
2) It enables new worthwhile use cases
3) It's done cleanly
4) It's easily understandable and easy to document, so that users will actually understand what this is and what use cases it enables. Part of that is actually naming: it should be readily apparent what a group is, what the lead is, and what the members are. Using your terminology here, it's definitely worth spending some time on that to get it just right and self-evident.
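To make the submission pattern described above concrete, here is a minimal liburing-style sketch of the multiple-backup case. It assumes the proposed group flag were exposed as a plain sqe flag; the IOSQE_EXT_SQE_GROUP bit value and the backup_range() helper below are placeholders for illustration, not part of upstream liburing or necessarily of this series' exact ABI, and error handling is omitted.

    #include <liburing.h>

    #ifndef IOSQE_EXT_SQE_GROUP
    /* Placeholder bit: the real series may carry this in an extended flags field. */
    #define IOSQE_EXT_SQE_GROUP (1U << 7)
    #endif

    /* One read leader feeding N concurrent writes, all in one submission. */
    static int backup_range(struct io_uring *ring, int src_fd,
                            const int *dst_fds, int ndst,
                            void *fixed_buf, unsigned len, __u64 off)
    {
        struct io_uring_sqe *sqe;

        /* Group leader: read the source range into a registered fixed buffer
         * (buffer index 0). Members are not issued until this completes. */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read_fixed(sqe, src_fd, fixed_buf, len, off, 0);
        sqe->flags |= IOSQE_EXT_SQE_GROUP;

        /* Group members: one write per destination, all running concurrently
         * against the leader's buffer. The group ends at the first sqe that
         * doesn't carry the flag, so the last write leaves it clear. */
        for (int i = 0; i < ndst; i++) {
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_write_fixed(sqe, dst_fds[i], fixed_buf, len, off, 0);
            if (i != ndst - 1)
                sqe->flags |= IOSQE_EXT_SQE_GROUP;
        }

        return io_uring_submit(ring);   /* one syscall for the whole group */
    }

Everything goes out in a single io_uring_submit() call; the kernel defers the writes until the read has filled the buffer, which is the 1:N pattern the cover text describes.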
Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > On 4/7/24 7:03 PM, Ming Lei wrote: > > SQE group is defined as one chain of SQEs starting with the first sqe that > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > doesn't have it set, and it is similar with chain of linked sqes. > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > leader is always freed after all members are completed. Group members > > aren't submitted until the group leader is completed, and there isn't any > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > members, same with IOSQE_IO_DRAIN. > > > > Typically the group leader provides or makes resource, and the other members > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > read data from source file into fixed buffer, the other SQEs write data from > > the same buffer into other destination files. SQE group provides very > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > submitted in single syscall, no need to submit fs read SQE first, and wait > > until read SQE is completed, 2) no need to link all write SQEs together, then > > write SQEs can be submitted to files concurrently. Meantime application is > > simplified a lot in this way. > > > > Another use case is to for supporting generic device zero copy: > > > > - the lead SQE is for providing device buffer, which is owned by device or > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > application or panic > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > SQE > > In concept, this looks very similar to "sqe bundles" that I played with > in the past: > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > Didn't look too closely yet at the implementation, but in spirit it's > about the same in that the first entry is processed first, and there's > no ordering implied between the test of the members of the bundle / > group. When I first read this patch, I wondered if it wouldn't make sense to allow linking a group with subsequent requests, e.g. first having a few requests that run in parallel and once all of them have completed continue with the next linked one sequentially. For SQE bundles, you reused the LINK flag, which doesn't easily allow this. Ming's patch uses a new flag for groups, so the interface would be more obvious, you simply set the LINK flag on the last member of the group (or on the leader, doesn't really matter). Of course, this doesn't mean it has to be implemented now, but there is a clear way forward if it's wanted. The part that looks a bit arbitrary in Ming's patch is that the group leader is always completed before the rest starts. It makes perfect sense in the context that this series is really after (enabling zero copy for ublk), but it doesn't really allow the case you mention in the SQE bundle commit message, running everything in parallel and getting a single CQE for the whole group. I suppose you could hack around the sequential nature of the first request by using an extra NOP as the group leader - which isn't any worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but the group completion would still be missing. (Of course, removing the sequential first operation would mean that ublk wouldn't have the buffer ready any more when the other requests try to use it, so that would defeat the purpose of the series...) 
I wonder if we can still combine both approaches and create some generally useful infrastructure, rather than something where it's visible that it was designed mostly for ublk's special case, with other use cases merely enabled as a side effect.

Kevin
On Mon, Apr 22, 2024 at 12:27:28PM -0600, Jens Axboe wrote: > On 4/7/24 7:03 PM, Ming Lei wrote: > > SQE group is defined as one chain of SQEs starting with the first sqe that > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > doesn't have it set, and it is similar with chain of linked sqes. > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > leader is always freed after all members are completed. Group members > > aren't submitted until the group leader is completed, and there isn't any > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > members, same with IOSQE_IO_DRAIN. > > > > Typically the group leader provides or makes resource, and the other members > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > read data from source file into fixed buffer, the other SQEs write data from > > the same buffer into other destination files. SQE group provides very > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > submitted in single syscall, no need to submit fs read SQE first, and wait > > until read SQE is completed, 2) no need to link all write SQEs together, then > > write SQEs can be submitted to files concurrently. Meantime application is > > simplified a lot in this way. > > > > Another use case is to for supporting generic device zero copy: > > > > - the lead SQE is for providing device buffer, which is owned by device or > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > application or panic > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > SQE > > In concept, this looks very similar to "sqe bundles" that I played with > in the past: > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle Indeed, so looks it is something which io_uring needs. > > Didn't look too closely yet at the implementation, but in spirit it's > about the same in that the first entry is processed first, and there's > no ordering implied between the test of the members of the bundle / > group. Yeah. > > I do think that's a flexible thing to support, particularly if: > > 1) We can do it more efficiently than links, which are pretty horrible. Agree, link is hard to use in async/.await of modern language per my experience. Also sqe group won't break link, and the group is thought as a whole wrt. linking. > 2) It enables new worthwhile use cases > 3) It's done cleanly > 4) It's easily understandable and easy to document, so that users will > actually understand what this is and what use cases it enable. Part > of that is actually naming, it should be readily apparent what a > group is, what the lead is, and what the members are. Using your > terminology here, definitely worth spending some time on that to get > it just right and self evident. All are nice suggestions, and I will follow above and make them in V2. Thanks, Ming
On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote: > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > > On 4/7/24 7:03 PM, Ming Lei wrote: > > > SQE group is defined as one chain of SQEs starting with the first sqe that > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > > doesn't have it set, and it is similar with chain of linked sqes. > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > > leader is always freed after all members are completed. Group members > > > aren't submitted until the group leader is completed, and there isn't any > > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > > members, same with IOSQE_IO_DRAIN. > > > > > > Typically the group leader provides or makes resource, and the other members > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > > read data from source file into fixed buffer, the other SQEs write data from > > > the same buffer into other destination files. SQE group provides very > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > > submitted in single syscall, no need to submit fs read SQE first, and wait > > > until read SQE is completed, 2) no need to link all write SQEs together, then > > > write SQEs can be submitted to files concurrently. Meantime application is > > > simplified a lot in this way. > > > > > > Another use case is to for supporting generic device zero copy: > > > > > > - the lead SQE is for providing device buffer, which is owned by device or > > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > > application or panic > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > > SQE > > > > In concept, this looks very similar to "sqe bundles" that I played with > > in the past: > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > > > Didn't look too closely yet at the implementation, but in spirit it's > > about the same in that the first entry is processed first, and there's > > no ordering implied between the test of the members of the bundle / > > group. > > When I first read this patch, I wondered if it wouldn't make sense to > allow linking a group with subsequent requests, e.g. first having a few > requests that run in parallel and once all of them have completed > continue with the next linked one sequentially. > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > this. Ming's patch uses a new flag for groups, so the interface would be > more obvious, you simply set the LINK flag on the last member of the > group (or on the leader, doesn't really matter). Of course, this doesn't > mean it has to be implemented now, but there is a clear way forward if > it's wanted. Reusing LINK for bundle breaks existed link chains(BUNDLE linked to existed link chain), so I think it may not work. The link rule is explicit for sqe group: - only group leader can set link flag, which is applied on the whole group: the next sqe in the link chain won't be started until the previous linked sqe group is completed - link flag can't be set for group members Also sqe group doesn't limit async for both group leader and member. sqe group vs link & async is covered in the last liburing test code. > > The part that looks a bit arbitrary in Ming's patch is that the group > leader is always completed before the rest starts. 
> It makes perfect sense in the context that this series is really after (enabling zero copy for ublk), but it doesn't really allow the case you mention in the SQE bundle commit message, running everything in parallel and getting a single CQE for the whole group.

I think it should be easy to cover bundles this way, such as by adding one new op IORING_OP_BUNDLE as Jens did, and implementing the single CQE for the whole group/bundle.

> I suppose you could hack around the sequential nature of the first request by using an extra NOP as the group leader - which isn't any worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but the group completion would still be missing. (Of course, removing the sequential first operation would mean that ublk wouldn't have the buffer ready any more when the other requests try to use it, so that would defeat the purpose of the series...)
>
> I wonder if we can still combine both approaches and create some generally useful infrastructure and not something where it's visible that it was designed mostly for ublk's special case and other use cases just happened to be enabled as a side effect.

sqe group is actually a generic interface; please see the multiple-copy example (copying one range of a file to multiple destinations in a single syscall) in the last patch, and it can support generic device zero copy: any device internal buffer can be linked with io_uring operations in this way, which can't be done with traditional splice/pipe. I guess it can be used for network Rx zero copy too, but that may depend on the actual network Rx use case.

Thanks,
Ming
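As a rough sketch of the link rule described above (the link flag is set only on the group leader and applies to the whole group), again treating the group flag as a hypothetical plain sqe flag; queue_group_then_fsync() is an illustrative helper name, not an API from the series.

    #include <liburing.h>

    #ifndef IOSQE_EXT_SQE_GROUP
    #define IOSQE_EXT_SQE_GROUP (1U << 7)   /* placeholder bit for illustration */
    #endif

    /* read leader -> two parallel writes -> fsync linked to the whole group. */
    static void queue_group_then_fsync(struct io_uring *ring, int src_fd,
                                       int dst1, int dst2,
                                       char *buf, unsigned len)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, src_fd, buf, len, 0);
        /* Leader: group flag plus link flag; the link applies to the group
         * as a whole, so the fsync below waits for leader and all members. */
        sqe->flags |= IOSQE_EXT_SQE_GROUP | IOSQE_IO_LINK;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, dst1, buf, len, 0);
        sqe->flags |= IOSQE_EXT_SQE_GROUP;            /* member */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, dst2, buf, len, 0);  /* last member: no flag */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, dst1, 0);            /* starts after the group */
    }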
Am 24.04.2024 um 03:39 hat Ming Lei geschrieben: > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote: > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > > > On 4/7/24 7:03 PM, Ming Lei wrote: > > > > SQE group is defined as one chain of SQEs starting with the first sqe that > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > > > doesn't have it set, and it is similar with chain of linked sqes. > > > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > > > leader is always freed after all members are completed. Group members > > > > aren't submitted until the group leader is completed, and there isn't any > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > > > members, same with IOSQE_IO_DRAIN. > > > > > > > > Typically the group leader provides or makes resource, and the other members > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > > > read data from source file into fixed buffer, the other SQEs write data from > > > > the same buffer into other destination files. SQE group provides very > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > > > submitted in single syscall, no need to submit fs read SQE first, and wait > > > > until read SQE is completed, 2) no need to link all write SQEs together, then > > > > write SQEs can be submitted to files concurrently. Meantime application is > > > > simplified a lot in this way. > > > > > > > > Another use case is to for supporting generic device zero copy: > > > > > > > > - the lead SQE is for providing device buffer, which is owned by device or > > > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > > > application or panic > > > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > > > SQE > > > > > > In concept, this looks very similar to "sqe bundles" that I played with > > > in the past: > > > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > > > > > Didn't look too closely yet at the implementation, but in spirit it's > > > about the same in that the first entry is processed first, and there's > > > no ordering implied between the test of the members of the bundle / > > > group. > > > > When I first read this patch, I wondered if it wouldn't make sense to > > allow linking a group with subsequent requests, e.g. first having a few > > requests that run in parallel and once all of them have completed > > continue with the next linked one sequentially. > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > this. Ming's patch uses a new flag for groups, so the interface would be > > more obvious, you simply set the LINK flag on the last member of the > > group (or on the leader, doesn't really matter). Of course, this doesn't > > mean it has to be implemented now, but there is a clear way forward if > > it's wanted. > > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to > existed link chain), so I think it may not work. You can always extend things *somehow*, but it wouldn't fit very naturally. That's why I feel your approach on this detail is a little better. 
> The link rule is explicit for sqe group: > > - only group leader can set link flag, which is applied on the whole > group: the next sqe in the link chain won't be started until the > previous linked sqe group is completed > > - link flag can't be set for group members > > Also sqe group doesn't limit async for both group leader and member. > > sqe group vs link & async is covered in the last liburing test code. Oh right, I didn't actually notice that you already implement what I proposed! I was expecting the flag on the last SQE and I saw in the code that this isn't allowed, but I completely missed your comment that explicitly states that it's the group leader that gets the link flag. Of course, this is just as good. > > The part that looks a bit arbitrary in Ming's patch is that the group > > leader is always completed before the rest starts. It makes perfect > > sense in the context that this series is really after (enabling zero > > copy for ublk), but it doesn't really allow the case you mention in the > > SQE bundle commit message, running everything in parallel and getting a > > single CQE for the whole group. > > I think it should be easy to cover bundle in this way, such as add one > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for > whole group/bundle. This requires an extra SQE compared to just creating the group with flags, but I suppose this is not a big problem. An alternative might be sending the CQE for the group leader only after the whole group has completed if we're okay with userspace never knowing when the leader itself completed. However, assuming an IORING_OP_BUNDLE command, if this command only completes after the whole group, doesn't that conflict with the principle that all other commands are only started after the first one has completed? Maybe we shouldn't wait for the whole group leader request to complete, but just give the group leader a chance to prepare the group before all requests in the group (including the leader itself) are run in parallel. Maybe io_issue_sqe() could just start the rest of the group somewhere after calling def->issue() for the leader. Then you can't prepare the group buffer asynchronously, but I don't think this is needed, right? Your example with one read followed by multiple writes would then have to be written slightly differently: First the read outside of the group, linked to a group of writes. I honestly think this makes more sense as an interface, too, because then links are for sequential things and groups are (only) for parallel things. This feels clearer than having both a sequential and a parallel element in groups. > > I suppose you could hack around the sequential nature of the first > > request by using an extra NOP as the group leader - which isn't any > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but > > the group completion would still be missing. (Of course, removing the > > sequential first operation would mean that ublk wouldn't have the buffer > > ready any more when the other requests try to use it, so that would > > defeat the purpose of the series...) > > > > I wonder if we can still combine both approaches and create some > > generally useful infrastructure and not something where it's visible > > that it was designed mostly for ublk's special case and other use cases > > just happened to be enabled as a side effect. 
> > sqe group is actually one generic interface, please see the multiple > copy( copy one file to multiple destinations in single syscall for one > range) example in the last patch Yes, that's an example that happens to work well with the model that you derived from ublk. If you have the opposite case, reading a buffer that is spread across multiple files and then writing it to one target (i.e. first step parallel, second step sequential), you can't represent this well currently. You could work around it by having a NOP leader, but that's not very elegant. This asymmetry suggests that it's not the perfect interface yet. If the whole group runs in parallel instead, including the leader, then both examples become symmetrical. You have a group for the parallel I/O and a linked single request for the other operation. Or if both steps are parallel, you can just have two linked groups. > and it can support generic device zero copy: any device internal > buffer can be linked with io_uring operations in this way, which can't > be done by traditional splice/pipe. Is this actually implemented or is it just a potential direction for the future? > I guess it can be used in network Rx zero copy too, but may depend on > actual network Rx use case. Kevin
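A sketch of the "opposite" shape and the NOP-leader workaround mentioned above, under the same assumptions as the earlier sketches (placeholder flag bit, hypothetical gather_then_write() helper): the reads scatter into one userspace buffer in parallel, and the linked write only starts once the whole group has completed.

    #include <liburing.h>

    #ifndef IOSQE_EXT_SQE_GROUP
    #define IOSQE_EXT_SQE_GROUP (1U << 7)   /* placeholder bit for illustration */
    #endif

    /* Parallel reads from several source files into slices of one buffer,
     * then one sequential write of the assembled buffer. A NOP acts as the
     * otherwise-pointless group leader; its link flag defers the write
     * until every member read has completed. Error handling omitted. */
    static void gather_then_write(struct io_uring *ring, const int *src_fds,
                                  int nsrc, int dst_fd,
                                  char *buf, unsigned chunk)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_nop(sqe);                        /* dummy leader */
        sqe->flags |= IOSQE_EXT_SQE_GROUP | IOSQE_IO_LINK;

        for (int i = 0; i < nsrc; i++) {
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_read(sqe, src_fds[i], buf + i * chunk, chunk, 0);
            if (i != nsrc - 1)
                sqe->flags |= IOSQE_EXT_SQE_GROUP;     /* members, all parallel */
        }

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, dst_fd, buf, (unsigned)nsrc * chunk, 0);
    }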
On Thu, Apr 25, 2024 at 11:27:32AM +0200, Kevin Wolf wrote: > Am 24.04.2024 um 03:39 hat Ming Lei geschrieben: > > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote: > > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > > > > On 4/7/24 7:03 PM, Ming Lei wrote: > > > > > SQE group is defined as one chain of SQEs starting with the first sqe that > > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > > > > doesn't have it set, and it is similar with chain of linked sqes. > > > > > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > > > > leader is always freed after all members are completed. Group members > > > > > aren't submitted until the group leader is completed, and there isn't any > > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > > > > members, same with IOSQE_IO_DRAIN. > > > > > > > > > > Typically the group leader provides or makes resource, and the other members > > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > > > > read data from source file into fixed buffer, the other SQEs write data from > > > > > the same buffer into other destination files. SQE group provides very > > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > > > > submitted in single syscall, no need to submit fs read SQE first, and wait > > > > > until read SQE is completed, 2) no need to link all write SQEs together, then > > > > > write SQEs can be submitted to files concurrently. Meantime application is > > > > > simplified a lot in this way. > > > > > > > > > > Another use case is to for supporting generic device zero copy: > > > > > > > > > > - the lead SQE is for providing device buffer, which is owned by device or > > > > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > > > > application or panic > > > > > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > > > > SQE > > > > > > > > In concept, this looks very similar to "sqe bundles" that I played with > > > > in the past: > > > > > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > > > > > > > Didn't look too closely yet at the implementation, but in spirit it's > > > > about the same in that the first entry is processed first, and there's > > > > no ordering implied between the test of the members of the bundle / > > > > group. > > > > > > When I first read this patch, I wondered if it wouldn't make sense to > > > allow linking a group with subsequent requests, e.g. first having a few > > > requests that run in parallel and once all of them have completed > > > continue with the next linked one sequentially. > > > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > > this. Ming's patch uses a new flag for groups, so the interface would be > > > more obvious, you simply set the LINK flag on the last member of the > > > group (or on the leader, doesn't really matter). Of course, this doesn't > > > mean it has to be implemented now, but there is a clear way forward if > > > it's wanted. > > > > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to > > existed link chain), so I think it may not work. > > You can always extend things *somehow*, but it wouldn't fit very > naturally. That's why I feel your approach on this detail is a little > better. 
Linking group in traditionally way is real use case, please see ublk-nbd's zero copy implementation. https://github.com/ublk-org/ublksrv/blob/group-provide-buf/nbd/tgt_nbd.cpp > > > The link rule is explicit for sqe group: > > > > - only group leader can set link flag, which is applied on the whole > > group: the next sqe in the link chain won't be started until the > > previous linked sqe group is completed > > > > - link flag can't be set for group members > > > > Also sqe group doesn't limit async for both group leader and member. > > > > sqe group vs link & async is covered in the last liburing test code. > > Oh right, I didn't actually notice that you already implement what I > proposed! > > I was expecting the flag on the last SQE and I saw in the code that this > isn't allowed, but I completely missed your comment that explicitly > states that it's the group leader that gets the link flag. Of course, > this is just as good. > > > > The part that looks a bit arbitrary in Ming's patch is that the group > > > leader is always completed before the rest starts. It makes perfect > > > sense in the context that this series is really after (enabling zero > > > copy for ublk), but it doesn't really allow the case you mention in the > > > SQE bundle commit message, running everything in parallel and getting a > > > single CQE for the whole group. > > > > I think it should be easy to cover bundle in this way, such as add one > > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for > > whole group/bundle. > > This requires an extra SQE compared to just creating the group with > flags, but I suppose this is not a big problem. An alternative might be > sending the CQE for the group leader only after the whole group has > completed if we're okay with userspace never knowing when the leader > itself completed. > > However, assuming an IORING_OP_BUNDLE command, if this command only > completes after the whole group, doesn't that conflict with the Here the completion means committing CQE to userspace ring. > principle that all other commands are only started after the first one > has completed? I meant IORING_OP_BUNDLE is the group leader, and the first one is the the leader. The member requests won't be started until the leader is completed, and here the completion means that the request is completed from subsystem (FS, netowork, ...), so there isn't conflict, but yes, we need to describe the whole ideas/terms more carefully. > > Maybe we shouldn't wait for the whole group leader request to complete, > but just give the group leader a chance to prepare the group before all > requests in the group (including the leader itself) are run in parallel. > Maybe io_issue_sqe() could just start the rest of the group somewhere > after calling def->issue() for the leader. Then you can't prepare the > group buffer asynchronously, but I don't think this is needed, right? That isn't true, if leader request is one network RX, we need to wait until the recv is done, then the following member requests can be started for consuming the received data. Same example with the multiple copy one in last patch. > > Your example with one read followed by multiple writes would then have > to be written slightly differently: First the read outside of the group, > linked to a group of writes. I honestly think this makes more sense as > an interface, too, because then links are for sequential things and > groups are (only) for parallel things. 
> This feels clearer than having both a sequential and a parallel element in groups.

Group also implements a 1:N dependency, in which N members depend on a single group leader, while there isn't any dependency among the members themselves. That is something the current io_uring is missing.

> > > I suppose you could hack around the sequential nature of the first request by using an extra NOP as the group leader - which isn't any worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but the group completion would still be missing. (Of course, removing the sequential first operation would mean that ublk wouldn't have the buffer ready any more when the other requests try to use it, so that would defeat the purpose of the series...)
> > >
> > > I wonder if we can still combine both approaches and create some generally useful infrastructure and not something where it's visible that it was designed mostly for ublk's special case and other use cases just happened to be enabled as a side effect.
> >
> > sqe group is actually one generic interface, please see the multiple copy (copy one file to multiple destinations in single syscall for one range) example in the last patch
>
> Yes, that's an example that happens to work well with the model that you derived from ublk.

It is not only for ublk and device zero copy; there is also the multiple-copy example.

> If you have the opposite case, reading a buffer that is spread across multiple files and then writing it to one target (i.e. first step parallel, second step sequential), you can't represent this well currently. You could work around it by having a NOP leader, but that's not very elegant.

Yeah, linking the group (nop & reads) with the following write does work for the above copy case, :-)

> This asymmetry suggests that it's not the perfect interface yet.

The 1:N dependency requires the asymmetry, and nothing in this world is perfect, :-) But we can try to make it better.

> If the whole group runs in parallel instead, including the leader, then both examples become symmetrical. You have a group for the parallel I/O and a linked single request for the other operation.
>
> Or if both steps are parallel, you can just have two linked groups.

I think sqe group can be extended this way easily with one new flag if there is such a real use case. We could still use the leader's link flag for the same purpose, and only advance the link chain once the whole group is done. Then the sqe group just degrades to a plain group without the 1:N dependency, and the leader is just there to carry the group's link flag; that looks like it can be well defined and documented, and could be useful too, IMO.

> > and it can support generic device zero copy: any device internal buffer can be linked with io_uring operations in this way, which can't be done by traditional splice/pipe.
>
> Is this actually implemented or is it just a potential direction for the future?

It is a potential direction, since sqe group & provide buffer offer a generic framework to export a device's internal buffer for further consumption in a zero-copy, non-mmap way.

Thanks,
Ming
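For the network RX case mentioned above, a hedged sketch of the same 1:N shape with a recv as the group leader. In the real zero-copy design the members would consume a kernel group buffer rather than the plain userspace buffer used here; recv_then_fanout() and the flag bit are again illustrative placeholders.

    #include <liburing.h>
    #include <sys/socket.h>

    #ifndef IOSQE_EXT_SQE_GROUP
    #define IOSQE_EXT_SQE_GROUP (1U << 7)   /* placeholder bit for illustration */
    #endif

    /* Leader: receive a message; members: persist it to two files at once.
     * The members cannot start before the recv completes, because only then
     * is the data available - this is the 1:N dependency the group provides.
     * For simplicity the writes assume the full length was received. */
    static void recv_then_fanout(struct io_uring *ring, int sock_fd,
                                 int log_fd, int mirror_fd,
                                 char *buf, unsigned len)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, sock_fd, buf, len, 0);
        sqe->flags |= IOSQE_EXT_SQE_GROUP;                 /* group leader */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, log_fd, buf, len, 0);
        sqe->flags |= IOSQE_EXT_SQE_GROUP;                 /* member */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, mirror_fd, buf, len, 0);  /* last member */
    }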
Am 26.04.2024 um 09:53 hat Ming Lei geschrieben: > On Thu, Apr 25, 2024 at 11:27:32AM +0200, Kevin Wolf wrote: > > Am 24.04.2024 um 03:39 hat Ming Lei geschrieben: > > > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote: > > > > When I first read this patch, I wondered if it wouldn't make sense to > > > > allow linking a group with subsequent requests, e.g. first having a few > > > > requests that run in parallel and once all of them have completed > > > > continue with the next linked one sequentially. > > > > > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > > > this. Ming's patch uses a new flag for groups, so the interface would be > > > > more obvious, you simply set the LINK flag on the last member of the > > > > group (or on the leader, doesn't really matter). Of course, this doesn't > > > > mean it has to be implemented now, but there is a clear way forward if > > > > it's wanted. > > > > > > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to > > > existed link chain), so I think it may not work. > > > > You can always extend things *somehow*, but it wouldn't fit very > > naturally. That's why I feel your approach on this detail is a little > > better. > > Linking group in traditionally way is real use case, please see > ublk-nbd's zero copy implementation. > > https://github.com/ublk-org/ublksrv/blob/group-provide-buf/nbd/tgt_nbd.cpp I'm not sure what you're trying to argue, I agreed with you twice? I don't think Jens's bundles break existing users of links because the special meaning is only triggered with IORING_OP_BUNDLE. But obviously they don't allow linking a bundle with something else after it, which feels more limiting than necessary. > > > The link rule is explicit for sqe group: > > > > > > - only group leader can set link flag, which is applied on the whole > > > group: the next sqe in the link chain won't be started until the > > > previous linked sqe group is completed > > > > > > - link flag can't be set for group members > > > > > > Also sqe group doesn't limit async for both group leader and member. > > > > > > sqe group vs link & async is covered in the last liburing test code. > > > > Oh right, I didn't actually notice that you already implement what I > > proposed! > > > > I was expecting the flag on the last SQE and I saw in the code that this > > isn't allowed, but I completely missed your comment that explicitly > > states that it's the group leader that gets the link flag. Of course, > > this is just as good. > > > > > > The part that looks a bit arbitrary in Ming's patch is that the group > > > > leader is always completed before the rest starts. It makes perfect > > > > sense in the context that this series is really after (enabling zero > > > > copy for ublk), but it doesn't really allow the case you mention in the > > > > SQE bundle commit message, running everything in parallel and getting a > > > > single CQE for the whole group. > > > > > > I think it should be easy to cover bundle in this way, such as add one > > > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for > > > whole group/bundle. > > > > This requires an extra SQE compared to just creating the group with > > flags, but I suppose this is not a big problem. An alternative might be > > sending the CQE for the group leader only after the whole group has > > completed if we're okay with userspace never knowing when the leader > > itself completed. 
> > > > However, assuming an IORING_OP_BUNDLE command, if this command only > > completes after the whole group, doesn't that conflict with the > > Here the completion means committing CQE to userspace ring. > > > principle that all other commands are only started after the first one > > has completed? > > I meant IORING_OP_BUNDLE is the group leader, and the first one is the > the leader. > > The member requests won't be started until the leader is completed, and > here the completion means that the request is completed from subsystem > (FS, netowork, ...), so there isn't conflict, but yes, we need to > describe the whole ideas/terms more carefully. Is there precedence for requests that are completed, but don't result in a CQE immediately? But yes, it's the same as I had in mind above when I was talking about completing the leader only after the whole group has completed. > > Maybe we shouldn't wait for the whole group leader request to complete, > > but just give the group leader a chance to prepare the group before all > > requests in the group (including the leader itself) are run in parallel. > > Maybe io_issue_sqe() could just start the rest of the group somewhere > > after calling def->issue() for the leader. Then you can't prepare the > > group buffer asynchronously, but I don't think this is needed, right? > > That isn't true, if leader request is one network RX, we need to wait > until the recv is done, then the following member requests can be > started for consuming the received data. > > Same example with the multiple copy one in last patch. I don't see a group kernel buffer in the last patch at all? It seems to use userspace buffers. In which case the read doesn't have to be part of the group at all: You can have a read and link that with a group of writes. Then you have the clear semantics of link = sequential, group = parallel again. But let's assume that the read actually did provide a group buffer. What this example showed me is that grouping requests for parallel submission is logically independent from grouping requests for sharing a buffer. For full flexibility, they would probably have to be separate concepts. You could then have the same setup as before (read linked to a group of writes), but still share a group kernel buffer for the whole sequence. However, it's not clear if that the full flexibility is needed, and it would probably complicate the implementation a bit. > > Your example with one read followed by multiple writes would then have > > to be written slightly differently: First the read outside of the group, > > linked to a group of writes. I honestly think this makes more sense as > > an interface, too, because then links are for sequential things and > > groups are (only) for parallel things. This feels clearer than having > > both a sequential and a parallel element in groups. > > Group also implements 1:N dependency, in which N members depends on > single group leader, meantime there isn't any dependency among each > members. That is something the current io_uring is missing. Dependencies are currently expressed with links, which is why I felt that it would be good to use them in this case, too. Groups that only include parallel requests and can be part of a link chain even provide N:M dependencies, so are even more powerful than the fixed 1:N of your groups. The only thing that doesn't work as nicely then is sharing the buffer as long as it's not treated as a separate concept. 
> > > > I suppose you could hack around the sequential nature of the first > > > > request by using an extra NOP as the group leader - which isn't any > > > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but > > > > the group completion would still be missing. (Of course, removing the > > > > sequential first operation would mean that ublk wouldn't have the buffer > > > > ready any more when the other requests try to use it, so that would > > > > defeat the purpose of the series...) > > > > > > > > I wonder if we can still combine both approaches and create some > > > > generally useful infrastructure and not something where it's visible > > > > that it was designed mostly for ublk's special case and other use cases > > > > just happened to be enabled as a side effect. > > > > > > sqe group is actually one generic interface, please see the multiple > > > copy( copy one file to multiple destinations in single syscall for one > > > range) example in the last patch > > > > Yes, that's an example that happens to work well with the model that you > > derived from ublk. > > Not only for ublk and device zero copy, it also have the multiple copy example. This is what I replied to. Yes, it's an example where the model works fine. This is not evidence that the model is as generic as it could be, just that it's an example that fits it. > > If you have the opposite case, reading a buffer that is spread across > > multiple files and then writing it to one target (i.e. first step > > parallel, second step sequential), you can't represent this well > > currently. You could work around it by having a NOP leader, but that's > > not very elegant. > > Yeah, linking the group(nop & reads) with the following write does > work for the above copy case, :-) > > > > > This asymmetry suggests that it's not the perfect interface yet. > > 1:N dependency requires the asymmetry, an nothing in this world is perfect, :-) > But we can try to make it better. The asymmetry doesn't contribute anything to the 1:N dependency. As discussed above, normal links combined with fully parallel (and therefore symmetrical) groups provide this functionality, too. The only real reason I see for justifying it is the group kernel buffer. > > If the whole group runs in parallel instead, including the leader, then > > both examples become symmetrical. You have a group for the parallel I/O > > and a linked single request for the other operation. > > > > Or if both steps are parallel, you can just have two linked groups. > > I think sqe group can be extended to this way easily by one new flag if > there is such real use case. We still can use leader's link flag for > same purpose, an only advance the linking chain until the whole group > is done. > > Then sqe group just degrades to single link group without 1:N dependency > covered and leader is just for providing group link flag, looks it can be > well defined and documented, and could be useful too, IMO. If you're willing to introduce a second flag, then I'd consider using that flag to define buffer sharing groups independently of groups for parallel execution. I think it's usually preferable to build the semantics you need by combining flags that provide independent building blocks with a straightforward meaning than having a single complex building block and then flags that modify the way the complex concept works. 
> > > and it can support generic device zero copy: any device internal > > > buffer can be linked with io_uring operations in this way, which can't > > > be done by traditional splice/pipe. > > > > Is this actually implemented or is it just a potential direction for the > > future? > > It is potential direction since sqe group & provide buffer provides one > generic framework to export device internal buffer for further consuming > in zero copy & non-mmap way. I see. This contributes a bit to the impression that much of the design is driven by ublk alone, because it's the only thing that seems to make use of group buffers so far. Anyway, I'm just throwing in a few thoughts and ideas from outside. In the end, Jens and you need to agree on something. Kevin
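For comparison, a sketch of the interface shape argued for above (links for sequential steps, groups only for parallel ones): the read sits outside the group and is linked to a group of writes. This is how it might look if group members all ran in parallel; under the current series the first write would still act as a group leader and complete before the others. The same placeholder-flag caveat as in the earlier sketches applies, and read_linked_to_write_group() is a hypothetical helper.

    #include <liburing.h>

    #ifndef IOSQE_EXT_SQE_GROUP
    #define IOSQE_EXT_SQE_GROUP (1U << 7)   /* placeholder bit for illustration */
    #endif

    /* read (sequential, via link) -> group of writes (parallel). */
    static void read_linked_to_write_group(struct io_uring *ring, int src_fd,
                                           const int *dst_fds, int ndst,
                                           char *buf, unsigned len)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, src_fd, buf, len, 0);
        sqe->flags |= IOSQE_IO_LINK;                 /* writes wait for the read */

        for (int i = 0; i < ndst; i++) {
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_write(sqe, dst_fds[i], buf, len, 0);
            if (i != ndst - 1)
                sqe->flags |= IOSQE_EXT_SQE_GROUP;   /* group of parallel writes */
        }
    }

The appeal of this shape is that the dependency lives entirely in the link, while the group expresses nothing but concurrency; the open question in the thread is how a shared kernel buffer would be tied to such a group.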
On Fri, Apr 26, 2024 at 07:05:17PM +0200, Kevin Wolf wrote: > Am 26.04.2024 um 09:53 hat Ming Lei geschrieben: > > On Thu, Apr 25, 2024 at 11:27:32AM +0200, Kevin Wolf wrote: > > > Am 24.04.2024 um 03:39 hat Ming Lei geschrieben: > > > > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote: > > > > > When I first read this patch, I wondered if it wouldn't make sense to > > > > > allow linking a group with subsequent requests, e.g. first having a few > > > > > requests that run in parallel and once all of them have completed > > > > > continue with the next linked one sequentially. > > > > > > > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > > > > this. Ming's patch uses a new flag for groups, so the interface would be > > > > > more obvious, you simply set the LINK flag on the last member of the > > > > > group (or on the leader, doesn't really matter). Of course, this doesn't > > > > > mean it has to be implemented now, but there is a clear way forward if > > > > > it's wanted. > > > > > > > > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to > > > > existed link chain), so I think it may not work. > > > > > > You can always extend things *somehow*, but it wouldn't fit very > > > naturally. That's why I feel your approach on this detail is a little > > > better. > > > > Linking group in traditionally way is real use case, please see > > ublk-nbd's zero copy implementation. > > > > https://github.com/ublk-org/ublksrv/blob/group-provide-buf/nbd/tgt_nbd.cpp > > I'm not sure what you're trying to argue, I agreed with you twice? > > I don't think Jens's bundles break existing users of links because the > special meaning is only triggered with IORING_OP_BUNDLE. But obviously > they don't allow linking a bundle with something else after it, which > feels more limiting than necessary. OK. > > > > > The link rule is explicit for sqe group: > > > > > > > > - only group leader can set link flag, which is applied on the whole > > > > group: the next sqe in the link chain won't be started until the > > > > previous linked sqe group is completed > > > > > > > > - link flag can't be set for group members > > > > > > > > Also sqe group doesn't limit async for both group leader and member. > > > > > > > > sqe group vs link & async is covered in the last liburing test code. > > > > > > Oh right, I didn't actually notice that you already implement what I > > > proposed! > > > > > > I was expecting the flag on the last SQE and I saw in the code that this > > > isn't allowed, but I completely missed your comment that explicitly > > > states that it's the group leader that gets the link flag. Of course, > > > this is just as good. > > > > > > > > The part that looks a bit arbitrary in Ming's patch is that the group > > > > > leader is always completed before the rest starts. It makes perfect > > > > > sense in the context that this series is really after (enabling zero > > > > > copy for ublk), but it doesn't really allow the case you mention in the > > > > > SQE bundle commit message, running everything in parallel and getting a > > > > > single CQE for the whole group. > > > > > > > > I think it should be easy to cover bundle in this way, such as add one > > > > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for > > > > whole group/bundle. > > > > > > This requires an extra SQE compared to just creating the group with > > > flags, but I suppose this is not a big problem. 
An alternative might be > > > sending the CQE for the group leader only after the whole group has > > > completed if we're okay with userspace never knowing when the leader > > > itself completed. > > > > > > However, assuming an IORING_OP_BUNDLE command, if this command only > > > completes after the whole group, doesn't that conflict with the > > > > Here the completion means committing CQE to userspace ring. > > > > > principle that all other commands are only started after the first one > > > has completed? > > > > I meant IORING_OP_BUNDLE is the group leader, and the first one is the > > the leader. > > > > The member requests won't be started until the leader is completed, and > > here the completion means that the request is completed from subsystem > > (FS, netowork, ...), so there isn't conflict, but yes, we need to > > describe the whole ideas/terms more carefully. > > Is there precedence for requests that are completed, but don't result in > a CQE immediately? But yes, it's the same as I had in mind above when I > was talking about completing the leader only after the whole group has > completed. > > > > Maybe we shouldn't wait for the whole group leader request to complete, > > > but just give the group leader a chance to prepare the group before all > > > requests in the group (including the leader itself) are run in parallel. > > > Maybe io_issue_sqe() could just start the rest of the group somewhere > > > after calling def->issue() for the leader. Then you can't prepare the > > > group buffer asynchronously, but I don't think this is needed, right? > > > > That isn't true, if leader request is one network RX, we need to wait > > until the recv is done, then the following member requests can be > > started for consuming the received data. > > > > Same example with the multiple copy one in last patch. > > I don't see a group kernel buffer in the last patch at all? It seems to > use userspace buffers. In which case the read doesn't have to be part of > the group at all: You can have a read and link that with a group of > writes. Then you have the clear semantics of link = sequential, > group = parallel again. > > But let's assume that the read actually did provide a group buffer. > > What this example showed me is that grouping requests for parallel > submission is logically independent from grouping requests for sharing a > buffer. For full flexibility, they would probably have to be separate > concepts. You could then have the same setup as before (read linked to a > group of writes), but still share a group kernel buffer for the whole > sequence. kernel buffer lifetime is aligned with the whole group lifetime, that is why the leader can't be moved out of group. Otherwise the two groups are actually same, but still lots of things are in common, such as, how to handle link for whole group, and the implementation should share most of code, same with the group concept. > > However, it's not clear if that the full flexibility is needed, and it > would probably complicate the implementation a bit. > > > > Your example with one read followed by multiple writes would then have > > > to be written slightly differently: First the read outside of the group, > > > linked to a group of writes. I honestly think this makes more sense as > > > an interface, too, because then links are for sequential things and > > > groups are (only) for parallel things. This feels clearer than having > > > both a sequential and a parallel element in groups. 
> > > > Group also implements 1:N dependency, in which N members depends on > > single group leader, meantime there isn't any dependency among each > > members. That is something the current io_uring is missing. > > Dependencies are currently expressed with links, which is why I felt > that it would be good to use them in this case, too. Groups that only > include parallel requests and can be part of a link chain even provide > N:M dependencies, so are even more powerful than the fixed 1:N of your > groups. > > The only thing that doesn't work as nicely then is sharing the buffer as > long as it's not treated as a separate concept. Right, kernel buffer lifetime is really one hard problem, aligning it with group lifetime simplifies a lot for zero copy problem. > > > > > > I suppose you could hack around the sequential nature of the first > > > > > request by using an extra NOP as the group leader - which isn't any > > > > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but > > > > > the group completion would still be missing. (Of course, removing the > > > > > sequential first operation would mean that ublk wouldn't have the buffer > > > > > ready any more when the other requests try to use it, so that would > > > > > defeat the purpose of the series...) > > > > > > > > > > I wonder if we can still combine both approaches and create some > > > > > generally useful infrastructure and not something where it's visible > > > > > that it was designed mostly for ublk's special case and other use cases > > > > > just happened to be enabled as a side effect. > > > > > > > > sqe group is actually one generic interface, please see the multiple > > > > copy( copy one file to multiple destinations in single syscall for one > > > > range) example in the last patch > > > > > > Yes, that's an example that happens to work well with the model that you > > > derived from ublk. > > > > Not only for ublk and device zero copy, it also have the multiple copy example. > > This is what I replied to. Yes, it's an example where the model works > fine. This is not evidence that the model is as generic as it could be, > just that it's an example that fits it. > > > > If you have the opposite case, reading a buffer that is spread across > > > multiple files and then writing it to one target (i.e. first step > > > parallel, second step sequential), you can't represent this well > > > currently. You could work around it by having a NOP leader, but that's > > > not very elegant. > > > > Yeah, linking the group(nop & reads) with the following write does > > work for the above copy case, :-) > > > > > > > > This asymmetry suggests that it's not the perfect interface yet. > > > > 1:N dependency requires the asymmetry, an nothing in this world is perfect, :-) > > But we can try to make it better. > > The asymmetry doesn't contribute anything to the 1:N dependency. As > discussed above, normal links combined with fully parallel (and > therefore symmetrical) groups provide this functionality, too. > > The only real reason I see for justifying it is the group kernel buffer. Yes, but that is also exactly what the N depends on. > > > > If the whole group runs in parallel instead, including the leader, then > > > both examples become symmetrical. You have a group for the parallel I/O > > > and a linked single request for the other operation. > > > > > > Or if both steps are parallel, you can just have two linked groups. 
> > > > I think sqe group can be extended to this way easily by one new flag if > > there is such real use case. We still can use leader's link flag for > > same purpose, an only advance the linking chain until the whole group > > is done. > > > > Then sqe group just degrades to single link group without 1:N dependency > > covered and leader is just for providing group link flag, looks it can be > > well defined and documented, and could be useful too, IMO. > > If you're willing to introduce a second flag, then I'd consider using > that flag to define buffer sharing groups independently of groups for > parallel execution. > > I think it's usually preferable to build the semantics you need by > combining flags that provide independent building blocks with a > straightforward meaning than having a single complex building block and > then flags that modify the way the complex concept works. This way is fine. > > > > > and it can support generic device zero copy: any device internal > > > > buffer can be linked with io_uring operations in this way, which can't > > > > be done by traditional splice/pipe. > > > > > > Is this actually implemented or is it just a potential direction for the > > > future? > > > > It is potential direction since sqe group & provide buffer provides one > > generic framework to export device internal buffer for further consuming > > in zero copy & non-mmap way. > > I see. This contributes a bit to the impression that much of the design > is driven by ublk alone, because it's the only thing that seems to make > use of group buffers so far. > > Anyway, I'm just throwing in a few thoughts and ideas from outside. In > the end, Jens and you need to agree on something. Your thoughts and ideas are really helpful, thanks! Thanks, Ming
On 4/23/24 14:08, Kevin Wolf wrote: > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: >> On 4/7/24 7:03 PM, Ming Lei wrote: >>> SQE group is defined as one chain of SQEs starting with the first sqe that >>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that >>> doesn't have it set, and it is similar with chain of linked sqes. >>> >>> The 1st SQE is group leader, and the other SQEs are group member. The group >>> leader is always freed after all members are completed. Group members >>> aren't submitted until the group leader is completed, and there isn't any >>> dependency among group members, and IOSQE_IO_LINK can't be set for group >>> members, same with IOSQE_IO_DRAIN. >>> >>> Typically the group leader provides or makes resource, and the other members >>> consume the resource, such as scenario of multiple backup, the 1st SQE is to >>> read data from source file into fixed buffer, the other SQEs write data from >>> the same buffer into other destination files. SQE group provides very >>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be >>> submitted in single syscall, no need to submit fs read SQE first, and wait >>> until read SQE is completed, 2) no need to link all write SQEs together, then >>> write SQEs can be submitted to files concurrently. Meantime application is >>> simplified a lot in this way. >>> >>> Another use case is to for supporting generic device zero copy: >>> >>> - the lead SQE is for providing device buffer, which is owned by device or >>> kernel, can't be cross userspace, otherwise easy to cause leak for devil >>> application or panic >>> >>> - member SQEs reads or writes concurrently against the buffer provided by lead >>> SQE >> >> In concept, this looks very similar to "sqe bundles" that I played with >> in the past: >> >> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle >> >> Didn't look too closely yet at the implementation, but in spirit it's >> about the same in that the first entry is processed first, and there's >> no ordering implied between the test of the members of the bundle / >> group. > > When I first read this patch, I wondered if it wouldn't make sense to > allow linking a group with subsequent requests, e.g. first having a few > requests that run in parallel and once all of them have completed > continue with the next linked one sequentially. > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > this. Ming's patch uses a new flag for groups, so the interface would be > more obvious, you simply set the LINK flag on the last member of the > group (or on the leader, doesn't really matter). Of course, this doesn't > mean it has to be implemented now, but there is a clear way forward if > it's wanted. Putting zc aside, links, graphs, groups, it all sounds interesting in concept but let's not fool anyone, all the different ordering relationships between requests proved to be a bad idea. I can complaint for long, error handling is miserable, user handling resubmitting a part of a link is horrible, the concept of errors is hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK and the MSG_WAITALL workaround). The handling and workarounds are leaking into generic paths, e.g. we can't init files when it's the most convenient. For cancellation we're walking links, which need more care than just looking at a request (is cancellation by user_data of a "linked" to a group request even supported?). The list goes on And what does it achieve? 
The infra has matured since the early days; it saves user-kernel transitions at best but not context switching overhead, and not even that if you do wait(1) and happen to catch middle CQEs. And it disables LAZY_WAKE, so CQ side batching with timers and what not is effectively useless with links. So, please, please! Instead of trying to invent a new uber scheme of request linking, which surely wouldn't step on the same problems over and over again, and would definitely be destined to overshadow all previous attempts and finally conquer the world, let's rather focus on minimising the damage from this patchset's zero copy if it's going to be taken. Piggybacking bits on top of links should be just fine. It may help to save space in io_kiocb by unionising with the link fields. And half of the overhead (including completely destroying all inlining in the submission path) can be mitigated by folding it into REQ_F_LINK handling and generally borrowing the code structure for it in the submission path. > The part that looks a bit arbitrary in Ming's patch is that the group > leader is always completed before the rest starts. It makes perfect > sense in the context that this series is really after (enabling zero > copy for ublk), but it doesn't really allow the case you mention in the > SQE bundle commit message, running everything in parallel and getting a > single CQE for the whole group. > > I suppose you could hack around the sequential nature of the first > request by using an extra NOP as the group leader - which isn't any > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but > the group completion would still be missing. (Of course, removing the > sequential first operation would mean that ublk wouldn't have the buffer > ready any more when the other requests try to use it, so that would > defeat the purpose of the series...) > > I wonder if we can still combine both approaches and create some > generally useful infrastructure and not something where it's visible > that it was designed mostly for ublk's special case and other use cases > just happened to be enabled as a side effect.
On 4/24/24 02:39, Ming Lei wrote: > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote: >> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: >>> On 4/7/24 7:03 PM, Ming Lei wrote: >>>> SQE group is defined as one chain of SQEs starting with the first sqe that >>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that >>>> doesn't have it set, and it is similar with chain of linked sqes. >>>> >>>> The 1st SQE is group leader, and the other SQEs are group member. The group >>>> leader is always freed after all members are completed. Group members >>>> aren't submitted until the group leader is completed, and there isn't any >>>> dependency among group members, and IOSQE_IO_LINK can't be set for group >>>> members, same with IOSQE_IO_DRAIN. >>>> >>>> Typically the group leader provides or makes resource, and the other members >>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to >>>> read data from source file into fixed buffer, the other SQEs write data from >>>> the same buffer into other destination files. SQE group provides very >>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be >>>> submitted in single syscall, no need to submit fs read SQE first, and wait >>>> until read SQE is completed, 2) no need to link all write SQEs together, then >>>> write SQEs can be submitted to files concurrently. Meantime application is >>>> simplified a lot in this way. >>>> >>>> Another use case is to for supporting generic device zero copy: >>>> >>>> - the lead SQE is for providing device buffer, which is owned by device or >>>> kernel, can't be cross userspace, otherwise easy to cause leak for devil >>>> application or panic >>>> >>>> - member SQEs reads or writes concurrently against the buffer provided by lead >>>> SQE >>> >>> In concept, this looks very similar to "sqe bundles" that I played with >>> in the past: >>> >>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle >>> >>> Didn't look too closely yet at the implementation, but in spirit it's >>> about the same in that the first entry is processed first, and there's >>> no ordering implied between the test of the members of the bundle / >>> group. >> >> When I first read this patch, I wondered if it wouldn't make sense to >> allow linking a group with subsequent requests, e.g. first having a few >> requests that run in parallel and once all of them have completed >> continue with the next linked one sequentially. >> >> For SQE bundles, you reused the LINK flag, which doesn't easily allow >> this. Ming's patch uses a new flag for groups, so the interface would be >> more obvious, you simply set the LINK flag on the last member of the >> group (or on the leader, doesn't really matter). Of course, this doesn't >> mean it has to be implemented now, but there is a clear way forward if >> it's wanted. > > Reusing LINK for bundle breaks existed link chains(BUNDLE linked to existed > link chain), so I think it may not work. > > The link rule is explicit for sqe group: > > - only group leader can set link flag, which is applied on the whole > group: the next sqe in the link chain won't be started until the > previous linked sqe group is completed > > - link flag can't be set for group members > > Also sqe group doesn't limit async for both group leader and member. > > sqe group vs link & async is covered in the last liburing test code. 
> >> >> The part that looks a bit arbitrary in Ming's patch is that the group >> leader is always completed before the rest starts. It makes perfect >> sense in the context that this series is really after (enabling zero >> copy for ublk), but it doesn't really allow the case you mention in the >> SQE bundle commit message, running everything in parallel and getting a >> single CQE for the whole group. > > I think it should be easy to cover bundle in this way, such as add one new > op IORING_OP_BUNDLE as Jens did, and implement the single CQE for whole group/bundle. > >> >> I suppose you could hack around the sequential nature of the first >> request by using an extra NOP as the group leader - which isn't any >> worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but >> the group completion would still be missing. (Of course, removing the >> sequential first operation would mean that ublk wouldn't have the buffer >> ready any more when the other requests try to use it, so that would >> defeat the purpose of the series...) >> >> I wonder if we can still combine both approaches and create some >> generally useful infrastructure and not something where it's visible >> that it was designed mostly for ublk's special case and other use cases >> just happened to be enabled as a side effect. > > sqe group is actually one generic interface, please see the multiple copy( > copy one file to multiple destinations in single syscall for one range) example > in the last patch, and it can support generic device zero copy: any device internal > buffer can be linked with io_uring operations in this way, which can't > be done by traditional splice/pipe. > > I guess it can be used in network Rx zero copy too, but may depend on actual > network Rx use case. I doubt. With storage same data can be read twice. Socket recv consumes data. Locking a buffer over the duration of another IO doesn't really sound plausible, same we returning a buffer back. It'd be different if you can read the buffer into the userspace if something goes wrong, but perhaps you remember the fused discussion.
On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote: > On 4/23/24 14:08, Kevin Wolf wrote: > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > > > On 4/7/24 7:03 PM, Ming Lei wrote: > > > > SQE group is defined as one chain of SQEs starting with the first sqe that > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > > > doesn't have it set, and it is similar with chain of linked sqes. > > > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > > > leader is always freed after all members are completed. Group members > > > > aren't submitted until the group leader is completed, and there isn't any > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > > > members, same with IOSQE_IO_DRAIN. > > > > > > > > Typically the group leader provides or makes resource, and the other members > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > > > read data from source file into fixed buffer, the other SQEs write data from > > > > the same buffer into other destination files. SQE group provides very > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > > > submitted in single syscall, no need to submit fs read SQE first, and wait > > > > until read SQE is completed, 2) no need to link all write SQEs together, then > > > > write SQEs can be submitted to files concurrently. Meantime application is > > > > simplified a lot in this way. > > > > > > > > Another use case is to for supporting generic device zero copy: > > > > > > > > - the lead SQE is for providing device buffer, which is owned by device or > > > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > > > application or panic > > > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > > > SQE > > > > > > In concept, this looks very similar to "sqe bundles" that I played with > > > in the past: > > > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > > > > > Didn't look too closely yet at the implementation, but in spirit it's > > > about the same in that the first entry is processed first, and there's > > > no ordering implied between the test of the members of the bundle / > > > group. > > > > When I first read this patch, I wondered if it wouldn't make sense to > > allow linking a group with subsequent requests, e.g. first having a few > > requests that run in parallel and once all of them have completed > > continue with the next linked one sequentially. > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > this. Ming's patch uses a new flag for groups, so the interface would be > > more obvious, you simply set the LINK flag on the last member of the > > group (or on the leader, doesn't really matter). Of course, this doesn't > > mean it has to be implemented now, but there is a clear way forward if > > it's wanted. > > Putting zc aside, links, graphs, groups, it all sounds interesting in > concept but let's not fool anyone, all the different ordering > relationships between requests proved to be a bad idea. As Jens mentioned, sqe group is very similar with bundle: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle which is really something io_uring is missing. 
> > I can complaint for long, error handling is miserable, user handling > resubmitting a part of a link is horrible, the concept of errors is > hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK > and the MSG_WAITALL workaround). The handling and workarounds are > leaking into generic paths, e.g. we can't init files when it's the most > convenient. For cancellation we're walking links, which need more care > than just looking at a request (is cancellation by user_data of a > "linked" to a group request even supported?). The list goes on Only the group leader is linked; if the group leader is canceled, all requests in the whole group are canceled. But yes, cancelling by user_data for group members can't be supported, and that can be documented clearly, since the user can still cancel the whole group with the group leader's user_data. > > And what does it achieve? The infra has matured since early days, > it saves user-kernel transitions at best but not context switching > overhead, and not even that if you do wait(1) and happen to catch > middle CQEs. And it disables LAZY_WAKE, so CQ side batching with > timers and what not is effectively useless with links. It isn't only about the context switch: sqe group supports the 1:N and N:M dependencies that io_uring is missing, and it also makes async applications easier to write by saving extra context switches, which only add intermediate states for the application. > > So, please, please! instead of trying to invent a new uber scheme > of request linking, which surely wouldn't step on same problems > over and over again, and would definitely be destined to overshadow > all previous attempts and finally conquer the world, let's rather > focus on minimasing the damage from this patchset's zero copy if > it's going to be taken. One key problem for zero copy is the lifetime of the kernel buffer, which can't cross OPs; that is why sqe group is introduced, to align the kernel buffer lifetime with the group. Thanks, Ming
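A small sketch of the cancellation model described above: the application tags the group leader and cancels the whole group through that tag. The "cancelling the leader cancels all members" behaviour is this series' proposed semantics, and the tag value is illustrative; the cancel helpers themselves are existing liburing API.

    /* Sketch: cancel an in-flight group via the leader's user_data.
     * Group members are not individually addressable for cancellation
     * under the proposed semantics; only the leader is. */
    #include <liburing.h>

    #define GROUP_LEADER_TAG 0x1001ULL  /* illustrative tag set earlier on the leader SQE */

    static int cancel_group(struct io_uring *ring)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -1;
            io_uring_prep_cancel64(sqe, GROUP_LEADER_TAG, 0);
            io_uring_sqe_set_data64(sqe, 0);    /* user_data of the cancel request itself */
            return io_uring_submit(ring);
    }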
On Mon, Apr 29, 2024 at 04:48:37PM +0100, Pavel Begunkov wrote: > On 4/24/24 02:39, Ming Lei wrote: > > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote: > > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > > > > On 4/7/24 7:03 PM, Ming Lei wrote: > > > > > SQE group is defined as one chain of SQEs starting with the first sqe that > > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > > > > doesn't have it set, and it is similar with chain of linked sqes. > > > > > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > > > > leader is always freed after all members are completed. Group members > > > > > aren't submitted until the group leader is completed, and there isn't any > > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > > > > members, same with IOSQE_IO_DRAIN. > > > > > > > > > > Typically the group leader provides or makes resource, and the other members > > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > > > > read data from source file into fixed buffer, the other SQEs write data from > > > > > the same buffer into other destination files. SQE group provides very > > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > > > > submitted in single syscall, no need to submit fs read SQE first, and wait > > > > > until read SQE is completed, 2) no need to link all write SQEs together, then > > > > > write SQEs can be submitted to files concurrently. Meantime application is > > > > > simplified a lot in this way. > > > > > > > > > > Another use case is to for supporting generic device zero copy: > > > > > > > > > > - the lead SQE is for providing device buffer, which is owned by device or > > > > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > > > > application or panic > > > > > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > > > > SQE > > > > > > > > In concept, this looks very similar to "sqe bundles" that I played with > > > > in the past: > > > > > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > > > > > > > Didn't look too closely yet at the implementation, but in spirit it's > > > > about the same in that the first entry is processed first, and there's > > > > no ordering implied between the test of the members of the bundle / > > > > group. > > > > > > When I first read this patch, I wondered if it wouldn't make sense to > > > allow linking a group with subsequent requests, e.g. first having a few > > > requests that run in parallel and once all of them have completed > > > continue with the next linked one sequentially. > > > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > > this. Ming's patch uses a new flag for groups, so the interface would be > > > more obvious, you simply set the LINK flag on the last member of the > > > group (or on the leader, doesn't really matter). Of course, this doesn't > > > mean it has to be implemented now, but there is a clear way forward if > > > it's wanted. > > > > Reusing LINK for bundle breaks existed link chains(BUNDLE linked to existed > > link chain), so I think it may not work. 
> > > > The link rule is explicit for sqe group: > > > > - only group leader can set link flag, which is applied on the whole > > group: the next sqe in the link chain won't be started until the > > previous linked sqe group is completed > > > > - link flag can't be set for group members > > > > Also sqe group doesn't limit async for both group leader and member. > > > > sqe group vs link & async is covered in the last liburing test code. > > > > > > > > The part that looks a bit arbitrary in Ming's patch is that the group > > > leader is always completed before the rest starts. It makes perfect > > > sense in the context that this series is really after (enabling zero > > > copy for ublk), but it doesn't really allow the case you mention in the > > > SQE bundle commit message, running everything in parallel and getting a > > > single CQE for the whole group. > > > > I think it should be easy to cover bundle in this way, such as add one new > > op IORING_OP_BUNDLE as Jens did, and implement the single CQE for whole group/bundle. > > > > > > > > I suppose you could hack around the sequential nature of the first > > > request by using an extra NOP as the group leader - which isn't any > > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but > > > the group completion would still be missing. (Of course, removing the > > > sequential first operation would mean that ublk wouldn't have the buffer > > > ready any more when the other requests try to use it, so that would > > > defeat the purpose of the series...) > > > > > > I wonder if we can still combine both approaches and create some > > > generally useful infrastructure and not something where it's visible > > > that it was designed mostly for ublk's special case and other use cases > > > just happened to be enabled as a side effect. > > > > sqe group is actually one generic interface, please see the multiple copy( > > copy one file to multiple destinations in single syscall for one range) example > > in the last patch, and it can support generic device zero copy: any device internal > > buffer can be linked with io_uring operations in this way, which can't > > be done by traditional splice/pipe. > > > > I guess it can be used in network Rx zero copy too, but may depend on actual > > network Rx use case. > > I doubt. With storage same data can be read twice. Socket recv consumes No, we don't depend on read twice of storage. > data. Locking a buffer over the duration of another IO doesn't really sound > plausible, same we returning a buffer back. It'd be different if you can > read the buffer into the userspace if something goes wrong, but perhaps > you remember the fused discussion. As I mentioned, it depends on actual use case. If the received data can be consumed by following io_uring OPs, such as be sent to network or FS, this way works just fine, but if data needs to live longer or further processing, the data can still be saved to userpsace by io_uring OPs linked to the sqe group. Thanks, Ming
On 4/30/24 04:03, Ming Lei wrote: > On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote: >> On 4/23/24 14:08, Kevin Wolf wrote: >>> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: >>>> On 4/7/24 7:03 PM, Ming Lei wrote: >>>>> SQE group is defined as one chain of SQEs starting with the first sqe that >>>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that >>>>> doesn't have it set, and it is similar with chain of linked sqes. >>>>> >>>>> The 1st SQE is group leader, and the other SQEs are group member. The group >>>>> leader is always freed after all members are completed. Group members >>>>> aren't submitted until the group leader is completed, and there isn't any >>>>> dependency among group members, and IOSQE_IO_LINK can't be set for group >>>>> members, same with IOSQE_IO_DRAIN. >>>>> >>>>> Typically the group leader provides or makes resource, and the other members >>>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to >>>>> read data from source file into fixed buffer, the other SQEs write data from >>>>> the same buffer into other destination files. SQE group provides very >>>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be >>>>> submitted in single syscall, no need to submit fs read SQE first, and wait >>>>> until read SQE is completed, 2) no need to link all write SQEs together, then >>>>> write SQEs can be submitted to files concurrently. Meantime application is >>>>> simplified a lot in this way. >>>>> >>>>> Another use case is to for supporting generic device zero copy: >>>>> >>>>> - the lead SQE is for providing device buffer, which is owned by device or >>>>> kernel, can't be cross userspace, otherwise easy to cause leak for devil >>>>> application or panic >>>>> >>>>> - member SQEs reads or writes concurrently against the buffer provided by lead >>>>> SQE >>>> >>>> In concept, this looks very similar to "sqe bundles" that I played with >>>> in the past: >>>> >>>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle >>>> >>>> Didn't look too closely yet at the implementation, but in spirit it's >>>> about the same in that the first entry is processed first, and there's >>>> no ordering implied between the test of the members of the bundle / >>>> group. >>> >>> When I first read this patch, I wondered if it wouldn't make sense to >>> allow linking a group with subsequent requests, e.g. first having a few >>> requests that run in parallel and once all of them have completed >>> continue with the next linked one sequentially. >>> >>> For SQE bundles, you reused the LINK flag, which doesn't easily allow >>> this. Ming's patch uses a new flag for groups, so the interface would be >>> more obvious, you simply set the LINK flag on the last member of the >>> group (or on the leader, doesn't really matter). Of course, this doesn't >>> mean it has to be implemented now, but there is a clear way forward if >>> it's wanted. >> >> Putting zc aside, links, graphs, groups, it all sounds interesting in >> concept but let's not fool anyone, all the different ordering >> relationships between requests proved to be a bad idea. > > As Jens mentioned, sqe group is very similar with bundle: > > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle > > which is really something io_uring is missing. One could've said same about links, retrospectively I argue that it was a mistake, so I pretty much doubt arguments like "io_uring is missing it". 
Another thing is that zero copy, which is not possible to implement by returning to the userspace. >> I can complaint for long, error handling is miserable, user handling >> resubmitting a part of a link is horrible, the concept of errors is >> hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK >> and the MSG_WAITALL workaround). The handling and workarounds are >> leaking into generic paths, e.g. we can't init files when it's the most >> convenient. For cancellation we're walking links, which need more care >> than just looking at a request (is cancellation by user_data of a >> "linked" to a group request even supported?). The list goes on > > Only the group leader is linked, if the group leader is canceled, all > requests in the whole group will be canceled. > > But yes, cancelling by user_data for group members can't be supported, > and it can be documented clearly, since user still can cancel the whole > group with group leader's user_data. Which means it'd break the case REQ_F_INFLIGHT covers, and you need to disallow linking REQ_F_INFLIGHT marked requests. >> And what does it achieve? The infra has matured since early days, >> it saves user-kernel transitions at best but not context switching >> overhead, and not even that if you do wait(1) and happen to catch >> middle CQEs. And it disables LAZY_WAKE, so CQ side batching with >> timers and what not is effectively useless with links. > > Not only the context switch, it supports 1:N or N:M dependency which I completely missed, how N:M is supported? That starting to sound terrifying. > is missing in io_uring, but also makes async application easier to write by > saving extra context switches, which just adds extra intermediate states for > application. You're still executing requests (i.e. ->issue) primarily from the submitter task context, they would still fly back to the task and wake it up. You may save something by completing all of them together via that refcounting, but you might just as well try to batch CQ, which is a more generic issue. It's not clear what context switches you save then. As for simplicity, using the link example and considering error handling, it only complicates it. In case of an error you need to figure out a middle req failed, collect all failed CQEs linked to it and automatically cancelled (unless SKIP_COMPLETE is used), and then resubmit the failed. That's great your reads are idempotent and presumably you don't have to resubmit half a link, but in the grand picture of things it's rather one of use cases where a generic feature can be used. >> So, please, please! instead of trying to invent a new uber scheme >> of request linking, which surely wouldn't step on same problems >> over and over again, and would definitely be destined to overshadow >> all previous attempts and finally conquer the world, let's rather >> focus on minimasing the damage from this patchset's zero copy if >> it's going to be taken. > > One key problem for zero copy is lifetime of the kernel buffer, which > can't cross OPs, that is why sqe group is introduced, for aligning > kernel buffer lifetime with the group. Right, which is why I'm saying if we're leaving groups with zero copy, let's rather try to make them simple and not intrusive as much as possible, instead of creating an unsupportable overarching beast out of it, which would fail as a generic feature.
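To illustrate the error-handling burden referred to here with plain links: when one request in a chain fails, the rest of the chain completes with -ECANCELED, and userspace has to separate real failures from that fallout and decide what to resubmit. A rough sketch of that bookkeeping follows; the resubmission policy is left entirely to the application.

    /* Sketch: reap a link chain and classify completions.  With links, a
     * failure in the middle turns the remaining requests into -ECANCELED
     * completions (unless CQE_SKIP is used), and only the application
     * knows which of them are safe to resubmit. */
    #include <liburing.h>
    #include <errno.h>

    static void reap_chain(struct io_uring *ring, unsigned nr)
    {
            struct io_uring_cqe *cqe;
            unsigned i;

            for (i = 0; i < nr; i++) {
                    if (io_uring_wait_cqe(ring, &cqe))
                            break;
                    if (cqe->res == -ECANCELED) {
                            /* collateral from an earlier failure in the chain:
                             * candidate for resubmission once the cause is fixed */
                    } else if (cqe->res < 0) {
                            /* the request that actually failed */
                    } else {
                            /* completed successfully */
                    }
                    io_uring_cqe_seen(ring, cqe);
            }
    }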
On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote: > On 4/30/24 04:03, Ming Lei wrote: > > On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote: > > > On 4/23/24 14:08, Kevin Wolf wrote: > > > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > > > > > On 4/7/24 7:03 PM, Ming Lei wrote: > > > > > > SQE group is defined as one chain of SQEs starting with the first sqe that > > > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > > > > > doesn't have it set, and it is similar with chain of linked sqes. > > > > > > > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > > > > > leader is always freed after all members are completed. Group members > > > > > > aren't submitted until the group leader is completed, and there isn't any > > > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > > > > > members, same with IOSQE_IO_DRAIN. > > > > > > > > > > > > Typically the group leader provides or makes resource, and the other members > > > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > > > > > read data from source file into fixed buffer, the other SQEs write data from > > > > > > the same buffer into other destination files. SQE group provides very > > > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > > > > > submitted in single syscall, no need to submit fs read SQE first, and wait > > > > > > until read SQE is completed, 2) no need to link all write SQEs together, then > > > > > > write SQEs can be submitted to files concurrently. Meantime application is > > > > > > simplified a lot in this way. > > > > > > > > > > > > Another use case is to for supporting generic device zero copy: > > > > > > > > > > > > - the lead SQE is for providing device buffer, which is owned by device or > > > > > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > > > > > application or panic > > > > > > > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > > > > > SQE > > > > > > > > > > In concept, this looks very similar to "sqe bundles" that I played with > > > > > in the past: > > > > > > > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > > > > > > > > > Didn't look too closely yet at the implementation, but in spirit it's > > > > > about the same in that the first entry is processed first, and there's > > > > > no ordering implied between the test of the members of the bundle / > > > > > group. > > > > > > > > When I first read this patch, I wondered if it wouldn't make sense to > > > > allow linking a group with subsequent requests, e.g. first having a few > > > > requests that run in parallel and once all of them have completed > > > > continue with the next linked one sequentially. > > > > > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > > > this. Ming's patch uses a new flag for groups, so the interface would be > > > > more obvious, you simply set the LINK flag on the last member of the > > > > group (or on the leader, doesn't really matter). Of course, this doesn't > > > > mean it has to be implemented now, but there is a clear way forward if > > > > it's wanted. > > > > > > Putting zc aside, links, graphs, groups, it all sounds interesting in > > > concept but let's not fool anyone, all the different ordering > > > relationships between requests proved to be a bad idea. 
> > > > As Jens mentioned, sqe group is very similar with bundle: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle > > > > which is really something io_uring is missing. > > One could've said same about links, retrospectively I argue that it > was a mistake, so I pretty much doubt arguments like "io_uring is > missing it". Another thing is that zero copy, which is not possible > to implement by returning to the userspace. > > > > I can complaint for long, error handling is miserable, user handling > > > resubmitting a part of a link is horrible, the concept of errors is > > > hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK > > > and the MSG_WAITALL workaround). The handling and workarounds are > > > leaking into generic paths, e.g. we can't init files when it's the most > > > convenient. For cancellation we're walking links, which need more care > > > than just looking at a request (is cancellation by user_data of a > > > "linked" to a group request even supported?). The list goes on > > > > Only the group leader is linked, if the group leader is canceled, all > > requests in the whole group will be canceled. > > > > But yes, cancelling by user_data for group members can't be supported, > > and it can be documented clearly, since user still can cancel the whole > > group with group leader's user_data. > > Which means it'd break the case REQ_F_INFLIGHT covers, and you need > to disallow linking REQ_F_INFLIGHT marked requests. Both io_match_linked() and io_match_task() only iterates over req's link chain, and only the group leader can appear in this link chain, which is exactly the usual handling. So care to explain it a bit what the real link issue is about sqe group? > > > > And what does it achieve? The infra has matured since early days, > > > it saves user-kernel transitions at best but not context switching > > > overhead, and not even that if you do wait(1) and happen to catch > > > middle CQEs. And it disables LAZY_WAKE, so CQ side batching with > > > timers and what not is effectively useless with links. > > > > Not only the context switch, it supports 1:N or N:M dependency which > > I completely missed, how N:M is supported? That starting to sound > terrifying. N:M is actually from Kevin's idea. sqe group can be made to be more flexible by: Inside the group, all SQEs are submitted in parallel, so there isn't any dependency among SQEs in one group. The 1st SQE is group leader, and the other SQEs are group member. The whole group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and the two flags can't be set for group members. When the group is in one link chain, this group isn't submitted until the previous SQE or group is completed. And the following SQE or group can't be started if this group isn't completed. When IOSQE_IO_DRAIN is set for group leader, all requests in this group and previous requests submitted are drained. Given IOSQE_IO_DRAIN can be set for group leader only, we respect IO_DRAIN for SQE group by always completing group leader as the last on in the group. SQE group provides flexible way to support N:M dependency, such as: - group A is chained with group B together by IOSQE_IO_LINK - group A has N SQEs - group B has M SQEs then M SQEs in group B depend on N SQEs in group A. > > > is missing in io_uring, but also makes async application easier to write by > > saving extra context switches, which just adds extra intermediate states for > > application. 
> > You're still executing requests (i.e. ->issue) primarily from the > submitter task context, they would still fly back to the task and > wake it up. You may save something by completing all of them > together via that refcounting, but you might just as well try to > batch CQ, which is a more generic issue. It's not clear what > context switches you save then. Wrt. the above N:M example, one io_uring_enter() is enough, and it can't be done in single context switch without sqe group, please see the liburing test code: https://lore.kernel.org/io-uring/ZiHA+pN28hRdprhX@fedora/T/#ma755c500eab0b7dc8c1473448dd98f093097e066 > > As for simplicity, using the link example and considering error > handling, it only complicates it. In case of an error you need to > figure out a middle req failed, collect all failed CQEs linked to > it and automatically cancelled (unless SKIP_COMPLETE is used), and > then resubmit the failed. That's great your reads are idempotent > and presumably you don't have to resubmit half a link, but in the > grand picture of things it's rather one of use cases where a generic > feature can be used. SQE group doesn't change the current link implementation, and N:M dependency is built over IOSQE_IO_LINK actually. > > > > So, please, please! instead of trying to invent a new uber scheme > > > of request linking, which surely wouldn't step on same problems > > > over and over again, and would definitely be destined to overshadow > > > all previous attempts and finally conquer the world, let's rather > > > focus on minimasing the damage from this patchset's zero copy if > > > it's going to be taken. > > > > One key problem for zero copy is lifetime of the kernel buffer, which > > can't cross OPs, that is why sqe group is introduced, for aligning > > kernel buffer lifetime with the group. > > Right, which is why I'm saying if we're leaving groups with zero > copy, let's rather try to make them simple and not intrusive as > much as possible, instead of creating an unsupportable overarching > beast out of it, which would fail as a generic feature. Then it degraded to the original fused command, :-) Thanks, Ming
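A sketch of the N:M case being described: group A (N reads) is linked, via IOSQE_IO_LINK on its leader, to group B (M writes), and both groups go out in a single io_uring_enter(). The link-applies-to-the-whole-group rule and the in-group parallelism follow the refined semantics above, not upstream behaviour; set_group_flag() is again a hypothetical helper for the proposed extended flag, and the buffers and offsets are purely illustrative.

    /* Sketch: M writes in group B depend on N reads in group A (n, m >= 2
     * assumed).  One submission covers both groups; nothing in group B is
     * issued until all of group A has completed. */
    #include <liburing.h>

    static void set_group_flag(struct io_uring_sqe *sqe);  /* hypothetical, see earlier sketch */

    static void queue_n_m(struct io_uring *ring, int src, int dst,
                          void *bufs[], unsigned len, unsigned n, unsigned m)
    {
            struct io_uring_sqe *sqe;
            unsigned i;

            /* group A: leader + members, all reads issued in parallel */
            for (i = 0; i < n; i++) {
                    sqe = io_uring_get_sqe(ring);
                    io_uring_prep_read(sqe, src, bufs[i], len, (__u64)i * len);
                    if (i + 1 < n)
                            set_group_flag(sqe);    /* last SQE ends the group */
                    if (i == 0)
                            sqe->flags |= IOSQE_IO_LINK;    /* link applies to the whole group */
            }

            /* group B: only starts after every request in group A completes */
            for (i = 0; i < m; i++) {
                    sqe = io_uring_get_sqe(ring);
                    io_uring_prep_write(sqe, dst, bufs[i % n], len, (__u64)i * len);
                    if (i + 1 < m)
                            set_group_flag(sqe);
            }

            io_uring_submit_and_wait(ring, n + m);  /* single io_uring_enter() */
    }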
On 4/30/24 16:00, Ming Lei wrote: > On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote: >> On 4/30/24 04:03, Ming Lei wrote: >>> On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote: >>>> On 4/23/24 14:08, Kevin Wolf wrote: >>>>> Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: >>>>>> On 4/7/24 7:03 PM, Ming Lei wrote: >>>>>>> SQE group is defined as one chain of SQEs starting with the first sqe that >>>>>>> has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that >>>>>>> doesn't have it set, and it is similar with chain of linked sqes. >>>>>>> >>>>>>> The 1st SQE is group leader, and the other SQEs are group member. The group >>>>>>> leader is always freed after all members are completed. Group members >>>>>>> aren't submitted until the group leader is completed, and there isn't any >>>>>>> dependency among group members, and IOSQE_IO_LINK can't be set for group >>>>>>> members, same with IOSQE_IO_DRAIN. >>>>>>> >>>>>>> Typically the group leader provides or makes resource, and the other members >>>>>>> consume the resource, such as scenario of multiple backup, the 1st SQE is to >>>>>>> read data from source file into fixed buffer, the other SQEs write data from >>>>>>> the same buffer into other destination files. SQE group provides very >>>>>>> efficient way to complete this task: 1) fs write SQEs and fs read SQE can be >>>>>>> submitted in single syscall, no need to submit fs read SQE first, and wait >>>>>>> until read SQE is completed, 2) no need to link all write SQEs together, then >>>>>>> write SQEs can be submitted to files concurrently. Meantime application is >>>>>>> simplified a lot in this way. >>>>>>> >>>>>>> Another use case is to for supporting generic device zero copy: >>>>>>> >>>>>>> - the lead SQE is for providing device buffer, which is owned by device or >>>>>>> kernel, can't be cross userspace, otherwise easy to cause leak for devil >>>>>>> application or panic >>>>>>> >>>>>>> - member SQEs reads or writes concurrently against the buffer provided by lead >>>>>>> SQE >>>>>> >>>>>> In concept, this looks very similar to "sqe bundles" that I played with >>>>>> in the past: >>>>>> >>>>>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle >>>>>> >>>>>> Didn't look too closely yet at the implementation, but in spirit it's >>>>>> about the same in that the first entry is processed first, and there's >>>>>> no ordering implied between the test of the members of the bundle / >>>>>> group. >>>>> >>>>> When I first read this patch, I wondered if it wouldn't make sense to >>>>> allow linking a group with subsequent requests, e.g. first having a few >>>>> requests that run in parallel and once all of them have completed >>>>> continue with the next linked one sequentially. >>>>> >>>>> For SQE bundles, you reused the LINK flag, which doesn't easily allow >>>>> this. Ming's patch uses a new flag for groups, so the interface would be >>>>> more obvious, you simply set the LINK flag on the last member of the >>>>> group (or on the leader, doesn't really matter). Of course, this doesn't >>>>> mean it has to be implemented now, but there is a clear way forward if >>>>> it's wanted. >>>> >>>> Putting zc aside, links, graphs, groups, it all sounds interesting in >>>> concept but let's not fool anyone, all the different ordering >>>> relationships between requests proved to be a bad idea. 
>>> >>> As Jens mentioned, sqe group is very similar with bundle: >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle >>> >>> which is really something io_uring is missing. >> >> One could've said same about links, retrospectively I argue that it >> was a mistake, so I pretty much doubt arguments like "io_uring is >> missing it". Another thing is that zero copy, which is not possible >> to implement by returning to the userspace. >> >>>> I can complaint for long, error handling is miserable, user handling >>>> resubmitting a part of a link is horrible, the concept of errors is >>>> hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK >>>> and the MSG_WAITALL workaround). The handling and workarounds are >>>> leaking into generic paths, e.g. we can't init files when it's the most >>>> convenient. For cancellation we're walking links, which need more care >>>> than just looking at a request (is cancellation by user_data of a >>>> "linked" to a group request even supported?). The list goes on >>> >>> Only the group leader is linked, if the group leader is canceled, all >>> requests in the whole group will be canceled. >>> >>> But yes, cancelling by user_data for group members can't be supported, >>> and it can be documented clearly, since user still can cancel the whole >>> group with group leader's user_data. >> >> Which means it'd break the case REQ_F_INFLIGHT covers, and you need >> to disallow linking REQ_F_INFLIGHT marked requests. > > Both io_match_linked() and io_match_task() only iterates over req's > link chain, and only the group leader can appear in this link chain, > which is exactly the usual handling. > > So care to explain it a bit what the real link issue is about sqe group? Because of ref deps when a task exits it has to cancel REQ_F_INFLIGHT requests, and therefore they should be discoverable. The flag is only for POLL_ADD requests polling io_uring fds, should be good enough if you disallow such requests from grouping. >>>> And what does it achieve? The infra has matured since early days, >>>> it saves user-kernel transitions at best but not context switching >>>> overhead, and not even that if you do wait(1) and happen to catch >>>> middle CQEs. And it disables LAZY_WAKE, so CQ side batching with >>>> timers and what not is effectively useless with links. >>> >>> Not only the context switch, it supports 1:N or N:M dependency which >> >> I completely missed, how N:M is supported? That starting to sound >> terrifying. > > N:M is actually from Kevin's idea. > > sqe group can be made to be more flexible by: > > Inside the group, all SQEs are submitted in parallel, so there isn't any > dependency among SQEs in one group. > > The 1st SQE is group leader, and the other SQEs are group member. The whole > group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and > the two flags can't be set for group members. > > When the group is in one link chain, this group isn't submitted until > the previous SQE or group is completed. And the following SQE or group > can't be started if this group isn't completed. > > When IOSQE_IO_DRAIN is set for group leader, all requests in this group > and previous requests submitted are drained. Given IOSQE_IO_DRAIN can > be set for group leader only, we respect IO_DRAIN for SQE group by > always completing group leader as the last on in the group. 
> > SQE group provides flexible way to support N:M dependency, such as: > > - group A is chained with group B together by IOSQE_IO_LINK > - group A has N SQEs > - group B has M SQEs > > then M SQEs in group B depend on N SQEs in group A. In other words, linking groups together with basically no extra rules. Fwiw, sounds generic, but if there are complications with IOSQE_IO_DRAIN that I don't immediately see, it'd be more reasonable to just disable it. >>> is missing in io_uring, but also makes async application easier to write by >>> saving extra context switches, which just adds extra intermediate states for >>> application. >> >> You're still executing requests (i.e. ->issue) primarily from the >> submitter task context, they would still fly back to the task and >> wake it up. You may save something by completing all of them >> together via that refcounting, but you might just as well try to >> batch CQ, which is a more generic issue. It's not clear what >> context switches you save then. > > Wrt. the above N:M example, one io_uring_enter() is enough, and > it can't be done in single context switch without sqe group, please > see the liburing test code: > > https://lore.kernel.org/io-uring/ZiHA+pN28hRdprhX@fedora/T/#ma755c500eab0b7dc8c1473448dd98f093097e066 > >> >> As for simplicity, using the link example and considering error >> handling, it only complicates it. In case of an error you need to >> figure out a middle req failed, collect all failed CQEs linked to >> it and automatically cancelled (unless SKIP_COMPLETE is used), and >> then resubmit the failed. That's great your reads are idempotent >> and presumably you don't have to resubmit half a link, but in the >> grand picture of things it's rather one of use cases where a generic >> feature can be used. > > SQE group doesn't change the current link implementation, and N:M > dependency is built over IOSQE_IO_LINK actually. > >> >>>> So, please, please! instead of trying to invent a new uber scheme >>>> of request linking, which surely wouldn't step on same problems >>>> over and over again, and would definitely be destined to overshadow >>>> all previous attempts and finally conquer the world, let's rather >>>> focus on minimasing the damage from this patchset's zero copy if >>>> it's going to be taken. >>> >>> One key problem for zero copy is lifetime of the kernel buffer, which >>> can't cross OPs, that is why sqe group is introduced, for aligning >>> kernel buffer lifetime with the group. >> >> Right, which is why I'm saying if we're leaving groups with zero >> copy, let's rather try to make them simple and not intrusive as >> much as possible, instead of creating an unsupportable overarching >> beast out of it, which would fail as a generic feature. > > Then it degraded to the original fused command, :-) I'm not arguing about restricting it to 1 request in a group apart from the master/leader/etc., if that's what you mean. The argument is rather to limit the overhead and abstraction leakage into hot paths. For example that N:M, all that sounds great on paper until it sees the harsh reality. And instead of linking groups together, you can perfectly fine be sending one group at a time and queuing the next group from the userspace when the previous completes. And that would save space in io_kiocb, maybe(?) a flag, maybe something else. 
Regardless of interoperability with links, I'd also prefer it to be folded into the link code structure and state machine, so it's not spread all over the submission/completion paths adding overhead for one narrow feature, including un-inlining the entire submission path. E.g. there should be no separate hand-coded, duplicated SQE/request assembly like in io_init_req_group(). Instead it should be able to consume the SQEs and requests it's given from submit_sqes(), and e.g. process grouping in io_submit_sqe() around the place where requests are linked.
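For comparison, the userspace-driven alternative mentioned above: send one group at a time and queue the next group when the previous one completes. It costs one extra io_uring_enter() per stage relative to linked groups, which is the trade-off being argued about here; queue_group_a() and queue_group_b() are illustrative stand-ins for whatever fills each group's SQEs.

    /* Sketch: drive the A-then-B dependency from userspace instead of
     * linking groups in the kernel.  Two submit/wait rounds instead of one. */
    #include <liburing.h>

    static unsigned queue_group_a(struct io_uring *ring);   /* illustrative */
    static unsigned queue_group_b(struct io_uring *ring);   /* illustrative */

    static int run_two_stages(struct io_uring *ring)
    {
            struct io_uring_cqe *cqe;
            unsigned i, nr;
            int res;

            nr = queue_group_a(ring);               /* fills group A's SQEs */
            io_uring_submit_and_wait(ring, nr);     /* first io_uring_enter() */
            for (i = 0; i < nr; i++) {
                    if (io_uring_wait_cqe(ring, &cqe))
                            return -1;
                    res = cqe->res;
                    io_uring_cqe_seen(ring, cqe);
                    if (res < 0)
                            return res;             /* per-request error policy goes here */
            }

            nr = queue_group_b(ring);               /* fills group B's SQEs */
            io_uring_submit_and_wait(ring, nr);     /* second io_uring_enter() */
            /* reap group B's completions the same way */
            return 0;
    }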
On 4/30/24 16:00, Ming Lei wrote: > On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote: ... >>>> And what does it achieve? The infra has matured since early days, >>>> it saves user-kernel transitions at best but not context switching >>>> overhead, and not even that if you do wait(1) and happen to catch >>>> middle CQEs. And it disables LAZY_WAKE, so CQ side batching with >>>> timers and what not is effectively useless with links. >>> >>> Not only the context switch, it supports 1:N or N:M dependency which >> >> I completely missed, how N:M is supported? That starting to sound >> terrifying. > > N:M is actually from Kevin's idea. > > sqe group can be made to be more flexible by: > > Inside the group, all SQEs are submitted in parallel, so there isn't any > dependency among SQEs in one group. > > The 1st SQE is group leader, and the other SQEs are group member. The whole > group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and > the two flags can't be set for group members. > > When the group is in one link chain, this group isn't submitted until > the previous SQE or group is completed. And the following SQE or group > can't be started if this group isn't completed. > > When IOSQE_IO_DRAIN is set for group leader, all requests in this group > and previous requests submitted are drained. Given IOSQE_IO_DRAIN can > be set for group leader only, we respect IO_DRAIN for SQE group by > always completing group leader as the last on in the group. > > SQE group provides flexible way to support N:M dependency, such as: > > - group A is chained with group B together by IOSQE_IO_LINK > - group A has N SQEs > - group B has M SQEs > > then M SQEs in group B depend on N SQEs in group A. > > >> >>> is missing in io_uring, but also makes async application easier to write by >>> saving extra context switches, which just adds extra intermediate states for >>> application. >> >> You're still executing requests (i.e. ->issue) primarily from the >> submitter task context, they would still fly back to the task and >> wake it up. You may save something by completing all of them >> together via that refcounting, but you might just as well try to >> batch CQ, which is a more generic issue. It's not clear what >> context switches you save then. > > Wrt. the above N:M example, one io_uring_enter() is enough, and > it can't be done in single context switch without sqe group, please > see the liburing test code: Do you mean doing all that in a single system call? The main performance problem for io_uring is waiting, i.e. schedule()ing the task out and in, that's what I meant by context switching.
On Thu, May 02, 2024 at 03:09:13PM +0100, Pavel Begunkov wrote: > On 4/30/24 16:00, Ming Lei wrote: > > On Tue, Apr 30, 2024 at 01:27:10PM +0100, Pavel Begunkov wrote: > > > On 4/30/24 04:03, Ming Lei wrote: > > > > On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote: > > > > > On 4/23/24 14:08, Kevin Wolf wrote: > > > > > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben: > > > > > > > On 4/7/24 7:03 PM, Ming Lei wrote: > > > > > > > > SQE group is defined as one chain of SQEs starting with the first sqe that > > > > > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that > > > > > > > > doesn't have it set, and it is similar with chain of linked sqes. > > > > > > > > > > > > > > > > The 1st SQE is group leader, and the other SQEs are group member. The group > > > > > > > > leader is always freed after all members are completed. Group members > > > > > > > > aren't submitted until the group leader is completed, and there isn't any > > > > > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group > > > > > > > > members, same with IOSQE_IO_DRAIN. > > > > > > > > > > > > > > > > Typically the group leader provides or makes resource, and the other members > > > > > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to > > > > > > > > read data from source file into fixed buffer, the other SQEs write data from > > > > > > > > the same buffer into other destination files. SQE group provides very > > > > > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be > > > > > > > > submitted in single syscall, no need to submit fs read SQE first, and wait > > > > > > > > until read SQE is completed, 2) no need to link all write SQEs together, then > > > > > > > > write SQEs can be submitted to files concurrently. Meantime application is > > > > > > > > simplified a lot in this way. > > > > > > > > > > > > > > > > Another use case is to for supporting generic device zero copy: > > > > > > > > > > > > > > > > - the lead SQE is for providing device buffer, which is owned by device or > > > > > > > > kernel, can't be cross userspace, otherwise easy to cause leak for devil > > > > > > > > application or panic > > > > > > > > > > > > > > > > - member SQEs reads or writes concurrently against the buffer provided by lead > > > > > > > > SQE > > > > > > > > > > > > > > In concept, this looks very similar to "sqe bundles" that I played with > > > > > > > in the past: > > > > > > > > > > > > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle > > > > > > > > > > > > > > Didn't look too closely yet at the implementation, but in spirit it's > > > > > > > about the same in that the first entry is processed first, and there's > > > > > > > no ordering implied between the test of the members of the bundle / > > > > > > > group. > > > > > > > > > > > > When I first read this patch, I wondered if it wouldn't make sense to > > > > > > allow linking a group with subsequent requests, e.g. first having a few > > > > > > requests that run in parallel and once all of them have completed > > > > > > continue with the next linked one sequentially. > > > > > > > > > > > > For SQE bundles, you reused the LINK flag, which doesn't easily allow > > > > > > this. Ming's patch uses a new flag for groups, so the interface would be > > > > > > more obvious, you simply set the LINK flag on the last member of the > > > > > > group (or on the leader, doesn't really matter). 
Of course, this doesn't > > > > > > mean it has to be implemented now, but there is a clear way forward if > > > > > > it's wanted. > > > > > > > > > > Putting zc aside, links, graphs, groups, it all sounds interesting in > > > > > concept but let's not fool anyone, all the different ordering > > > > > relationships between requests proved to be a bad idea. > > > > > > > > As Jens mentioned, sqe group is very similar with bundle: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle > > > > > > > > which is really something io_uring is missing. > > > > > > One could've said same about links, retrospectively I argue that it > > > was a mistake, so I pretty much doubt arguments like "io_uring is > > > missing it". Another thing is that zero copy, which is not possible > > > to implement by returning to the userspace. > > > > > > > > I can complaint for long, error handling is miserable, user handling > > > > > resubmitting a part of a link is horrible, the concept of errors is > > > > > hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK > > > > > and the MSG_WAITALL workaround). The handling and workarounds are > > > > > leaking into generic paths, e.g. we can't init files when it's the most > > > > > convenient. For cancellation we're walking links, which need more care > > > > > than just looking at a request (is cancellation by user_data of a > > > > > "linked" to a group request even supported?). The list goes on > > > > > > > > Only the group leader is linked, if the group leader is canceled, all > > > > requests in the whole group will be canceled. > > > > > > > > But yes, cancelling by user_data for group members can't be supported, > > > > and it can be documented clearly, since user still can cancel the whole > > > > group with group leader's user_data. > > > > > > Which means it'd break the case REQ_F_INFLIGHT covers, and you need > > > to disallow linking REQ_F_INFLIGHT marked requests. > > > > Both io_match_linked() and io_match_task() only iterates over req's > > link chain, and only the group leader can appear in this link chain, > > which is exactly the usual handling. > > > > So care to explain it a bit what the real link issue is about sqe group? > > Because of ref deps when a task exits it has to cancel REQ_F_INFLIGHT > requests, and therefore they should be discoverable. The flag is only > for POLL_ADD requests polling io_uring fds, should be good enough if > you disallow such requests from grouping. The group leader is always marked as REQ_F_INFLIGHT and discoverable, and the cancel code path cancels group leader request which will guarantees that all group members are canceled, so I think this change isn't needed. But io_get_sequence() needs to take it into account, such as: @@ -1680,8 +1846,12 @@ static u32 io_get_sequence(struct io_kiocb *req) struct io_kiocb *cur; /* need original cached_sq_head, but it was increased for each req */ - io_for_each_link(cur, req) - seq--; + io_for_each_link(cur, req) { + if (req_is_group_lead(cur)) + seq -= atomic_read(&cur->grp_refs); + else + seq--; + } > > > > > > And what does it achieve? The infra has matured since early days, > > > > > it saves user-kernel transitions at best but not context switching > > > > > overhead, and not even that if you do wait(1) and happen to catch > > > > > middle CQEs. And it disables LAZY_WAKE, so CQ side batching with > > > > > timers and what not is effectively useless with links. 
> > > > > > > > Not only the context switch, it supports 1:N or N:M dependency which > > > > > > I completely missed, how N:M is supported? That starting to sound > > > terrifying. > > > > N:M is actually from Kevin's idea. > > > > sqe group can be made to be more flexible by: > > > > Inside the group, all SQEs are submitted in parallel, so there isn't any > > dependency among SQEs in one group. > > The 1st SQE is group leader, and the other SQEs are group member. The whole > > group share single IOSQE_IO_LINK and IOSQE_IO_DRAIN from group leader, and > > the two flags can't be set for group members. > > When the group is in one link chain, this group isn't submitted until > > the previous SQE or group is completed. And the following SQE or group > > can't be started if this group isn't completed. > > When IOSQE_IO_DRAIN is set for group leader, all requests in this group > > and previous requests submitted are drained. Given IOSQE_IO_DRAIN can > > be set for group leader only, we respect IO_DRAIN for SQE group by > > always completing group leader as the last on in the group. > > SQE group provides flexible way to support N:M dependency, such as: > > - group A is chained with group B together by IOSQE_IO_LINK > > - group A has N SQEs > > - group B has M SQEs > > then M SQEs in group B depend on N SQEs in group A. > > In other words, linking groups together with basically no extra rules. > Fwiw, sounds generic, but if there are complications with IOSQE_IO_DRAIN > that I don't immediately see, it'd be more reasonable to just disable it. The only change on IO_DRAIN is on io_get_sequence() for taking leader->grp_refs into account, and leader->grp_refs covers all requests in this group. And my local patchset passes all related sqe->flags combination(DRAIN, LINKING, ASYNC) on both single group or linked groups, meantime with extra change of sharing same FORCE_ASYNC for both leader and members. > > > > > is missing in io_uring, but also makes async application easier to write by > > > > saving extra context switches, which just adds extra intermediate states for > > > > application. > > > > > > You're still executing requests (i.e. ->issue) primarily from the > > > submitter task context, they would still fly back to the task and > > > wake it up. You may save something by completing all of them > > > together via that refcounting, but you might just as well try to > > > batch CQ, which is a more generic issue. It's not clear what > > > context switches you save then. > > > > Wrt. the above N:M example, one io_uring_enter() is enough, and > > it can't be done in single context switch without sqe group, please > > see the liburing test code: > > > > https://lore.kernel.org/io-uring/ZiHA+pN28hRdprhX@fedora/T/#ma755c500eab0b7dc8c1473448dd98f093097e066 > > > > > > > > As for simplicity, using the link example and considering error > > > handling, it only complicates it. In case of an error you need to > > > figure out a middle req failed, collect all failed CQEs linked to > > > it and automatically cancelled (unless SKIP_COMPLETE is used), and > > > then resubmit the failed. That's great your reads are idempotent > > > and presumably you don't have to resubmit half a link, but in the > > > grand picture of things it's rather one of use cases where a generic > > > feature can be used. > > > > SQE group doesn't change the current link implementation, and N:M > > dependency is built over IOSQE_IO_LINK actually. > > > > > > > > > > So, please, please! 
> > > > > instead of trying to invent a new uber scheme of request linking, which surely wouldn't step on the same problems over and over again, and would definitely be destined to overshadow all previous attempts and finally conquer the world, let's rather focus on minimising the damage from this patchset's zero copy if it's going to be taken.

> > > > One key problem for zero copy is the lifetime of the kernel buffer, which can't cross OPs; that is why sqe group is introduced, for aligning the kernel buffer lifetime with the group.

> > > Right, which is why I'm saying if we're leaving groups with zero copy, let's rather try to make them as simple and non-intrusive as possible, instead of creating an unsupportable overarching beast out of it, which would fail as a generic feature.

> > Then it degrades back to the original fused command, :-)

> I'm not arguing about restricting it to 1 request in a group apart from the master/leader/etc., if that's what you mean. The argument is rather to limit the overhead and abstraction leakage into hot paths.
>
> For example that N:M, all that sounds great on paper until it sees the harsh reality. And instead of linking groups together, you can perfectly well send one group at a time and queue the next group from userspace when the previous one completes. And that would save space in io_kiocb, maybe(?) a flag, maybe something else.

Yes, it can be done from userspace, at the cost of one extra io_uring_enter() for handling two dependent groups in userspace (a sketch of that userspace-only approach, using today's liburing API, follows at the end of this mail). io_uring's excellent performance is largely attributed to batch processing; if group linking is done at per-IO level this way, it could degrade performance a lot, and we know io_uring may perform worse at QD=1. Also, the introduced 12 bytes don't add an extra L1 cacheline, and aren't touched in code paths that don't use groups.

> Regardless of interoperability with links, I'd also prefer it to be folded into the link code structure and state machine, so it's not spread all over the submission/completion paths adding overhead for one narrow feature, including un-inlining the entire submission path.

I can work in this direction if the idea of sqe group is accepted, but the biggest blocker could be where to allocate the extra SQE_GROUP flag.

> E.g. there should be no separate hand-coded, duplicated SQE / request assembling like in io_init_req_group(). Instead it should be able to consume SQEs and requests it's given from submit_sqes(), and e.g. process grouping in io_submit_sqe() around the place requests are linked.

Yeah, IOSQE_IO_LINK-style handling could be needed for processing requests in a group.

Thanks,
Ming
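For reference, the userspace-only alternative discussed above (submit the first step, wait for its completion, then queue the dependent writes) looks roughly like the sketch below with today's liburing API. It is a minimal example for a one-source, two-destination copy; error and short-read handling are mostly omitted, and the second submit/wait round trip is exactly the extra io_uring_enter() mentioned above.

/* Minimal sketch: copy one source file to two destinations without SQE
 * groups, i.e. read first, wait, then submit the writes (two syscalls). */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[64 * 1024];
	int src, dst[2], got, i;

	if (argc < 4) {
		fprintf(stderr, "usage: %s <src> <dst1> <dst2>\n", argv[0]);
		return 1;
	}
	src = open(argv[1], O_RDONLY);
	dst[0] = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	dst[1] = open(argv[3], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	io_uring_queue_init(8, &ring, 0);

	/* 1st io_uring_enter(): submit the read and wait for it, because the
	 * writes can only be queued once the buffer content is known valid */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, src, buf, sizeof(buf), 0);
	io_uring_submit_and_wait(&ring, 1);
	io_uring_wait_cqe(&ring, &cqe);
	got = cqe->res;
	io_uring_cqe_seen(&ring, cqe);
	if (got <= 0)
		return 1;

	/* 2nd io_uring_enter(): now the writes can reuse buf; this extra
	 * round trip is what a single SQE group submission would avoid */
	for (i = 0; i < 2; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, dst[i], buf, got, 0);
	}
	io_uring_submit_and_wait(&ring, 2);
	for (i = 0; i < 2; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}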
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 67347e5d06ec..7ce4a2d4a8b8 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -444,6 +444,7 @@ enum { REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT, REQ_F_CQE_SKIP_BIT = IOSQE_CQE_SKIP_SUCCESS_BIT, REQ_F_SQE_EXT_FLAGS_BIT = IOSQE_HAS_EXT_FLAGS_BIT, + REQ_F_SQE_GROUP_BIT = 8 + IOSQE_EXT_SQE_GROUP_BIT, /* first 2 bytes are taken by user flags, shift it to not overlap */ REQ_F_FAIL_BIT = 16, @@ -474,6 +475,7 @@ enum { REQ_F_CAN_POLL_BIT, REQ_F_BL_EMPTY_BIT, REQ_F_BL_NO_RECYCLE_BIT, + REQ_F_SQE_GROUP_LEAD_BIT, /* not a real bit, just to check we're not overflowing the space */ __REQ_F_LAST_BIT, @@ -497,6 +499,8 @@ enum { REQ_F_BUFFER_SELECT = IO_REQ_FLAG(REQ_F_BUFFER_SELECT_BIT), /* IOSQE_CQE_SKIP_SUCCESS */ REQ_F_CQE_SKIP = IO_REQ_FLAG(REQ_F_CQE_SKIP_BIT), + /* IOSQE_EXT_SQE_GROUP */ + REQ_F_SQE_GROUP = IO_REQ_FLAG(REQ_F_SQE_GROUP_BIT), /* fail rest of links */ REQ_F_FAIL = IO_REQ_FLAG(REQ_F_FAIL_BIT), @@ -552,6 +556,8 @@ enum { REQ_F_BL_EMPTY = IO_REQ_FLAG(REQ_F_BL_EMPTY_BIT), /* don't recycle provided buffers for this request */ REQ_F_BL_NO_RECYCLE = IO_REQ_FLAG(REQ_F_BL_NO_RECYCLE_BIT), + /* sqe group lead */ + REQ_F_SQE_GROUP_LEAD = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEAD_BIT), }; typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts); @@ -665,6 +671,10 @@ struct io_kiocb { u64 extra1; u64 extra2; } big_cqe; + + /* all SQE group members linked here for group lead */ + struct io_kiocb *grp_link; + atomic_t grp_refs; }; struct io_overflow_cqe { diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4847d7cf1ac9..c0d34f2a2c17 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -128,6 +128,8 @@ enum io_uring_sqe_flags_bit { IOSQE_BUFFER_SELECT_BIT, IOSQE_CQE_SKIP_SUCCESS_BIT, IOSQE_HAS_EXT_FLAGS_BIT, + + IOSQE_EXT_SQE_GROUP_BIT = 0, }; /* @@ -152,6 +154,8 @@ enum io_uring_sqe_flags_bit { * sqe->uring_cmd_flags for IORING_URING_CMD. 
*/ #define IOSQE_HAS_EXT_FLAGS (1U << IOSQE_HAS_EXT_FLAGS_BIT) +/* defines sqe group */ +#define IOSQE_EXT_SQE_GROUP (1U << IOSQE_EXT_SQE_GROUP_BIT) /* * io_uring_setup() flags diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 4969d21ea2f8..0f41b26723a8 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -147,6 +147,10 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx, bool cancel_all); static void io_queue_sqe(struct io_kiocb *req); +static bool io_get_sqe(struct io_ring_ctx *ctx, + const struct io_uring_sqe **sqe); +static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct io_uring_sqe *sqe); struct kmem_cache *req_cachep; static struct workqueue_struct *iou_wq __ro_after_init; @@ -925,10 +929,189 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags) return posted; } +static void __io_req_failed(struct io_kiocb *req, s32 res, bool skip_cqe) +{ + const struct io_cold_def *def = &io_cold_defs[req->opcode]; + + lockdep_assert_held(&req->ctx->uring_lock); + + if (skip_cqe) + req->flags |= REQ_F_FAIL | REQ_F_CQE_SKIP; + else + req_set_fail(req); + io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED)); + if (def->fail) + def->fail(req); +} + +void io_req_defer_failed(struct io_kiocb *req, s32 res) + __must_hold(&ctx->uring_lock) +{ + __io_req_failed(req, res, false); + io_req_complete_defer(req); +} + +static void io_req_defer_failed_sliently(struct io_kiocb *req, s32 res) + __must_hold(&ctx->uring_lock) +{ + __io_req_failed(req, res, true); + io_req_complete_defer(req); +} + +/* + * Called after member req is completed, and return the lead request for + * caller to fill cqe & free it really + */ +static inline struct io_kiocb *io_complete_group_member(struct io_kiocb *req) +{ + struct io_kiocb *lead = req->grp_link; + + req->grp_link = NULL; + if (lead && atomic_dec_and_test(&lead->grp_refs)) { + req->grp_link = NULL; + lead->flags &= ~REQ_F_SQE_GROUP_LEAD; + return lead; + } + + return NULL; +} + +/* + * Called after lead req is completed and before posting cqe and freeing, + * for issuing member requests + */ +void io_complete_group_lead(struct io_kiocb *req, unsigned issue_flags) +{ + struct io_kiocb *member = req->grp_link; + + if (!member) + return; + + while (member) { + struct io_kiocb *next = member->grp_link; + + member->grp_link = req; + if (unlikely(req->flags & REQ_F_FAIL)) { + /* + * Now group lead is failed, so simply fail members + * with -EIO, and the application can figure out + * the reason from lead's cqe->res + */ + __io_req_failed(member, -EIO, false); + + if (issue_flags & IO_URING_F_COMPLETE_DEFER) + io_req_complete_defer(member); + else { + member->io_task_work.func = io_req_task_complete; + io_req_task_work_add(member); + } + } else { + trace_io_uring_submit_req(member); + if ((issue_flags & IO_URING_F_COMPLETE_DEFER) && + !(member->flags & REQ_F_FORCE_ASYNC)) + io_queue_sqe(member); + else + io_req_task_queue(member); + } + member = next; + } + req->grp_link = NULL; +} + +static bool sqe_is_group_member(const struct io_uring_sqe *sqe) +{ + return (READ_ONCE(sqe->flags) & IOSQE_HAS_EXT_FLAGS) && + (READ_ONCE(sqe->ext_flags) & IOSQE_EXT_SQE_GROUP); +} + +/* + * Initialize the whole SQE group. + * + * Walk every member in this group even though failure happens. If the lead + * request is failed, CQE is only posted for lead, otherwise, CQE is posted + * for every member request. 
Member requests aren't issued until the lead + * is completed, and the lead request won't be freed until all member + * requests are completed. + * + * The whole group shares the link flag in group lead, and other members + * aren't allowed to set any LINK flag. And only the lead request may + * appear in the submission link list. + */ +static int io_init_req_group(struct io_ring_ctx *ctx, struct io_kiocb *lead, + int *nr, int lead_ret) + __must_hold(&ctx->uring_lock) +{ + bool more = true; + int cnt = 0; + + lead->grp_link = NULL; + do { + const struct io_uring_sqe *sqe; + struct io_kiocb *req = NULL; + int ret = -ENOMEM; + + io_alloc_req(ctx, &req); + + if (unlikely(!io_get_sqe(ctx, &sqe))) { + if (req) + io_req_add_to_cache(req, ctx); + break; + } + + /* one group ends with !IOSQE_EXT_SQE_GROUP */ + if (!sqe_is_group_member(sqe)) + more = false; + + if (req) { + ret = io_init_req(ctx, req, sqe); + /* + * Both IO_LINK and IO_DRAIN aren't allowed for + * group member, and the boundary has to be in + * the lead sqe, so the whole group shares + * same IO_LINK and IO_DRAIN. + */ + if (!ret && (req->flags & (IO_REQ_LINK_FLAGS | + REQ_F_IO_DRAIN))) + ret = -EINVAL; + if (!more) + req->flags |= REQ_F_SQE_GROUP; + if (unlikely(ret)) { + /* + * The lead will be failed, so don't post + * CQE for any member + */ + io_req_defer_failed_sliently(req, ret); + } else { + req->grp_link = lead->grp_link; + lead->grp_link = req; + } + } + cnt += 1; + if (ret) + lead_ret = ret; + } while (more); + + /* Mark lead if we get members, otherwise degrade it to normal req */ + if (cnt > 0) { + lead->flags |= REQ_F_SQE_GROUP_LEAD; + atomic_set(&lead->grp_refs, cnt); + *nr += cnt; + } else { + lead->flags &= ~REQ_F_SQE_GROUP; + } + + return lead_ret; +} + static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags) { struct io_ring_ctx *ctx = req->ctx; + if (req_is_group_lead(req)) { + io_complete_group_lead(req, issue_flags); + return; + } + /* * All execution paths but io-wq use the deferred completions by * passing IO_URING_F_COMPLETE_DEFER and thus should not end up here. @@ -960,20 +1143,6 @@ static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags) req_ref_put(req); } -void io_req_defer_failed(struct io_kiocb *req, s32 res) - __must_hold(&ctx->uring_lock) -{ - const struct io_cold_def *def = &io_cold_defs[req->opcode]; - - lockdep_assert_held(&req->ctx->uring_lock); - - req_set_fail(req); - io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED)); - if (def->fail) - def->fail(req); - io_req_complete_defer(req); -} - /* * Don't initialise the fields below on every allocation, but do that in * advance and keep them valid across allocations. 
@@ -1459,7 +1628,8 @@ static void io_free_batch_list(struct io_ring_ctx *ctx, } static inline void io_fill_cqe_lists(struct io_ring_ctx *ctx, - struct io_wq_work_list *list) + struct io_wq_work_list *list, + struct io_wq_work_list *grp) { struct io_wq_work_node *node; @@ -1477,6 +1647,13 @@ static inline void io_fill_cqe_lists(struct io_ring_ctx *ctx, io_req_cqe_overflow(req); } } + + if (grp && req_is_group_member(req)) { + struct io_kiocb *lead = io_complete_group_member(req); + + if (lead) + wq_list_add_head(&lead->comp_list, grp); + } } } @@ -1484,14 +1661,19 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx) __must_hold(&ctx->uring_lock) { struct io_submit_state *state = &ctx->submit_state; + struct io_wq_work_list list = {NULL, NULL}; __io_cq_lock(ctx); - io_fill_cqe_lists(ctx, &state->compl_reqs); + io_fill_cqe_lists(ctx, &state->compl_reqs, &list); + if (!wq_list_empty(&list)) + io_fill_cqe_lists(ctx, &list, NULL); __io_cq_unlock_post(ctx); - if (!wq_list_empty(&ctx->submit_state.compl_reqs)) { + if (!wq_list_empty(&state->compl_reqs)) { io_free_batch_list(ctx, state->compl_reqs.first); INIT_WQ_LIST(&state->compl_reqs); + if (!wq_list_empty(&list)) + io_free_batch_list(ctx, list.first); } ctx->submit_state.cq_flush = false; } @@ -2212,6 +2394,8 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, *nr = 1; ret = io_init_req(ctx, req, sqe); + if (req->flags & REQ_F_SQE_GROUP) + ret = io_init_req_group(ctx, req, nr, ret); if (unlikely(ret)) return io_submit_fail_init(sqe, req, ret); diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 1eb65324792a..99eeb4eee219 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -104,6 +104,7 @@ bool __io_alloc_req_refill(struct io_ring_ctx *ctx); bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task, bool cancel_all); +void io_complete_group_lead(struct io_kiocb *req, unsigned int issue_flags); enum { IO_EVENTFD_OP_SIGNAL_BIT, @@ -113,6 +114,16 @@ enum { void io_eventfd_ops(struct rcu_head *rcu); void io_activate_pollwq(struct io_ring_ctx *ctx); +static inline bool req_is_group_lead(struct io_kiocb *req) +{ + return req->flags & REQ_F_SQE_GROUP_LEAD; +} + +static inline bool req_is_group_member(struct io_kiocb *req) +{ + return !req_is_group_lead(req) && (req->flags & REQ_F_SQE_GROUP); +} + static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx) { #if defined(CONFIG_PROVE_LOCKING) @@ -355,7 +366,11 @@ static inline void io_req_complete_defer(struct io_kiocb *req) lockdep_assert_held(&req->ctx->uring_lock); - wq_list_add_tail(&req->comp_list, &state->compl_reqs); + if (unlikely(req_is_group_lead(req))) + io_complete_group_lead(req, IO_URING_F_COMPLETE_DEFER | + IO_URING_F_NONBLOCK); + else + wq_list_add_tail(&req->comp_list, &state->compl_reqs); } static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
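For illustration, here is a rough sketch of how an application might drive the multiple-backup case from the commit message on top of this patch. It assumes the extended-flags support added earlier in this series (IOSQE_HAS_EXT_FLAGS and the sqe->ext_flags field) plus the IOSQE_EXT_SQE_GROUP flag introduced here; upstream liburing has no helpers for these yet, so the extended flags are set on the raw SQE, and the helper function name below is made up for the example.

/* Sketch only: needs the uapi additions from this series. The leader reads
 * into buf, the members write buf out, and everything is handed to the
 * kernel with a single io_uring_enter(). 'ring' is assumed initialised and
 * the fds opened by the caller. */
#include <liburing.h>

static int submit_backup_group(struct io_uring *ring, int src_fd,
			       const int *dst_fd, int nr_dst,
			       void *buf, unsigned int len)
{
	struct io_uring_sqe *sqe;
	int i;

	/* group leader: read the source into buf; members are only issued
	 * once this read completes */
	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;
	io_uring_prep_read(sqe, src_fd, buf, len, 0);
	sqe->flags |= IOSQE_HAS_EXT_FLAGS;
	sqe->ext_flags = IOSQE_EXT_SQE_GROUP;	/* more group SQEs follow */

	/* group members: write the same buffer to every destination; the
	 * group ends at the first SQE without IOSQE_EXT_SQE_GROUP, so the
	 * last member leaves the flag clear */
	for (i = 0; i < nr_dst; i++) {
		sqe = io_uring_get_sqe(ring);
		if (!sqe)
			return -1;
		io_uring_prep_write(sqe, dst_fd[i], buf, len, 0);
		if (i + 1 < nr_dst) {
			sqe->flags |= IOSQE_HAS_EXT_FLAGS;
			sqe->ext_flags = IOSQE_EXT_SQE_GROUP;
		}
	}

	/* leader and members go to the kernel in one syscall */
	return io_uring_submit(ring);
}

For simplicity the writes cover the full buffer; a real application would typically size them to what it expects the read to return. Setting IOSQE_IO_LINK on the leader (and only the leader) is what would chain the whole group to a following SQE or group, which is the N:M case discussed earlier in the thread; the members themselves never carry LINK or DRAIN.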
SQE group is defined as one chain of SQEs starting with the first sqe that has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that doesn't have it set, and it is similar with chain of linked sqes. The 1st SQE is group leader, and the other SQEs are group member. The group leader is always freed after all members are completed. Group members aren't submitted until the group leader is completed, and there isn't any dependency among group members, and IOSQE_IO_LINK can't be set for group members, same with IOSQE_IO_DRAIN. Typically the group leader provides or makes resource, and the other members consume the resource, such as scenario of multiple backup, the 1st SQE is to read data from source file into fixed buffer, the other SQEs write data from the same buffer into other destination files. SQE group provides very efficient way to complete this task: 1) fs write SQEs and fs read SQE can be submitted in single syscall, no need to submit fs read SQE first, and wait until read SQE is completed, 2) no need to link all write SQEs together, then write SQEs can be submitted to files concurrently. Meantime application is simplified a lot in this way. Another use case is to for supporting generic device zero copy: - the lead SQE is for providing device buffer, which is owned by device or kernel, can't be cross userspace, otherwise easy to cause leak for devil application or panic - member SQEs reads or writes concurrently against the buffer provided by lead SQE Signed-off-by: Ming Lei <ming.lei@redhat.com> --- include/linux/io_uring_types.h | 10 ++ include/uapi/linux/io_uring.h | 4 + io_uring/io_uring.c | 218 ++++++++++++++++++++++++++++++--- io_uring/io_uring.h | 17 ++- 4 files changed, 231 insertions(+), 18 deletions(-)