Message ID: cover.1663363798.git.metze@samba.org (mailing list archive)
Series: IORING_OP_SEND_ZC improvements
On 9/16/22 22:36, Stefan Metzmacher wrote:
> Hi Pavel, hi Jens,
>
> I did some initial testing with IORING_OP_SEND_ZC.
> While reading the code I think I found a race that
> can lead to IORING_CQE_F_MORE being missing even if
> the net layer got references.

Hey Stefan,

Did you see some kind of buggy behaviour in userspace?

If the network sends anything it should return how many bytes
it queued for sending, otherwise there would be duplicated
packets / data on the other endpoint in userspace, and I
don't think any driver / lower layer would keep memory
after returning an error.

In any case, I was looking at a bit different problem, but
it should look much cleaner using the same approach, see
branch [1], and patch [3] for sendzc in particular.

[1] https://github.com/isilence/linux.git partial-fail
[2] https://github.com/isilence/linux/tree/io_uring/partial-fail
[3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894

> While there I added some code to allow userspace to
> know how effective the IORING_OP_SEND_ZC attempt was,
> in order to avoid it if it's not useful (e.g. on a long-living tcp
> connection).
>
> This change requires a change to the existing test, see:
> https://github.com/metze-samba/liburing/tree/test-send-zerocopy
>
> Stefan Metzmacher (5):
>   io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
>   io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res
>   io_uring/core: keep req->cqe.flags on generic errors
>   io_uring/net: let io_sendzc set IORING_CQE_F_MORE before sock_sendmsg()
>   io_uring/notif: let userspace know how effective the zero copy usage was
>
>  include/linux/io_uring_types.h |  6 +++---
>  io_uring/io_uring.c            | 18 +++++++++++++-----
>  io_uring/net.c                 | 19 +++++++++++++------
>  io_uring/notif.c               | 18 ++++++++++++++++++
>  io_uring/opdef.c               |  2 +-
>  net/ipv4/ip_output.c           |  3 ++-
>  net/ipv4/tcp.c                 |  2 ++
>  net/ipv6/ip6_output.c          |  3 ++-
>  8 files changed, 54 insertions(+), 17 deletions(-)
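[For readers following along: the two-CQE contract under discussion can be sketched from userspace's side. The snippet below is a minimal illustrative model, not liburing or kernel code; the flag values mirror the uapi io_uring.h definitions, while struct cqe and handle_sendzc_cqe() are invented stand-ins showing how a consumer tells the immediate result CQE (possibly carrying IORING_CQE_F_MORE) apart from the later IORING_CQE_F_NOTIF notification.]

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in include/uapi/linux/io_uring.h */
#define IORING_CQE_F_MORE  (1U << 1)  /* another CQE (the notif) will follow */
#define IORING_CQE_F_NOTIF (1U << 3)  /* this CQE is the zc notification */

/* Simplified stand-in for struct io_uring_cqe */
struct cqe {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
};

/*
 * Hypothetical consumer helper: returns 1 when the request is fully
 * done (no further CQE expected for this user_data, buffers reusable),
 * 0 when a notif CQE is still outstanding.
 */
static int handle_sendzc_cqe(const struct cqe *cqe, int *pending_notif)
{
	if (cqe->flags & IORING_CQE_F_NOTIF) {
		/* second CQE: the kernel dropped its references */
		*pending_notif = 0;
		return 1;
	}
	/* first CQE: res is the sendmsg()-style result */
	if (cqe->flags & IORING_CQE_F_MORE) {
		*pending_notif = 1;	/* must wait for the notif CQE */
		return 0;
	}
	/* no MORE: no references were taken, request is done */
	return 1;
}
```

The race Stefan describes matters exactly here: if IORING_CQE_F_MORE were missing although the net layer took references, a consumer like this would reuse the buffer too early.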
On 17.09.22 11:16, Pavel Begunkov wrote:
> On 9/16/22 22:36, Stefan Metzmacher wrote:
>> Hi Pavel, hi Jens,
>>
>> I did some initial testing with IORING_OP_SEND_ZC.
>> While reading the code I think I found a race that
>> can lead to IORING_CQE_F_MORE being missing even if
>> the net layer got references.
>
> Hey Stefan,
>
> Did you see some kind of buggy behaviour in userspace?

No, I was just reading the code and found it a bit confusing,
and couldn't prove that we don't have a problem with losing
a notif cqe.

> If the network sends anything it should return how many bytes
> it queued for sending, otherwise there would be duplicated
> packets / data on the other endpoint in userspace, and I
> don't think any driver / lower layer would keep memory
> after returning an error.

As I'm also working on a socket driver for smbdirect,
I already thought about how I could hook into
IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
a loop sending individual fragments, which have a reference,
but if I find a connection drop after the first one, I'd
return ECONNRESET or EPIPE in order to get faster recovery
instead of announcing a short write to the caller.

If we took my 5/5 we could also have a different
strategy to decide if MORE/NOTIF is needed:
if notif->cqe.res is still 0 and io_notif_flush drops
the last reference we could go without MORE/NOTIF at all.
In all other cases we'd either set MORE/NOTIF at the end
of io_sendzc or in the fail hook.

> In any case, I was looking at a bit different problem, but
> it should look much cleaner using the same approach, see
> branch [1], and patch [3] for sendzc in particular.
> [1] https://github.com/isilence/linux.git partial-fail
> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894

    const struct io_op_def *def = &io_op_defs[req->opcode];

    req_set_fail(req);
    io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
    if (def->fail)
        def->fail(req);
    io_req_complete_post(req);

This will lose req->cqe.flags, but the fail hook in general
looks like a good idea.

And don't we care about the other failure cases where
req->cqe.flags gets overwritten?

metze
On Fri, 16 Sep 2022 23:36:24 +0200, Stefan Metzmacher wrote:
> I did some initial testing with IORING_OP_SEND_ZC.
> While reading the code I think I found a race that
> can lead to IORING_CQE_F_MORE being missing even if
> the net layer got references.
>
> While there I added some code to allow userspace to
> know how effective the IORING_OP_SEND_ZC attempt was,
> in order to avoid it if it's not useful (e.g. on a long-living tcp
> connection).
>
> [...]

Applied, thanks!

[1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
      commit: 9bd3f728223ebcfef8e9d087bdd142f0e388215d

Best regards,
On 9/17/22 11:44, Stefan Metzmacher wrote:
> On 17.09.22 11:16, Pavel Begunkov wrote:
>> On 9/16/22 22:36, Stefan Metzmacher wrote:
>>> Hi Pavel, hi Jens,
>>>
>>> I did some initial testing with IORING_OP_SEND_ZC.
>>> While reading the code I think I found a race that
>>> can lead to IORING_CQE_F_MORE being missing even if
>>> the net layer got references.
>>
>> Hey Stefan,
>>
>> Did you see some kind of buggy behaviour in userspace?

Apologies for the delay,

> No, I was just reading the code and found it a bit confusing,
> and couldn't prove that we don't have a problem with losing
> a notif cqe.
>
>> If the network sends anything it should return how many bytes
>> it queued for sending, otherwise there would be duplicated
>> packets / data on the other endpoint in userspace, and I
>> don't think any driver / lower layer would keep memory
>> after returning an error.
>
> As I'm also working on a socket driver for smbdirect,
> I already thought about how I could hook into
> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
> a loop sending individual fragments, which have a reference,
> but if I find a connection drop after the first one, I'd
> return ECONNRESET or EPIPE in order to get faster recovery
> instead of announcing a short write to the caller.

It doesn't sound right to me, but I don't know samba well enough
to really have an opinion. In any case, I see how it may be
more robust if we always try to push a notification cqe.
Will you send a patch?

> If we took my 5/5 we could also have a different
> strategy to decide if MORE/NOTIF is needed:
> if notif->cqe.res is still 0 and io_notif_flush drops
> the last reference we could go without MORE/NOTIF at all.
> In all other cases we'd either set MORE/NOTIF at the end
> of io_sendzc or in the fail hook.

I had a similar optimisation, i.e.
when io_notif_flush() in the submission path is dropping the
last ref, but killed it as it was completely useless; I haven't
hit this path even once, even with UDP, not to mention TCP.

>> In any case, I was looking at a bit different problem, but
>> it should look much cleaner using the same approach, see
>> branch [1], and patch [3] for sendzc in particular.
>>
>> [1] https://github.com/isilence/linux.git partial-fail
>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
>
>     const struct io_op_def *def = &io_op_defs[req->opcode];
>
>     req_set_fail(req);
>     io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>     if (def->fail)
>         def->fail(req);
>     io_req_complete_post(req);
>
> This will lose req->cqe.flags, but the fail hook in general
> looks like a good idea.

I just don't like those sporadic changes all across core io_uring
code, also adding some overhead.

> And don't we care about the other failure cases where
> req->cqe.flags gets overwritten?

We don't usually carry them around ->issue handler boundaries,
e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);

IORING_CQE_F_BUFFER is a bit trickier, but there is special
handling for this one and it wouldn't fit the "set cflags in
advance" logic anyway.

iow, the ->fail callback sounds good enough for now, we'll change
it later if needed.
Hi Pavel,

>>> If the network sends anything it should return how many bytes
>>> it queued for sending, otherwise there would be duplicated
>>> packets / data on the other endpoint in userspace, and I
>>> don't think any driver / lower layer would keep memory
>>> after returning an error.
>>
>> As I'm also working on a socket driver for smbdirect,
>> I already thought about how I could hook into
>> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
>> a loop sending individual fragments, which have a reference,
>> but if I find a connection drop after the first one, I'd
>> return ECONNRESET or EPIPE in order to get faster recovery
>> instead of announcing a short write to the caller.
>
> It doesn't sound right to me, but I don't know samba well enough
> to really have an opinion. In any case, I see how it may be
> more robust if we always try to push a notification cqe.
> Will you send a patch?

You mean IORING_OP_SEND_ZC should always issue a NOTIF cqe
once it passed the io_sendzc_prep stage?

>> If we took my 5/5 we could also have a different
>> strategy to decide if MORE/NOTIF is needed:
>> if notif->cqe.res is still 0 and io_notif_flush drops
>> the last reference we could go without MORE/NOTIF at all.
>> In all other cases we'd either set MORE/NOTIF at the end
>> of io_sendzc or in the fail hook.
>
> I had a similar optimisation, i.e. when io_notif_flush() in
> the submission path is dropping the last ref, but killed it
> as it was completely useless; I haven't hit this path even
> once, even with UDP, not to mention TCP.

If I remember correctly I hit it all the time on loopback,
but I'd have to recheck.

>>> In any case, I was looking at a bit different problem, but
>>> it should look much cleaner using the same approach, see
>>> branch [1], and patch [3] for sendzc in particular.
>>> [1] https://github.com/isilence/linux.git partial-fail
>>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
>>
>>     const struct io_op_def *def = &io_op_defs[req->opcode];
>>
>>     req_set_fail(req);
>>     io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>>     if (def->fail)
>>         def->fail(req);
>>     io_req_complete_post(req);
>>
>> This will lose req->cqe.flags, but the fail hook in general
>> looks like a good idea.
>
> I just don't like those sporadic changes all across core io_uring
> code, also adding some overhead.
>
>> And don't we care about the other failure cases where
>> req->cqe.flags gets overwritten?
>
> We don't usually carry them around ->issue handler boundaries,
> e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);
>
> IORING_CQE_F_BUFFER is a bit trickier, but there is special
> handling for this one and it wouldn't fit the "set cflags in
> advance" logic anyway.
>
> iow, the ->fail callback sounds good enough for now, we'll change
> it later if needed.

The fail hook should re-add the MORE flag?

So I'll try to do the following changes:

1. take your ->fail() patches

2. once io_sendzc_prep() is over, always trigger MORE/NOTIF cqes
   (But the documentation should still make it clear that
   userspace has to cope with just a single cqe (without MORE)
   for both success and failure, so that we can improve things later)

3. Can I change the cqe.res of the NOTIF cqe to be 0xffffffff?
   That would indicate to userspace that we don't give any
   information on whether zero copy was actually used or not.
   This would prevent someone from relying on cqe.res == 0 and
   we can improve it by providing more useful values in future.

Are you ok with that plan for 6.0?

metze
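[Point 2 of this plan affects how completions are counted. As a hedged sketch (the helper and struct below are hypothetical; flag values are as in the uapi io_uring.h), a consumer that works with both the single-CQE outcome and the MORE+NOTIF pair could account for outstanding CQEs like this:]

```c
#include <assert.h>
#include <stdint.h>

#define IORING_CQE_F_MORE  (1U << 1)
#define IORING_CQE_F_NOTIF (1U << 3)

/* Simplified stand-in for struct io_uring_cqe */
struct cqe {
	int32_t  res;
	uint32_t flags;
};

/*
 * Count how many CQEs a single IORING_OP_SEND_ZC may still produce
 * after seeing the first n of them. Handles both the one-CQE case
 * (no F_MORE on the result CQE) and the two-CQE case (result CQE
 * with F_MORE, then the F_NOTIF notification CQE).
 */
static int sendzc_cqes_outstanding(const struct cqe *cqes, int n)
{
	int outstanding = 1;	/* the initial result CQE */

	for (int i = 0; i < n; i++) {
		outstanding--;	/* this CQE arrived */
		if (cqes[i].flags & IORING_CQE_F_MORE)
			outstanding++;	/* a notif CQE will follow */
	}
	return outstanding;
}
```

Under the documented contract, userspace must not assume the two-CQE form; only F_MORE on a delivered CQE promises another one.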
On 9/21/22 13:18, Stefan Metzmacher wrote:
>
> Hi Pavel,
>
>>>> If the network sends anything it should return how many bytes
>>>> it queued for sending, otherwise there would be duplicated
>>>> packets / data on the other endpoint in userspace, and I
>>>> don't think any driver / lower layer would keep memory
>>>> after returning an error.
>>>
>>> As I'm also working on a socket driver for smbdirect,
>>> I already thought about how I could hook into
>>> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
>>> a loop sending individual fragments, which have a reference,
>>> but if I find a connection drop after the first one, I'd
>>> return ECONNRESET or EPIPE in order to get faster recovery
>>> instead of announcing a short write to the caller.
>>
>> It doesn't sound right to me, but I don't know samba well enough
>> to really have an opinion. In any case, I see how it may be
>> more robust if we always try to push a notification cqe.
>> Will you send a patch?
>
> You mean IORING_OP_SEND_ZC should always issue a NOTIF cqe
> once it passed the io_sendzc_prep stage?

Currently we add F_MORE iff cqe->res >= 0, but I want to decouple
them so users don't try to infer one from the other. In theory,
it's already a subset of this, but it sounds like a good idea to
always emit notifications (when we can) to stop users making
assumptions about it. It should also be cleaner.

>>> If we took my 5/5 we could also have a different
>>> strategy to decide if MORE/NOTIF is needed:
>>> if notif->cqe.res is still 0 and io_notif_flush drops
>>> the last reference we could go without MORE/NOTIF at all.
>>> In all other cases we'd either set MORE/NOTIF at the end
>>> of io_sendzc or in the fail hook.
>>
>> I had a similar optimisation, i.e. when io_notif_flush() in
>> the submission path is dropping the last ref, but killed it
>> as it was completely useless; I haven't hit this path even
>> once, even with UDP, not to mention TCP.
> If I remember correctly I hit it all the time on loopback,
> but I'd have to recheck.

Yeah, I meant real network in particular. I've seen it with
virtual interfaces, but for instance loopback is not interesting
as it doesn't support zerocopy in the first place. You were
probably testing with a modified skb_orphan_frags_rx().

>>>> In any case, I was looking at a bit different problem, but
>>>> it should look much cleaner using the same approach, see
>>>> branch [1], and patch [3] for sendzc in particular.
>>>>
>>>> [1] https://github.com/isilence/linux.git partial-fail
>>>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>>>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
>>>
>>>     const struct io_op_def *def = &io_op_defs[req->opcode];
>>>
>>>     req_set_fail(req);
>>>     io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>>>     if (def->fail)
>>>         def->fail(req);
>>>     io_req_complete_post(req);
>>>
>>> This will lose req->cqe.flags, but the fail hook in general
>>> looks like a good idea.
>>
>> I just don't like those sporadic changes all across core io_uring
>> code, also adding some overhead.
>>
>>> And don't we care about the other failure cases where
>>> req->cqe.flags gets overwritten?
>>
>> We don't usually carry them around ->issue handler boundaries,
>> e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);
>>
>> IORING_CQE_F_BUFFER is a bit trickier, but there is special
>> handling for this one and it wouldn't fit the "set cflags in
>> advance" logic anyway.
>>
>> iow, the ->fail callback sounds good enough for now, we'll change
>> it later if needed.
>
> The fail hook should re-add the MORE flag?

Never set CQE_SKIP for notifications and add the MORE flag in the
fail hook; sounds good.

> So I'll try to do the following changes:
>
> 1. take your ->fail() patches
>
> 2.
> once io_sendzc_prep() is over, always trigger MORE/NOTIF cqes
>    (But the documentation should still make it clear that
>    userspace has to cope with just a single cqe (without MORE)
>    for both success and failure, so that we can improve things later)

I've sent a liburing man patch, should be good enough.

> 3. Can I change the cqe.res of the NOTIF cqe to be 0xffffffff?
>    That would indicate to userspace that we don't give any
>    information on whether zero copy was actually used or not.
>    This would prevent someone from relying on cqe.res == 0 and
>    we can improve it by providing more useful values in future.

I don't get the difference, someone just as easily may rely on
0xf..ff. What does it give us? 0 usually suits better as a default.

> Are you ok with that plan for 6.0?

It's late; 1-2 are better off as 6.1 + stable. It doesn't break
uapi, so it's fine. It's not even strictly necessary to backport
it (but still a good idea to do it).