
[for-6.1] io_uring/net: don't skip notifs for failed requests

Message ID 9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.com (mailing list archive)
State New
Series: [for-6.1] io_uring/net: don't skip notifs for failed requests

Commit Message

Pavel Begunkov Sept. 27, 2022, 11:51 p.m. UTC
We currently only add a notification CQE when the send succeeded, i.e.
cqe.res >= 0. However, it'd be more robust to do buffer notifications
for failed requests as well, in case drivers decide to do something funky.

Always return a buffer notification after initial prep, don't hide it.
This behaviour is better aligned with the documentation, and the patch
also helps userspace respect it.

Cc: stable@vger.kernel.org # 6.0
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

We need it as soon as possible, and it's likely almost time
for the 6.1 rcs.

 io_uring/net.c | 29 ++++++++---------------------
 1 file changed, 8 insertions(+), 21 deletions(-)
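
For context, here is a minimal userspace sketch of the completion
pairing this patch guarantees (untested; it assumes liburing's
io_uring_prep_send_zc() helper, and the function name send_zc_and_reap
is made up for illustration). The point: the first CQE now always
carries IORING_CQE_F_MORE, so the notification CQE must be reaped even
when the send failed.

#include <errno.h>
#include <stdbool.h>
#include <liburing.h>

/* Hypothetical helper: submit one zero-copy send, reap both CQEs. */
static int send_zc_and_reap(struct io_uring *ring, int sockfd,
			    const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	bool more;
	int ret, sent;

	if (!sqe)
		return -EBUSY;
	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	io_uring_submit(ring);

	/* First CQE: the send result. With this patch IORING_CQE_F_MORE
	 * is set even when res < 0, so failed sends need no special
	 * casing. */
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret)
		return ret;
	sent = cqe->res;
	more = cqe->flags & IORING_CQE_F_MORE;
	io_uring_cqe_seen(ring, cqe);

	if (more) {
		/* Second CQE: the notification (IORING_CQE_F_NOTIF);
		 * only after it arrives may the buffer be reused. */
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret)
			return ret;
		io_uring_cqe_seen(ring, cqe);
	}
	return sent;
}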

Comments

Jens Axboe Sept. 28, 2022, 1:53 p.m. UTC | #1
On Wed, 28 Sep 2022 00:51:49 +0100, Pavel Begunkov wrote:
> We currently only add a notification CQE when the send succeeded, i.e.
> cqe.res >= 0. However, it'd be more robust to do buffer notifications
> for failed requests as well, in case drivers decide to do something funky.
> 
> Always return a buffer notification after initial prep, don't hide it.
> This behaviour is better aligned with the documentation, and the
> patch also helps userspace respect it.
> 
> [...]

Applied, thanks!

[1/1] io_uring/net: don't skip notifs for failed requests
      commit: 6ae91ac9a6aa7d6005c3c6d0f4d263fbab9f377f

Best regards,
Jens Axboe
Stefan Metzmacher Sept. 28, 2022, 3:23 p.m. UTC | #2
Hi Pavel,

> We currently only add a notification CQE when the send succeeded, i.e.
> cqe.res >= 0. However, it'd be more robust to do buffer notifications
> for failed requests as well, in case drivers decide to do something funky.
> 
> Always return a buffer notification after initial prep, don't hide it.
> This behaviour is better aligned with the documentation, and the
> patch also helps userspace respect it.

Just as reference, this was the version I was testing with:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=7ffb896cdb8ccd55065f7ffae9fb8050e39211c7

>   void io_sendrecv_fail(struct io_kiocb *req)
>   {
>   	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
> -	int res = req->cqe.res;
>   
>   	if (req->flags & REQ_F_PARTIAL_IO)
> -		res = sr->done_io;
> +		req->cqe.res = sr->done_io;
> +
>   	if ((req->flags & REQ_F_NEED_CLEANUP) &&
> -	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) {
> -		/* preserve notification for partial I/O */
> -		if (res < 0)
> -			sr->notif->flags |= REQ_F_CQE_SKIP;
> -		io_notif_flush(sr->notif);
> -		sr->notif = NULL;

Here we rely on io_send_zc_cleanup(), correct?

Note that I hit a very bad problem during my tests of SENDMSG_ZC.
The BUG() in first_iovec_segment() triggered very easily.
The problem is io_setup_async_msg() in the partial retry case,
which seems to happen more often with _ZC.

        if (!async_msg->free_iov)
                async_msg->msg.msg_iter.iov = async_msg->fast_iov;

is wrong; it needs to be something like this:

+       if (!kmsg->free_iov) {
+               size_t fast_idx = kmsg->msg.msg_iter.iov - kmsg->fast_iov;
+               async_msg->msg.msg_iter.iov = &async_msg->fast_iov[fast_idx];
+       }

This is because iov_iter_iovec_advance() may change i->iov so that
i->iov_offset is relative only to the first element.
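
To show the failure mode outside the kernel, here is a standalone
sketch (made-up names, not kernel code): a struct that embeds both an
inline iovec array and a cursor pointing into it cannot be copied
naively, because the cursor may have been advanced; the fast_idx
computation above re-bases it as an offset.

#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define FAST_IOV 8

/* Hypothetical stand-in for io_async_msghdr: an inline iovec array
 * plus a cursor that may have been advanced past fully-sent segments,
 * like msg.msg_iter.iov after an iov_iter advance. */
struct msg_state {
	struct iovec fast_iov[FAST_IOV];
	const struct iovec *cur;
};

static void clone_msg_state(struct msg_state *dst, const struct msg_state *src)
{
	/* Recover the advance as an index before copying. */
	size_t fast_idx = src->cur - src->fast_iov;

	memcpy(dst->fast_iov, src->fast_iov, sizeof(dst->fast_iov));
	/* Buggy:   dst->cur = dst->fast_iov;
	 *          discards the advance, leaving the cursor inconsistent
	 *          with the iterator's offset -> BUG() in
	 *          first_iovec_segment().
	 * Correct: re-base the cursor into the new array: */
	dst->cur = &dst->fast_iov[fast_idx];
}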

I'm not sure about the 'kmsg->free_iov' case: do we reuse the caller's
memory, or should we make a copy?
I initially used this
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=e1d3a9f5c7708a37172d258753ed7377eaac9e33
But I didn't test with the non-fast_iov case.

BTW: I tested with 5 vectors with lengths like this: 4, 0, 64, 32, 8388608,
and got a short write of about ~2000000.

I'm not sure if it was already a problem before:

commit 257e84a5377fbbc336ff563833a8712619acce56
io_uring: refactor sendmsg/recvmsg iov managing

But I guess it has been at least a potential problem since
7ba89d2af17aa879dda30f5d5d3f152e587fc551, where io_net_retry()
was introduced.

metze
Pavel Begunkov Sept. 28, 2022, 4:56 p.m. UTC | #3
On 9/28/22 16:23, Stefan Metzmacher wrote:
> 
> Hi Pavel,
> 
>> We currently only add a notification CQE when the send succeeded, i.e.
>> cqe.res >= 0. However, it'd be more robust to do buffer notifications
>> for failed requests as well, in case drivers decide to do something funky.
>>
>> Always return a buffer notification after initial prep, don't hide it.
>> This behaviour is better aligned with the documentation, and the
>> patch also helps userspace respect it.
> 
> Just as reference, this was the version I was testing with:
> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=7ffb896cdb8ccd55065f7ffae9fb8050e39211c7
> 
>>   void io_sendrecv_fail(struct io_kiocb *req)
>>   {
>>       struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
>> -    int res = req->cqe.res;
>>       if (req->flags & REQ_F_PARTIAL_IO)
>> -        res = sr->done_io;
>> +        req->cqe.res = sr->done_io;
>> +
>>       if ((req->flags & REQ_F_NEED_CLEANUP) &&
>> -        (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) {
>> -        /* preserve notification for partial I/O */
>> -        if (res < 0)
>> -            sr->notif->flags |= REQ_F_CQE_SKIP;
>> -        io_notif_flush(sr->notif);
>> -        sr->notif = NULL;
> 
> Here we rely on io_send_zc_cleanup(), correct?
> 
> Note that I hit a very bad problem during my tests of SENDMSG_ZC.
> The BUG() in first_iovec_segment() triggered very easily.
> The problem is io_setup_async_msg() in the partial retry case,
> which seems to happen more often with _ZC.
> 
>         if (!async_msg->free_iov)
>                 async_msg->msg.msg_iter.iov = async_msg->fast_iov;
> 
> is wrong; it needs to be something like this:
> 
> +       if (!kmsg->free_iov) {
> +               size_t fast_idx = kmsg->msg.msg_iter.iov - kmsg->fast_iov;
> +               async_msg->msg.msg_iter.iov = &async_msg->fast_iov[fast_idx];
> +       }

I agree, it doesn't look right. It indeed needs something like
io_uring/rw.c:io_req_map_rw().


> This is because iov_iter_iovec_advance() may change i->iov so that
> i->iov_offset is relative only to the first element.
> 
> I'm not sure about the 'kmsg->free_iov' case: do we reuse the caller's
> memory, or should we make a copy?
> I initially used this
> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=e1d3a9f5c7708a37172d258753ed7377eaac9e33
> But I didn't test with the non-fast_iov case.
> 
> BTW: I tested with 5 vectors with lengths like this: 4, 0, 64, 32, 8388608,
> and got a short write of about ~2000000.
> 
> I'm not sure if it was already a problem before:
> 
> commit 257e84a5377fbbc336ff563833a8712619acce56
> io_uring: refactor sendmsg/recvmsg iov managing
> 
> But I guess it has been at least a potential problem since
> 7ba89d2af17aa879dda30f5d5d3f152e587fc551, where io_net_retry()
> was introduced.
> 
> metze
Pavel Begunkov Sept. 28, 2022, 6:58 p.m. UTC | #4
On 9/28/22 17:56, Pavel Begunkov wrote:
> On 9/28/22 16:23, Stefan Metzmacher wrote:
>> Just as reference, this was the version I was testing with:
>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=7ffb896cdb8ccd55065f7ffae9fb8050e39211c7
>>
>>>   void io_sendrecv_fail(struct io_kiocb *req)
>>>   {
>>>       struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
>>> -    int res = req->cqe.res;
>>>       if (req->flags & REQ_F_PARTIAL_IO)
>>> -        res = sr->done_io;
>>> +        req->cqe.res = sr->done_io;
>>> +
>>>       if ((req->flags & REQ_F_NEED_CLEANUP) &&
>>> -        (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) {
>>> -        /* preserve notification for partial I/O */
>>> -        if (res < 0)
>>> -            sr->notif->flags |= REQ_F_CQE_SKIP;
>>> -        io_notif_flush(sr->notif);
>>> -        sr->notif = NULL;
>>
>> Here we rely on io_send_zc_cleanup(), correct?

Right

>> Note that I hit a very bad problem during my tests of SENDMSG_ZC.
>> The BUG() in first_iovec_segment() triggered very easily.
>> The problem is io_setup_async_msg() in the partial retry case,
>> which seems to happen more often with _ZC.
>>
>>         if (!async_msg->free_iov)
>>                 async_msg->msg.msg_iter.iov = async_msg->fast_iov;
>>
>> is wrong; it needs to be something like this:
>>
>> +       if (!kmsg->free_iov) {
>> +               size_t fast_idx = kmsg->msg.msg_iter.iov - kmsg->fast_iov;
>> +               async_msg->msg.msg_iter.iov = &async_msg->fast_iov[fast_idx];
>> +       }
> 
> I agree, it doesn't look right. It indeed needs something like
> io_uring/rw.c:io_req_map_rw().

I took a closer look; that chunk above looks good and matches
io_req_map_rw() apart from non-essential differences. Can you
send a patch?


>> This is because iov_iter_iovec_advance() may change i->iov so that
>> i->iov_offset is relative only to the first element.
>>
>> I'm not sure about the 'kmsg->free_iov' case: do we reuse the caller's
>> memory, or should we make a copy?

We can reuse it; we own it, and it's immutable from the iter's
perspective.

>> BTW: I tested with 5 vectors with lengths like this: 4, 0, 64, 32, 8388608,
>> and got a short write of about ~2000000.

Which is interesting to know. What does 2M mean here? Does it
consistently retry when sending more than 2M bytes?

Patch

diff --git a/io_uring/net.c b/io_uring/net.c
index 6b69eff6887e..5058a9fc9e9c 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -916,7 +916,6 @@  void io_send_zc_cleanup(struct io_kiocb *req)
 			kfree(io->free_iov);
 	}
 	if (zc->notif) {
-		zc->notif->flags |= REQ_F_CQE_SKIP;
 		io_notif_flush(zc->notif);
 		zc->notif = NULL;
 	}
@@ -1047,7 +1046,7 @@  int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 	struct msghdr msg;
 	struct iovec iov;
 	struct socket *sock;
-	unsigned msg_flags, cflags;
+	unsigned msg_flags;
 	int ret, min_ret = 0;
 
 	sock = sock_from_file(req->file);
@@ -1115,8 +1114,6 @@  int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 			req->flags |= REQ_F_PARTIAL_IO;
 			return io_setup_async_addr(req, &__address, issue_flags);
 		}
-		if (ret < 0 && !zc->done_io)
-			zc->notif->flags |= REQ_F_CQE_SKIP;
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 		req_set_fail(req);
@@ -1129,8 +1126,7 @@  int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 
 	io_notif_flush(zc->notif);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	cflags = ret >= 0 ? IORING_CQE_F_MORE : 0;
-	io_req_set_res(req, ret, cflags);
+	io_req_set_res(req, ret, IORING_CQE_F_MORE);
 	return IOU_OK;
 }
 
@@ -1139,7 +1135,7 @@  int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
 	struct io_async_msghdr iomsg, *kmsg;
 	struct socket *sock;
-	unsigned flags, cflags;
+	unsigned flags;
 	int ret, min_ret = 0;
 
 	sock = sock_from_file(req->file);
@@ -1178,8 +1174,6 @@  int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
 			req->flags |= REQ_F_PARTIAL_IO;
 			return io_setup_async_msg(req, kmsg, issue_flags);
 		}
-		if (ret < 0 && !sr->done_io)
-			sr->notif->flags |= REQ_F_CQE_SKIP;
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 		req_set_fail(req);
@@ -1196,27 +1190,20 @@  int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
 
 	io_notif_flush(sr->notif);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	cflags = ret >= 0 ? IORING_CQE_F_MORE : 0;
-	io_req_set_res(req, ret, cflags);
+	io_req_set_res(req, ret, IORING_CQE_F_MORE);
 	return IOU_OK;
 }
 
 void io_sendrecv_fail(struct io_kiocb *req)
 {
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
-	int res = req->cqe.res;
 
 	if (req->flags & REQ_F_PARTIAL_IO)
-		res = sr->done_io;
+		req->cqe.res = sr->done_io;
+
 	if ((req->flags & REQ_F_NEED_CLEANUP) &&
-	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) {
-		/* preserve notification for partial I/O */
-		if (res < 0)
-			sr->notif->flags |= REQ_F_CQE_SKIP;
-		io_notif_flush(sr->notif);
-		sr->notif = NULL;
-	}
-	io_req_set_res(req, res, req->cqe.flags);
+	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
+		req->cqe.flags |= IORING_CQE_F_MORE;
 }
 
 int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)