Message ID | 76cdd53f618e2793e1ec298c837bb17c3b9f12ee.1663363798.git.metze@samba.org (mailing list archive)
---|---
State | Awaiting Upstream
Delegated to | Netdev Maintainers
On 9/16/22 22:36, Stefan Metzmacher wrote:
> The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
> and it will now have the number of successfully completed
> io_uring_tx_zerocopy_callback() callbacks in the lower 31 bits
> of cqe->res; the high bit (0x80000000) is set when
> io_uring_tx_zerocopy_callback() was called with success=false.

It has a couple of problems, and because that "simplify uapi"
patch is transitional it doesn't go well with what I'm queuing
for 6.1, let's hold it for a while.

> If cqe->res is still 0, zero copy wasn't used at all.
>
> These values give userspace a chance to adjust its strategy
> when choosing IORING_OP_SEND_ZC or IORING_OP_SEND. And it's a bit
> richer than just a simple SO_EE_CODE_ZEROCOPY_COPIED indication.
>
> [...]
On 17.09.22 11:22, Pavel Begunkov wrote:
> On 9/16/22 22:36, Stefan Metzmacher wrote:
>> The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
>> and it will now have the number of successfully completed
>> io_uring_tx_zerocopy_callback() callbacks in the lower 31 bits
>> of cqe->res; the high bit (0x80000000) is set when
>> io_uring_tx_zerocopy_callback() was called with success=false.
>
> It has a couple of problems, and because that "simplify uapi"
> patch is transitional it doesn't go well with what I'm queuing
> for 6.1, let's hold it for a while.

Once the current behavior gets released as stable, we're no
longer able to change the meaning of cqe.res.

Since cqe.res == 0 would mean zero copy wasn't used at all,
it would be the indication for userspace to avoid using SEND_ZC.

But if 6.0 always returns cqe.res = 0, there's no chance for
userspace to have a detection strategy.

And I don't think it will cause a lot of trouble for your 6.1 stuff
(assuming you mean your SENDMSG_ZC code); I was already carrying that
on top of my test branches, the current one is:
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-6.0.0-rc5-metze.08

I plan to test SENDMSG_ZC with Samba next week.

metze
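As a rough illustration of the detection strategy described above, userspace could decode the notification CQE like this. This is a minimal sketch against liburing, assuming the cqe->res semantics proposed in this patch (not the released 6.0 behavior); `prefer_plain_send` is a made-up helper name:

```c
#include <liburing.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: decides whether to fall back to IORING_OP_SEND,
 * based on an IORING_OP_SEND_ZC notification CQE, assuming the cqe->res
 * encoding proposed in this patch. */
static bool prefer_plain_send(const struct io_uring_cqe *cqe)
{
        if (!(cqe->flags & IORING_CQE_F_NOTIF))
                return false;   /* not a notification CQE, no signal here */

        uint32_t res = (uint32_t)cqe->res;
        bool copied_at_least_once = res & 0x80000000u;  /* high bit */
        uint32_t zc_success_count = res & 0x7fffffffu;  /* lower 31 bits */

        /* cqe->res == 0 means zero copy wasn't used at all. */
        return zc_success_count == 0 || copied_at_least_once;
}
```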
On 9/17/22 11:24, Stefan Metzmacher wrote:
> On 17.09.22 11:22, Pavel Begunkov wrote:
>> On 9/16/22 22:36, Stefan Metzmacher wrote:
>>> [...]
>>
>> It has a couple of problems, and because that "simplify uapi"
>> patch is transitional it doesn't go well with what I'm queuing
>> for 6.1, let's hold it for a while.
>
> Once the current behavior gets released as stable, we're no
> longer able to change the meaning of cqe.res.
>
> Since cqe.res == 0 would mean zero copy wasn't used at all,
> it would be the indication for userspace to avoid using SEND_ZC.
>
> But if 6.0 always returns cqe.res = 0, there's no chance for
> userspace to have a detection strategy.
>
> And I don't think it will cause a lot of trouble for your 6.1 stuff
> (assuming you mean your SENDMSG_ZC code); I was already carrying that
> on top of my test branches, the current one is:
> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-6.0.0-rc5-metze.08

Not that one though. 1) I want to shrink ubuf_info as we're a bit out
of space on the io_uring side and it prevents some embedding, so there
won't be the additional fields you used. And 2) we want to have a feature
merging the completion and notification CQEs into one (useful for UDP and
some TCP cases), but that would mean we'd be using cqe->res for the normal
return value.

We can disable the success counting in this case, but it's not nice,
and we can't actually count in io_uring_tx_zerocopy_callback() as in
the patch (it's racy). Also, the callback will be called multiple times
for a number of different reasons, like io_uring flush or the net layer
splitting skbs. The number won't be very useful and unnecessarily exposes
internal details, so I think F_COPIED in cqe->flags is a better option.

It's a good question though whether we need more versatile reporting and
how to do it right. It should probably be optional and not part of the
IO path, e.g. send(MSG_PROBE) / ioctl / proc stats / etc.

> I plan to test SENDMSG_ZC with Samba next week.

Awesome
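For comparison, the F_COPIED alternative would reduce the userspace side to a single flag test. A sketch, with the caveat that IORING_CQE_F_COPIED is only a proposal in this thread: both the name and the bit value below are hypothetical, not part of the released uapi:

```c
#include <stdbool.h>
#include <liburing.h>

/* Hypothetical: proposed in this thread, not in io_uring.h; the bit
 * value here is illustrative only. */
#define IORING_CQE_F_COPIED (1U << 15)

/* True if the notification CQE reports at least one copy fallback. */
static bool send_zc_was_copied(const struct io_uring_cqe *cqe)
{
        return (cqe->flags & IORING_CQE_F_NOTIF) &&
               (cqe->flags & IORING_CQE_F_COPIED);
}
```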
On 21.09.22 14:04, Pavel Begunkov wrote:
> On 9/17/22 11:24, Stefan Metzmacher wrote:
>> On 17.09.22 11:22, Pavel Begunkov wrote:
>>> [...]
>>
>> Once the current behavior gets released as stable, we're no
>> longer able to change the meaning of cqe.res.
>>
>> [...]
>>
>> And I don't think it will cause a lot of trouble for your 6.1 stuff
>> (assuming you mean your SENDMSG_ZC code); I was already carrying that
>> on top of my test branches, the current one is:
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-6.0.0-rc5-metze.08
>
> Not that one though. 1) I want to shrink ubuf_info as we're a bit out
> of space on the io_uring side and it prevents some embedding, so there
> won't be the additional fields you used. And 2) we want to have a feature
> merging the completion and notification CQEs into one (useful for UDP and
> some TCP cases), but that would mean we'd be using cqe->res for the normal
> return value.

But wouldn't that just not have the MORE flag set, and be just like
IORING_OP_SEND?

> We can disable the success counting in this case, but it's not nice,
> and we can't actually count in io_uring_tx_zerocopy_callback() as in
> the patch (it's racy). Also, the callback will be called multiple times
> for a number of different reasons, like io_uring flush or the net layer
> splitting skbs. The number won't be very useful and unnecessarily exposes
> internal details, so I think F_COPIED in cqe->flags is a better option.

Ok, that's better than nothing.

> It's a good question though whether we need more versatile reporting and
> how to do it right. It should probably be optional and not part of the
> IO path, e.g. send(MSG_PROBE) / ioctl / proc stats / etc.
>
>> I plan to test SENDMSG_ZC with Samba next week.
>
> Awesome

Currently I'm fighting with converting Samba's libtevent to use POLL_ADD
and io_uring_submit_and_wait_timeout(), which turns out to be much more
complex than it should be. And IORING_POLL_ADD_LEVEL doesn't seem to work
at all... But I'll write a separate mail about my findings in that area.

metze
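For reference, a condensed sketch of the submit-and-wait pattern mentioned above, built on liburing's io_uring_submit_and_wait_timeout(). The loop body and error handling are simplified assumptions, not tevent's actual integration:

```c
#include <errno.h>
#include <liburing.h>

/* One event-loop iteration: submit pending SQEs, wait up to *ts for at
 * least one completion, then drain whatever arrived. */
static int loop_once(struct io_uring *ring, struct __kernel_timespec *ts)
{
        struct io_uring_cqe *cqe = NULL;
        unsigned head, seen = 0;
        int ret;

        ret = io_uring_submit_and_wait_timeout(ring, &cqe, 1, ts, NULL);
        if (ret < 0 && ret != -ETIME)
                return ret;     /* submission or wait failed */

        io_uring_for_each_cqe(ring, head, cqe) {
                /* dispatch on cqe->user_data to the registered handler */
                seen++;
        }
        io_uring_cq_advance(ring, seen);
        return 0;
}
```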
```diff
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..b07d2a049931 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -28,7 +28,24 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 	struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
 	struct io_kiocb *notif = cmd_to_io_kiocb(nd);
 
+	uarg->zerocopy = uarg->zerocopy & success;
+
+	if (success && notif->cqe.res < S32_MAX)
+		notif->cqe.res++;
+
 	if (refcount_dec_and_test(&uarg->refcnt)) {
+		/*
+		 * If we hit at least one case that
+		 * was not able to use zero copy,
+		 * we set the high bit 0x80000000
+		 * so that notif->cqe.res < 0 means it was
+		 * at least copied once.
+		 *
+		 * The other 31 bits are the success count.
+		 */
+		if (!uarg->zerocopy)
+			notif->cqe.res |= S32_MIN;
+
 		notif->io_task_work.func = __io_notif_complete_tw;
 		io_req_task_work_add(notif);
 	}
@@ -53,6 +70,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
 
 	nd = io_notif_to_data(notif);
 	nd->account_pages = 0;
+	nd->uarg.zerocopy = 1;
 	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	nd->uarg.callback = io_uring_tx_zerocopy_callback;
 	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d7bd1daf022b..4bdea7a4b2f7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,7 +1032,8 @@ static int __ip_append_data(struct sock *sk,
 			paged = true;
 			zc = true;
 			uarg = msg->msg_ubuf;
-		}
+		} else
+			msg->msg_ubuf->zerocopy = 0;
 	} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 970e9a2cca4a..27a22d470741 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1231,6 +1231,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		uarg = msg->msg_ubuf;
 		net_zcopy_get(uarg);
 		zc = sk->sk_route_caps & NETIF_F_SG;
+		if (!zc)
+			uarg->zerocopy = 0;
 	} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 		uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 		if (!uarg) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index f152e51242cb..d85036e91cf7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1556,7 +1556,8 @@ static int __ip6_append_data(struct sock *sk,
 			paged = true;
 			zc = true;
 			uarg = msg->msg_ubuf;
-		}
+		} else
+			msg->msg_ubuf->zerocopy = 0;
 	} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
 		if (!uarg)
```
The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
and it will now have the number of successfully completed
io_uring_tx_zerocopy_callback() callbacks in the lower 31 bits
of cqe->res; the high bit (0x80000000) is set when
io_uring_tx_zerocopy_callback() was called with success=false.

If cqe->res is still 0, zero copy wasn't used at all.

These values give userspace a chance to adjust its strategy
when choosing IORING_OP_SEND_ZC or IORING_OP_SEND. And it's a bit
richer than just a simple SO_EE_CODE_ZEROCOPY_COPIED indication.

Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")
Fixes: eb315a7d1396b ("tcp: support externally provided ubufs")
Fixes: 1fd3ae8c906c0 ("ipv6/udp: support externally provided ubufs")
Fixes: c445f31b3cfaa ("ipv4/udp: support externally provided ubufs")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
---
 io_uring/notif.c      | 18 ++++++++++++++++++
 net/ipv4/ip_output.c  |  3 ++-
 net/ipv4/tcp.c        |  2 ++
 net/ipv6/ip6_output.c |  3 ++-
 4 files changed, 24 insertions(+), 2 deletions(-)
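To make the two-CQE flow concrete: the sender first sees a CQE with the send result (flagged IORING_CQE_F_MORE), then a notification CQE (flagged IORING_CQE_F_NOTIF) which, with this patch, carries the counters. A minimal liburing sketch, assuming a recent liburing with io_uring_prep_send_zc(), an already-connected socket, and no other requests in flight:

```c
#include <errno.h>
#include <stdio.h>
#include <liburing.h>

/* Submit one zero-copy send and consume both of its CQEs. */
static int send_zc_once(struct io_uring *ring, int sockfd,
                        const void *buf, size_t len)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret, sent;

        if (!sqe)
                return -EBUSY;
        io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
        io_uring_submit(ring);

        /* First CQE: the send result; IORING_CQE_F_MORE announces that
         * a notification CQE for the same request will follow. */
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
                return ret;
        sent = cqe->res;
        io_uring_cqe_seen(ring, cqe);

        /* Second CQE: IORING_CQE_F_NOTIF; with this patch, cqe->res
         * holds the zero-copy success count, plus 0x80000000 if at
         * least one send fell back to copying. */
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
                return ret;
        printf("notif res=0x%x\n", (unsigned)cqe->res);
        io_uring_cqe_seen(ring, cqe);

        return sent;
}
```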