Message ID | 7dae2f61ee9a1ad38822870764fcafad43a3fe4e.1640029579.git.asml.silence@gmail.com
---|---
State | RFC
Delegated to | Netdev Maintainers
Series | io_uring zerocopy tx
Context | Check | Description
---|---|---
netdev/tree_selection | success | Guessing tree name failed - patch did not apply, async
On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
> Instead of the net stack managing ubuf_info, allow it to be passed in
> from outside in a struct msghdr (in-kernel structure), so io_uring can
> make use of it.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
Hi Pavel,
I have some confusion here since I lack networking knowledge.

The first question is why we make ubuf_info visible to io_uring. Why
not just follow the old MSG_ZEROCOPY logic?

The second one: my understanding of the buffer lifecycle is that the
kernel informs userspace, via a CQE generated by the ubuf_info
callback, that all the buffers attached to the same notifier are free
to reuse once all the data has been sent. If so, why is the flush in
13/19 needed, given that it happens at submission time?

Regards,
Hao
On 1/11/22 13:51, Hao Xu wrote:
> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>> Instead of the net stack managing ubuf_info, allow it to be passed in
>> from outside in a struct msghdr (in-kernel structure), so io_uring
>> can make use of it.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
> Hi Pavel,
> I have some confusion here since I lack networking knowledge.
> The first question is why we make ubuf_info visible to io_uring. Why
> not just follow the old MSG_ZEROCOPY logic?

I assume you mean leaving allocation and so on up to the socket, while
the patchset lets io_uring manage and control ubufs. In short:
performance and convenience.

TL;DR: first, we want a nice and uniform API with io_uring, i.e.
posting a CQE instead of polling an error queue etc., and for that the
network stack needs to know about the io_uring ctx in some way. As an
alternative it could theoretically be registered in the socket, but
that would quickly turn into a huge mess; consider that it's a
many-to-many relation between io_urings and sockets. The fact that
io_uring holds refs to files would only complicate it.

It would also limit the API. For instance, we wouldn't be able to use a
single ubuf with several different sockets.

Another problem is performance: registration or other tricks would need
some additional synchronisation. It'd also need synchronisation on use;
say it's just one rcu read, but the problem is that it only adds
complexity and prevents other optimisations. E.g. we amortise the
atomics for taking refs during skb setup to ~0 based on guarantees
io_uring provides, and not only that. SKBFL_MANAGED_FRAGS can only work
with pages controlled by the issuer, so it needs some context, as
currently provided by the ubuf. io_uring also caches ubufs, which
relies on io_uring locking, so it removes kmalloc/free for almost zero
overhead.

> The second one: my understanding of the buffer lifecycle is that the
> kernel informs userspace, via a CQE generated by the ubuf_info
> callback, that all the buffers attached to the same notifier are free
> to reuse once all the data has been sent. If so, why is the flush in
> 13/19 needed, given that it happens at submission time?

Probably I wasn't clear enough. A user has to flush a notifier; only
then is it expected to post a CQE, after all buffers attached to it are
freed. io_uring holds one ubuf ref, which will be released on flush.
I also need to add a way to flush without a send.

Will spend some time documenting it for the next iteration.
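To make the flow described above concrete, here is a minimal sketch of
how an in-kernel issuer might supply its own notifier through the new
msg_ubuf field. The struct my_notif, my_zc_callback(), and my_send()
names are hypothetical; only struct ubuf_info and msg_ubuf come from
the kernel and this patch:

#include <linux/net.h>
#include <linux/skbuff.h>
#include <linux/socket.h>

/* Hypothetical issuer-owned notifier wrapping the generic ubuf_info. */
struct my_notif {
	struct ubuf_info uarg;
	/* issuer-private state, e.g. a link back to a CQE slot */
};

/* Runs once the last reference is dropped, i.e. every skb that used
 * the buffers has been freed; this is where a completion would be
 * posted to userspace.
 */
static void my_zc_callback(struct sk_buff *skb, struct ubuf_info *uarg,
			   bool success)
{
	struct my_notif *notif = container_of(uarg, struct my_notif, uarg);

	/* post a CQE for notif here */
	(void)notif;
}

static int my_send(struct socket *sock, struct iov_iter *from,
		   struct my_notif *notif)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };

	notif->uarg.callback = my_zc_callback;
	refcount_set(&notif->uarg.refcnt, 1);	/* issuer ref, dropped on flush */
	msg.msg_iter = *from;
	msg.msg_ubuf = &notif->uarg;		/* the field added by this patch */
	return sock_sendmsg(sock, &msg);
}

The point of the design shows up in my_send(): the issuer, not the
socket, allocates and owns the notifier, and the network stack only
takes references on it while skbs are in flight, which is what lets
io_uring cache and reuse these structures.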
On 2022/1/11 11:50 PM, Pavel Begunkov wrote:
> On 1/11/22 13:51, Hao Xu wrote:
>> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>>> Instead of the net stack managing ubuf_info, allow it to be passed
>>> in from outside in a struct msghdr (in-kernel structure), so
>>> io_uring can make use of it.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>> Hi Pavel,
>> I have some confusion here since I lack networking knowledge.
>> The first question is why we make ubuf_info visible to io_uring. Why
>> not just follow the old MSG_ZEROCOPY logic?
>
> I assume you mean leaving allocation and so on up to the socket,
> while the patchset lets io_uring manage and control ubufs. In short:
> performance and convenience.
>
> TL;DR: first, we want a nice and uniform API with io_uring, i.e.
> posting a CQE instead of polling an error queue etc., and for that
> the network stack needs to know about the io_uring ctx in some way.
> As an alternative it could theoretically be registered in the socket,
> but that would quickly turn into a huge mess; consider that it's a
> many-to-many relation between io_urings and sockets. The fact that
> io_uring holds refs to files would only complicate it.
Makes sense to me, thanks.
>
> It would also limit the API. For instance, we wouldn't be able to use
> a single ubuf with several different sockets.
Are there any use cases for multiple sockets with a single
notification?
>
> Another problem is performance: registration or other tricks would
> need some additional synchronisation. It'd also need synchronisation
> on use; say it's just one rcu read, but the problem is that it only
> adds complexity and prevents other optimisations. E.g. we amortise
> the atomics for taking refs during skb setup to ~0 based on
> guarantees io_uring provides, and not only that. SKBFL_MANAGED_FRAGS
> can only work with pages controlled by the issuer, so it needs some
> context, as currently provided by the ubuf. io_uring also caches
> ubufs, which relies on io_uring locking, so it removes kmalloc/free
> for almost zero overhead.
>
>> The second one: my understanding of the buffer lifecycle is that the
>> kernel informs userspace, via a CQE generated by the ubuf_info
>> callback, that all the buffers attached to the same notifier are
>> free to reuse once all the data has been sent. If so, why is the
>> flush in 13/19 needed, given that it happens at submission time?
>
> Probably I wasn't clear enough. A user has to flush a notifier; only
> then is it expected to post a CQE, after all buffers attached to it
> are freed. io_uring holds one ubuf ref, which will be released on
> flush.
I see. I saw another ref increment in skb_zcopy_set(), which I had
previously misunderstood, and thus thought there was only one refcount.
Thanks!
> I also need to add a way to flush without a send.
>
> Will spend some time documenting it for the next iteration.
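As a reference for the refcounting just mentioned, here is a simplified
model of the two references (illustrative only, not verbatim kernel
code; the model_* helper names are invented):

#include <linux/refcount.h>
#include <linux/skbuff.h>

/* Each skb attached to the notifier takes its own reference
 * (cf. the increment Hao spotted in skb_zcopy_set()).
 */
static void model_skb_attach(struct ubuf_info *uarg)
{
	refcount_inc(&uarg->refcnt);
}

/* A reference is dropped when an skb is freed, and the issuer's own
 * reference is dropped on flush; the final put runs the callback,
 * which is what would post the CQE.
 */
static void model_put(struct ubuf_info *uarg)
{
	if (refcount_dec_and_test(&uarg->refcnt))
		uarg->callback(NULL, uarg, true);
}

Hence the CQE can only appear after the user has flushed the notifier
and all in-flight skbs referencing it have been freed.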
On 1/12/22 03:39, Hao Xu wrote:
> On 2022/1/11 11:50 PM, Pavel Begunkov wrote:
>> On 1/11/22 13:51, Hao Xu wrote:
>>> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>>>> Instead of the net stack managing ubuf_info, allow it to be passed
>>>> in from outside in a struct msghdr (in-kernel structure), so
>>>> io_uring can make use of it.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>> Hi Pavel,
>>> I have some confusion here since I lack networking knowledge.
>>> The first question is why we make ubuf_info visible to io_uring.
>>> Why not just follow the old MSG_ZEROCOPY logic?
>>
>> I assume you mean leaving allocation and so on up to the socket,
>> while the patchset lets io_uring manage and control ubufs. In short:
>> performance and convenience.
>>
>> TL;DR: first, we want a nice and uniform API with io_uring, i.e.
>> posting a CQE instead of polling an error queue etc., and for that
>> the network stack needs to know about the io_uring ctx in some way.
>> As an alternative it could theoretically be registered in the
>> socket, but that would quickly turn into a huge mess; consider that
>> it's a many-to-many relation between io_urings and sockets. The fact
>> that io_uring holds refs to files would only complicate it.
> Makes sense to me, thanks.
>>
>> It would also limit the API. For instance, we wouldn't be able to
>> use a single ubuf with several different sockets.
> Are there any use cases for multiple sockets with a single
> notification?

Don't know, scatter send maybe? It's just one of those moments when a
design that feels right (to me) yields more flexibility than was
initially planned, which is definitely a good thing.

>> Another problem is performance: registration or other tricks would
>> need some additional synchronisation. It'd also need synchronisation
>> on use; say it's just one rcu read, but the problem is that it only
>> adds complexity and prevents other optimisations. E.g. we amortise
>> the atomics for taking refs during skb setup to ~0 based on
>> guarantees io_uring provides, and not only that. SKBFL_MANAGED_FRAGS
>> can only work with pages controlled by the issuer, so it needs some
>> context, as currently provided by the ubuf. io_uring also caches
>> ubufs, which relies on io_uring locking, so it removes kmalloc/free
>> for almost zero overhead.
>>
>>> The second one: my understanding of the buffer lifecycle is that
>>> the kernel informs userspace, via a CQE generated by the ubuf_info
>>> callback, that all the buffers attached to the same notifier are
>>> free to reuse once all the data has been sent. If so, why is the
>>> flush in 13/19 needed, given that it happens at submission time?
>>
>> Probably I wasn't clear enough. A user has to flush a notifier; only
>> then is it expected to post a CQE, after all buffers attached to it
>> are freed. io_uring holds one ubuf ref, which will be released on
>> flush.
> I see. I saw another ref increment in skb_zcopy_set(), which I had
> previously misunderstood, and thus thought there was only one
> refcount. Thanks!
>> I also need to add a way to flush without a send.
>>
>> Will spend some time documenting it for the next iteration.
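For the multi-socket point, a hypothetical sketch of one notifier
backing sends on two sockets (function and parameter names are
invented, and whether a given protocol accepts msg_ubuf this way
depends on the rest of the series):

#include <linux/net.h>
#include <linux/skbuff.h>
#include <linux/socket.h>

/* One issuer-owned notifier shared by sends on two different sockets;
 * the single completion fires once all skbs from both sends are freed
 * and the notifier is flushed.
 */
static int my_scatter_send(struct socket *sock_a, struct socket *sock_b,
			   struct iov_iter *part_a, struct iov_iter *part_b,
			   struct ubuf_info *uarg)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT, .msg_ubuf = uarg };
	int err;

	msg.msg_iter = *part_a;
	err = sock_sendmsg(sock_a, &msg);
	if (err < 0)
		return err;

	/* same notifier, second socket */
	msg.msg_iter = *part_b;
	return sock_sendmsg(sock_b, &msg);
}

This is exactly what a socket-registered notifier could not express,
since the ubuf would be tied to a single socket.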
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 72da3a75521a..59380e3454ad 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4911,6 +4911,7 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (issue_flags & IO_URING_F_NONBLOCK)
@@ -5157,6 +5158,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (force_nonblock)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8ef26d89ef49..6bd2c6b0c6f2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,7 @@ struct msghdr {
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	unsigned int	msg_flags;	/* flags on received message */
 	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
 };
 
 struct user_msghdr {
diff --git a/net/compat.c b/net/compat.c
index 210fc3b4d0d8..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,7 @@ int __get_compat_msghdr(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*ptr = msg.msg_iov;
 	*len = msg.msg_iovlen;
 	return 0;
diff --git a/net/socket.c b/net/socket.c
index 7f64a6eccf63..0a29b616a38c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2023,6 +2023,7 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 	if (addr) {
 		err = move_addr_to_kernel(addr, addr_len, &address);
 		if (err < 0)
@@ -2088,6 +2089,7 @@ int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 	err = sock_recvmsg(sock, &msg, flags);
@@ -2326,6 +2328,7 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*uiov = msg.msg_iov;
 	*nsegs = msg.msg_iovlen;
 	return 0;
Instead of the net stack managing ubuf_info, allow it to be passed in
from outside in a struct msghdr (in-kernel structure), so io_uring can
make use of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c          | 2 ++
 include/linux/socket.h | 1 +
 net/compat.c           | 1 +
 net/socket.c           | 3 +++
 4 files changed, 7 insertions(+)