[RFC,v2,02/19] skbuff: pass a struct ubuf_info in msghdr

Message ID 7dae2f61ee9a1ad38822870764fcafad43a3fe4e.1640029579.git.asml.silence@gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series io_uring zerocopy tx

Checks

Context                Check     Description
netdev/tree_selection  success   Guessing tree name failed - patch did not apply, async

Commit Message

Pavel Begunkov Dec. 21, 2021, 3:35 p.m. UTC
Instead of the net stack managing ubuf_info, allow it to be passed in
from outside in a struct msghdr (an in-kernel structure), so io_uring
can make use of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c          | 2 ++
 include/linux/socket.h | 1 +
 net/compat.c           | 1 +
 net/socket.c           | 3 +++
 4 files changed, 7 insertions(+)

Comments

Hao Xu Jan. 11, 2022, 1:51 p.m. UTC | #1
On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
> Instead of the net stack managing ubuf_info, allow it to be passed in
> from outside in a struct msghdr (an in-kernel structure), so io_uring
> can make use of it.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
Hi Pavel,
I have some questions here, since my network knowledge is lacking.
The first one is: why do we make ubuf_info visible to io_uring?
Why not just follow the old MSG_ZEROCOPY logic?

The second one: my understanding of the buffer lifecycle is that the
kernel informs userspace, via a CQE generated by the ubuf_info
callback, that all the buffers attached to the same notifier are free
to reuse once all the data has been sent. If so, why is the flush in
13/19 needed, given that it happens at submission time?

Regards,
Hao
Pavel Begunkov Jan. 11, 2022, 3:50 p.m. UTC | #2
On 1/11/22 13:51, Hao Xu wrote:
> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>> Instead of the net stack managing ubuf_info, allow it to be passed in
>> from outside in a struct msghdr (an in-kernel structure), so io_uring
>> can make use of it.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
> Hi Pavel,
> I have some questions here, since my network knowledge is lacking.
> The first one is: why do we make ubuf_info visible to io_uring?
> Why not just follow the old MSG_ZEROCOPY logic?

I assume you mean leaving allocation and so on up to the socket, while
the patchset lets io_uring manage and control ubufs. In short:
performance and convenience.

TL;DR:
First, we want a nice and uniform API with io_uring, i.e. posting a
CQE instead of polling an error queue etc., and for that the network
stack needs to know about the io_uring ctx in some way. As an
alternative, it could theoretically be registered in the socket, but
that would quickly turn into a huge mess, considering that it's a
many-to-many relation between io_uring instances and sockets. The fact
that io_uring holds refs to files would only complicate it.

It would also limit the API. For instance, we wouldn't be able to use
a single ubuf with several different sockets.

Another problem is performance: registration or some other trick would
require additional synchronisation. It would also need synchronisation
on use; even if that's just one rcu_read_lock(), it adds complexity
and prevents other optimisations. E.g. based on the guarantees
io_uring provides, we amortise the atomics for taking refs during skb
setup to ~0, and not only that. SKBFL_MANAGED_FRAGS can only work with
pages controlled by the issuer, so it needs some context, which is
currently provided by the ubuf. io_uring also caches ubufs, which
relies on io_uring locking, so it removes the kmalloc/free for almost
zero overhead.
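
To sketch the direction (simplified; the exact code lands in later
patches of this series, and the helpers below are the existing
MSG_ZEROCOPY ones), a protocol send path would prefer a caller
provided ubuf over allocating its own:

    struct ubuf_info *uarg = NULL;

    if (msg->msg_ubuf) {
        /* an in-kernel caller (io_uring) owns the notification
         * context and its lifetime; the stack only takes refs
         */
        uarg = msg->msg_ubuf;
        net_zcopy_get(uarg);
    } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
        /* classic MSG_ZEROCOPY: allocated per call by the stack */
        uarg = msg_zerocopy_realloc(sk, len, skb_zcopy(skb));
    }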


> The second one: my understanding of the buffer lifecycle is that the
> kernel informs userspace, via a CQE generated by the ubuf_info
> callback, that all the buffers attached to the same notifier are free
> to reuse once all the data has been sent. If so, why is the flush in
> 13/19 needed, given that it happens at submission time?

Probably I wasn't clear enough. A user has to flush a notifier; only
then is it expected to post a CQE, after all buffers attached to it
are freed. io_uring holds one ubuf ref, which is released on flush.
I also need to add a way to flush without a send.
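
Roughly, the intended lifecycle (pseudo-code; these helper names are
made up for illustration, the real ones live in the io_uring patches):

    notif = io_uring_alloc_notif(ctx);  /* io_uring holds one ref */

    sendmsg(sock, &msg);                /* skb_zcopy_set() takes
                                         * per-skb refs on notif */

    io_uring_flush_notif(notif);        /* drops io_uring's ref; the
                                         * CQE is posted once the last
                                         * skb ref goes away, i.e.
                                         * when all data is sent */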

Will spend some time documenting for next iteration.
Hao Xu Jan. 12, 2022, 3:39 a.m. UTC | #3
On 2022/1/11 11:50 PM, Pavel Begunkov wrote:
> On 1/11/22 13:51, Hao Xu wrote:
>> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>>> Instead of the net stack managing ubuf_info, allow it to be passed in
>>> from outside in a struct msghdr (an in-kernel structure), so io_uring
>>> can make use of it.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>> Hi Pavel,
>> I have some questions here, since my network knowledge is lacking.
>> The first one is: why do we make ubuf_info visible to io_uring?
>> Why not just follow the old MSG_ZEROCOPY logic?
> 
> I assume you mean leaving allocation and so on up to the socket, while
> the patchset lets io_uring manage and control ubufs. In short:
> performance and convenience.
> 
> TL;DR:
> First, we want a nice and uniform API with io_uring, i.e. posting a
> CQE instead of polling an error queue etc., and for that the network
> stack needs to know about the io_uring ctx in some way. As an
> alternative, it could theoretically be registered in the socket, but
> that would quickly turn into a huge mess, considering that it's a
> many-to-many relation between io_uring instances and sockets. The fact
> that io_uring holds refs to files would only complicate it.
Makes sense to me, thanks.
> 
> It would also limit the API. For instance, we wouldn't be able to use
> a single ubuf with several different sockets.
Are there any use cases for multiple sockets with a single
notification?
> 
> Another problem is performance: registration or some other trick would
> require additional synchronisation. It would also need synchronisation
> on use; even if that's just one rcu_read_lock(), it adds complexity
> and prevents other optimisations. E.g. based on the guarantees
> io_uring provides, we amortise the atomics for taking refs during skb
> setup to ~0, and not only that. SKBFL_MANAGED_FRAGS can only work with
> pages controlled by the issuer, so it needs some context, which is
> currently provided by the ubuf. io_uring also caches ubufs, which
> relies on io_uring locking, so it removes the kmalloc/free for almost
> zero overhead.
> 
> 
>> The second one: my understanding of the buffer lifecycle is that the
>> kernel informs userspace, via a CQE generated by the ubuf_info
>> callback, that all the buffers attached to the same notifier are free
>> to reuse once all the data has been sent. If so, why is the flush in
>> 13/19 needed, given that it happens at submission time?
> 
> Probably I wasn't clear enough. A user has to flush a notifier; only
> then is it expected to post a CQE, after all buffers attached to it
> are freed. io_uring holds one ubuf ref, which is released on flush.
I see. I saw another ref increment in skb_zcopy_set(), which I
previously misunderstood, and thus thought there was only one
reference. Thanks!
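
For reference, the relevant part of skb_zcopy_set() (paraphrased from
include/linux/skbuff.h of this period; not part of this patch) is:

    static inline void skb_zcopy_set(struct sk_buff *skb,
                                     struct ubuf_info *uarg,
                                     bool *have_ref)
    {
        if (skb && uarg && !skb_zcopy(skb)) {
            if (unlikely(have_ref && *have_ref))
                *have_ref = false;   /* steal the caller's ref */
            else
                net_zcopy_get(uarg); /* otherwise take a new one */
            skb_zcopy_init(skb, uarg);
        }
    }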
> I also need to add a way to flush without a send.
> 
> Will spend some time documenting for next iteration.
>
Pavel Begunkov Jan. 12, 2022, 4:53 p.m. UTC | #4
On 1/12/22 03:39, Hao Xu wrote:
> On 2022/1/11 11:50 PM, Pavel Begunkov wrote:
>> On 1/11/22 13:51, Hao Xu wrote:
>>> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>>>> Instead of the net stack managing ubuf_info, allow it to be passed in
>>>> from outside in a struct msghdr (an in-kernel structure), so io_uring
>>>> can make use of it.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>> Hi Pavel,
>>> I have some questions here, since my network knowledge is lacking.
>>> The first one is: why do we make ubuf_info visible to io_uring?
>>> Why not just follow the old MSG_ZEROCOPY logic?
>>
>> I assume you mean leaving allocation and so on up to the socket, while
>> the patchset lets io_uring manage and control ubufs. In short:
>> performance and convenience.
>>
>> TL;DR:
>> First, we want a nice and uniform API with io_uring, i.e. posting a
>> CQE instead of polling an error queue etc., and for that the network
>> stack needs to know about the io_uring ctx in some way. As an
>> alternative, it could theoretically be registered in the socket, but
>> that would quickly turn into a huge mess, considering that it's a
>> many-to-many relation between io_uring instances and sockets. The fact
>> that io_uring holds refs to files would only complicate it.
> Makes sense to me, thanks.
>>
>> It would also limit the API. For instance, we wouldn't be able to use
>> a single ubuf with several different sockets.
> Are there any use cases for multiple sockets with a single
> notification?

Don't know, scatter send maybe? It's just one of those moments when
a design that feels right (to me) yields more flexibility than was
initially planned, which is definitely a good thing.


>> Another problem is performance: registration or some other trick would
>> require additional synchronisation. It would also need synchronisation
>> on use; even if that's just one rcu_read_lock(), it adds complexity
>> and prevents other optimisations. E.g. based on the guarantees
>> io_uring provides, we amortise the atomics for taking refs during skb
>> setup to ~0, and not only that. SKBFL_MANAGED_FRAGS can only work with
>> pages controlled by the issuer, so it needs some context, which is
>> currently provided by the ubuf. io_uring also caches ubufs, which
>> relies on io_uring locking, so it removes the kmalloc/free for almost
>> zero overhead.
>>
>>
>>> The second one: my understanding of the buffer lifecycle is that the
>>> kernel informs userspace, via a CQE generated by the ubuf_info
>>> callback, that all the buffers attached to the same notifier are free
>>> to reuse once all the data has been sent. If so, why is the flush in
>>> 13/19 needed, given that it happens at submission time?
>>
>> Probably I wasn't clear enough. A user has to flush a notifier; only
>> then is it expected to post a CQE, after all buffers attached to it
>> are freed. io_uring holds one ubuf ref, which is released on flush.
> I see. I saw another ref increment in skb_zcopy_set(), which I
> previously misunderstood, and thus thought there was only one
> reference. Thanks!
>> I also need to add a way to flush without a send.
>>
>> Will spend some time documenting for next iteration.

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 72da3a75521a..59380e3454ad 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4911,6 +4911,7 @@  static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (issue_flags & IO_URING_F_NONBLOCK)
@@ -5157,6 +5158,7 @@  static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (force_nonblock)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8ef26d89ef49..6bd2c6b0c6f2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,7 @@  struct msghdr {
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	unsigned int	msg_flags;	/* flags on received message */
 	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
 };
 
 struct user_msghdr {
diff --git a/net/compat.c b/net/compat.c
index 210fc3b4d0d8..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,7 @@  int __get_compat_msghdr(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*ptr = msg.msg_iov;
 	*len = msg.msg_iovlen;
 	return 0;
diff --git a/net/socket.c b/net/socket.c
index 7f64a6eccf63..0a29b616a38c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2023,6 +2023,7 @@  int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 	if (addr) {
 		err = move_addr_to_kernel(addr, addr_len, &address);
 		if (err < 0)
@@ -2088,6 +2089,7 @@  int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 	err = sock_recvmsg(sock, &msg, flags);
@@ -2326,6 +2328,7 @@  int __copy_msghdr_from_user(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*uiov = msg.msg_iov;
 	*nsegs = msg.msg_iovlen;
 	return 0;