[RFC,v2,02/19] skbuff: pass a struct ubuf_info in msghdr

Message ID 7dae2f61ee9a1ad38822870764fcafad43a3fe4e.1640029579.git.asml.silence@gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series io_uring zerocopy tx

Checks

Context                Check     Description
netdev/tree_selection  success   Guessing tree name failed - patch did not apply, async

Commit Message

Pavel Begunkov Dec. 21, 2021, 3:35 p.m. UTC
Instead of the net stack managing ubuf_info, allow it to be passed in
from outside in a struct msghdr (an in-kernel structure), so io_uring
can make use of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c          | 2 ++
 include/linux/socket.h | 1 +
 net/compat.c           | 1 +
 net/socket.c           | 3 +++
 4 files changed, 7 insertions(+)

Comments

Hao Xu Jan. 11, 2022, 1:51 p.m. UTC | #1
On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
> Instead of the net stack managing ubuf_info, allow it to be passed in
> from outside in a struct msghdr (an in-kernel structure), so io_uring
> can make use of it.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
Hi Pavel,
I have some questions here, since my network knowledge is lacking.
The first one is: why do we make ubuf_info visible to io_uring?
Why not just follow the old MSG_ZEROCOPY logic?

The second one: my understanding of the buffer lifecycle is that the
kernel informs userspace, via a CQE generated by the ubuf_info
callback, that all the buffers attached to the same notifier are free
to reuse once all the data has been sent. If so, why is the flush in
13/19 needed, given that it happens at submission time?

Regards,
Hao
Pavel Begunkov Jan. 11, 2022, 3:50 p.m. UTC | #2
On 1/11/22 13:51, Hao Xu wrote:
> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>> Instead of the net stack managing ubuf_info, allow it to be passed in
>> from outside in a struct msghdr (an in-kernel structure), so io_uring
>> can make use of it.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
> Hi Pavel,
> I have some questions here, since my network knowledge is lacking.
> The first one is: why do we make ubuf_info visible to io_uring?
> Why not just follow the old MSG_ZEROCOPY logic?

I assume you mean leaving allocation and so on up to the socket, while
the patchset lets io_uring manage and control ubufs. In short:
performance and convenience.

TL;DR:
First, we want a nice and uniform API with io_uring, i.e. posting a
CQE instead of polling an error queue etc., and for that the network
stack needs to know about the io_uring ctx in some way. As an
alternative, it could theoretically be registered in the socket, but
that would quickly turn into a huge mess, considering that it's a
many-to-many relation between io_uring instances and sockets. The fact
that io_uring holds refs to files would only complicate it.

It would also limit the API. For instance, we wouldn't be able to use
a single ubuf with several different sockets.

Another problem is performance: registration or some other trick would
require additional synchronisation. It would also need synchronisation
on use; even if that's just one rcu_read_lock(), it adds complexity
and prevents other optimisations. E.g. based on the guarantees
io_uring provides, we amortise the atomics for taking refs during skb
setup to ~0, and not only that. SKBFL_MANAGED_FRAGS can only work with
pages controlled by the issuer, so it needs some context, which is
currently provided by the ubuf. io_uring also caches ubufs, which
relies on io_uring locking, so it removes the kmalloc/free for almost
zero overhead.
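
To sketch the direction (simplified; the exact code lands in later
patches of this series, and the helpers below are the existing
MSG_ZEROCOPY ones), a protocol send path would prefer a caller
provided ubuf over allocating its own:

    struct ubuf_info *uarg = NULL;

    if (msg->msg_ubuf) {
        /* an in-kernel caller (io_uring) owns the notification
         * context and its lifetime; the stack only takes refs
         */
        uarg = msg->msg_ubuf;
        net_zcopy_get(uarg);
    } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
        /* classic MSG_ZEROCOPY: allocated per call by the stack */
        uarg = msg_zerocopy_realloc(sk, len, skb_zcopy(skb));
    }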


> The second one: my understanding of the buffer lifecycle is that the
> kernel informs userspace, via a CQE generated by the ubuf_info
> callback, that all the buffers attached to the same notifier are free
> to reuse once all the data has been sent. If so, why is the flush in
> 13/19 needed, given that it happens at submission time?

Probably I wasn't clear enough. A user has to flush a notifier; only
then is it expected to post a CQE, after all buffers attached to it
are freed. io_uring holds one ubuf ref, which is released on flush.
I also need to add a way to flush without a send.
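
Roughly, the intended lifecycle (pseudo-code; these helper names are
made up for illustration, the real ones live in the io_uring patches):

    notif = io_uring_alloc_notif(ctx);  /* io_uring holds one ref */

    sendmsg(sock, &msg);                /* skb_zcopy_set() takes
                                         * per-skb refs on notif */

    io_uring_flush_notif(notif);        /* drops io_uring's ref; the
                                         * CQE is posted once the last
                                         * skb ref goes away, i.e.
                                         * when all data is sent */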

Will spend some time documenting for next iteration.
Hao Xu Jan. 12, 2022, 3:39 a.m. UTC | #3
On 2022/1/11 11:50 PM, Pavel Begunkov wrote:
> On 1/11/22 13:51, Hao Xu wrote:
>> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>>> Instead of the net stack managing ubuf_info, allow it to be passed in
>>> from outside in a struct msghdr (an in-kernel structure), so io_uring
>>> can make use of it.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>> Hi Pavel,
>> I have some questions here, since my network knowledge is lacking.
>> The first one is: why do we make ubuf_info visible to io_uring?
>> Why not just follow the old MSG_ZEROCOPY logic?
> 
> I assume you mean leaving allocation and so on up to the socket, while
> the patchset lets io_uring manage and control ubufs. In short:
> performance and convenience.
> 
> TL;DR:
> First, we want a nice and uniform API with io_uring, i.e. posting a
> CQE instead of polling an error queue etc., and for that the network
> stack needs to know about the io_uring ctx in some way. As an
> alternative, it could theoretically be registered in the socket, but
> that would quickly turn into a huge mess, considering that it's a
> many-to-many relation between io_uring instances and sockets. The fact
> that io_uring holds refs to files would only complicate it.
Makes sense to me, thanks.
> 
> It would also limit the API. For instance, we wouldn't be able to use
> a single ubuf with several different sockets.
Are there any use cases for multiple sockets with a single
notification?
> 
> Another problem is performance: registration or some other trick would
> require additional synchronisation. It would also need synchronisation
> on use; even if that's just one rcu_read_lock(), it adds complexity
> and prevents other optimisations. E.g. based on the guarantees
> io_uring provides, we amortise the atomics for taking refs during skb
> setup to ~0, and not only that. SKBFL_MANAGED_FRAGS can only work with
> pages controlled by the issuer, so it needs some context, which is
> currently provided by the ubuf. io_uring also caches ubufs, which
> relies on io_uring locking, so it removes the kmalloc/free for almost
> zero overhead.
> 
> 
>> The second one: my understanding of the buffer lifecycle is that the
>> kernel informs userspace, via a CQE generated by the ubuf_info
>> callback, that all the buffers attached to the same notifier are free
>> to reuse once all the data has been sent. If so, why is the flush in
>> 13/19 needed, given that it happens at submission time?
> 
> Probably I wasn't clear enough. A user has to flush a notifier; only
> then is it expected to post a CQE, after all buffers attached to it
> are freed. io_uring holds one ubuf ref, which is released on flush.
I see. I saw another ref increment in skb_zcopy_set(), which I
previously misunderstood, and thus thought there was only one
reference. Thanks!
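
For reference, the relevant part of skb_zcopy_set() (paraphrased from
include/linux/skbuff.h of this period; not part of this patch) is:

    static inline void skb_zcopy_set(struct sk_buff *skb,
                                     struct ubuf_info *uarg,
                                     bool *have_ref)
    {
        if (skb && uarg && !skb_zcopy(skb)) {
            if (unlikely(have_ref && *have_ref))
                *have_ref = false;   /* steal the caller's ref */
            else
                net_zcopy_get(uarg); /* otherwise take a new one */
            skb_zcopy_init(skb, uarg);
        }
    }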
> I also need to add a way to flush without a send.
> 
> Will spend some time documenting for next iteration.
>
Pavel Begunkov Jan. 12, 2022, 4:53 p.m. UTC | #4
On 1/12/22 03:39, Hao Xu wrote:
> On 2022/1/11 11:50 PM, Pavel Begunkov wrote:
>> On 1/11/22 13:51, Hao Xu wrote:
>>> On 2021/12/21 11:35 PM, Pavel Begunkov wrote:
>>>> Instead of the net stack managing ubuf_info, allow it to be passed in
>>>> from outside in a struct msghdr (an in-kernel structure), so io_uring
>>>> can make use of it.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>> Hi Pavel,
>>> I have some questions here, since my network knowledge is lacking.
>>> The first one is: why do we make ubuf_info visible to io_uring?
>>> Why not just follow the old MSG_ZEROCOPY logic?
>>
>> I assume you mean leaving allocation and so on up to the socket, while
>> the patchset lets io_uring manage and control ubufs. In short:
>> performance and convenience.
>>
>> TL;DR:
>> First, we want a nice and uniform API with io_uring, i.e. posting a
>> CQE instead of polling an error queue etc., and for that the network
>> stack needs to know about the io_uring ctx in some way. As an
>> alternative, it could theoretically be registered in the socket, but
>> that would quickly turn into a huge mess, considering that it's a
>> many-to-many relation between io_uring instances and sockets. The fact
>> that io_uring holds refs to files would only complicate it.
> Makes sense to me, thanks.
>>
>> It would also limit the API. For instance, we wouldn't be able to use
>> a single ubuf with several different sockets.
> Are there any use cases for multiple sockets with a single
> notification?

Don't know, scatter send maybe? It's just one of those moments when
a design that feels right (to me) yields more flexibility than was
initially planned, which is definitely a good thing.


>> Another problem is performance: registration or some other trick would
>> require additional synchronisation. It would also need synchronisation
>> on use; even if that's just one rcu_read_lock(), it adds complexity
>> and prevents other optimisations. E.g. based on the guarantees
>> io_uring provides, we amortise the atomics for taking refs during skb
>> setup to ~0, and not only that. SKBFL_MANAGED_FRAGS can only work with
>> pages controlled by the issuer, so it needs some context, which is
>> currently provided by the ubuf. io_uring also caches ubufs, which
>> relies on io_uring locking, so it removes the kmalloc/free for almost
>> zero overhead.
>>
>>
>>> The second one: my understanding of the buffer lifecycle is that the
>>> kernel informs userspace, via a CQE generated by the ubuf_info
>>> callback, that all the buffers attached to the same notifier are free
>>> to reuse once all the data has been sent. If so, why is the flush in
>>> 13/19 needed, given that it happens at submission time?
>>
>> Probably I wasn't clear enough. A user has to flush a notifier; only
>> then is it expected to post a CQE, after all buffers attached to it
>> are freed. io_uring holds one ubuf ref, which is released on flush.
> I see. I saw another ref increment in skb_zcopy_set(), which I
> previously misunderstood, and thus thought there was only one
> reference. Thanks!
>> I also need to add a way to flush without a send.
>>
>> Will spend some time documenting for next iteration.

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 72da3a75521a..59380e3454ad 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4911,6 +4911,7 @@  static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (issue_flags & IO_URING_F_NONBLOCK)
@@ -5157,6 +5158,7 @@  static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (force_nonblock)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8ef26d89ef49..6bd2c6b0c6f2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,7 @@  struct msghdr {
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	unsigned int	msg_flags;	/* flags on received message */
 	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
 };
 
 struct user_msghdr {
diff --git a/net/compat.c b/net/compat.c
index 210fc3b4d0d8..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,7 @@  int __get_compat_msghdr(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*ptr = msg.msg_iov;
 	*len = msg.msg_iovlen;
 	return 0;
diff --git a/net/socket.c b/net/socket.c
index 7f64a6eccf63..0a29b616a38c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2023,6 +2023,7 @@  int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 	if (addr) {
 		err = move_addr_to_kernel(addr, addr_len, &address);
 		if (err < 0)
@@ -2088,6 +2089,7 @@  int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 	err = sock_recvmsg(sock, &msg, flags);
@@ -2326,6 +2328,7 @@  int __copy_msghdr_from_user(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*uiov = msg.msg_iov;
 	*nsegs = msg.msg_iovlen;
 	return 0;