
[v1,06/15] net: page_pool: add ->scrub mem provider callback

Message ID 20241007221603.1703699-7-dw@davidwei.uk (mailing list archive)
State New
Headers show
Series io_uring zero copy rx | expand

Commit Message

David Wei Oct. 7, 2024, 10:15 p.m. UTC
From: Pavel Begunkov <asml.silence@gmail.com>

page pool is now waiting for all ppiovs to return before destroying
itself, and for that to happen the memory provider might need to push
some buffers, flush caches and so on.

todo: we'll try to get by without it before the final release

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 include/net/page_pool/types.h | 1 +
 net/core/page_pool.c          | 3 +++
 2 files changed, 4 insertions(+)

Comments

Mina Almasry Oct. 9, 2024, 9 p.m. UTC | #1
On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>
> From: Pavel Begunkov <asml.silence@gmail.com>
>
> page pool is now waiting for all ppiovs to return before destroying
> itself, and for that to happen the memory provider might need to push
> some buffers, flush caches and so on.
>
> todo: we'll try to get by without it before the final release
>

Is the intention to drop this todo and stick with this patch, or to
move ahead without this patch?

To be honest, I think I read in a follow-up patch that you want to
unref all the memory on page_pool_destroy, which is not how the
page_pool is used today. Today page_pool_destroy does not reclaim
memory. Changing that may be OK.

But I'm not sure this is a generic change that should be put into the
page_pool providers. I don't envision other providers implementing
this. I think they'll be more interested in using the page_pool the
way it's used today.

I would suggest that instead of making this a page_pool provider
thing, you have your io_uring code listen to a new generic
notification that the page_pool is being destroyed, or that an
rx-queue is being destroyed, or something like that, and do the
scrubbing based on that, maybe?
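
No such generic notification exists today, so purely as an illustration of
this alternative, a consumer-side sketch using the kernel's notifier-chain
API might look like the below; the event code, the chain it would be
registered on, and io_zcrx_scrub_pool() are all invented for the example.

/* Hypothetical sketch: instead of a ->scrub() provider op, io_uring
 * subscribes to an (invented) "page pool is going away" notification.
 * Only the notifier-chain API is real; PAGE_POOL_NOTIFY_DESTROY, the
 * chain this block would be registered on, and io_zcrx_scrub_pool()
 * are made up for illustration.
 */
#include <linux/notifier.h>
#include <net/page_pool/types.h>

#define PAGE_POOL_NOTIFY_DESTROY	1	/* invented event code */

static int io_zcrx_pp_notify(struct notifier_block *nb, unsigned long event,
			     void *data)
{
	struct page_pool *pool = data;

	if (event == PAGE_POOL_NOTIFY_DESTROY)
		io_zcrx_scrub_pool(pool);	/* same work ->scrub() would do */
	return NOTIFY_OK;
}

static struct notifier_block io_zcrx_pp_nb = {
	.notifier_call	= io_zcrx_pp_notify,
};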
Pavel Begunkov Oct. 9, 2024, 9:59 p.m. UTC | #2
On 10/9/24 22:00, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>
>> From: Pavel Begunkov <asml.silence@gmail.com>
>>
>> page pool is now waiting for all ppiovs to return before destroying
>> itself, and for that to happen the memory provider might need to push
>> some buffers, flush caches and so on.
>>
>> todo: we'll try to get by without it before the final release
>>
> 
> Is the intention to drop this todo and stick with this patch, or to
> move ahead with this patch?

Heh, I overlooked this todo. The plan is to actually leave it
as is; it's by far the simplest way and doesn't really get in
anyone's way as it's a slow path.

> To be honest, I think I read in a follow up patch that you want to
> unref all the memory on page_pool_destory, which is not how the
> page_pool is used today. Tdoay page_pool_destroy does not reclaim
> memory. Changing that may be OK.

It doesn't because it can't (not without breaking anything), which is
a problem as the page pool might never get destroyed. io_uring
doesn't change that: a buffer can't be reclaimed while anything
in the kernel stack holds it. It's only when it's given to the
user that we can force it back out of there.

And it has to happen one way or another; we can't trust the
user to put buffers back. It's just that devmem does that by temporarily
attaching the lifetime of such buffers to a socket.

> But I'm not sure this is generic change that should be put in the
> page_pool providers. I don't envision other providers implementing
> this. I think they'll be more interested in using the page_pool the
> way it's used today.

If the pp/net maintainers abhor it, I could try to replace it
with some "inventive" solution, which most likely would need
referencing all io_uring zcrx requests, but otherwise I'd
prefer to leave it as is.

> I would suggest that instead of making this a page_pool provider
> thing, to instead have your iouring code listen to a notification that
> a new generic notificatino that page_pool is being destroyed or an
> rx-queue is being destroyed or something like that, and doing the
> scrubbing based on that, maybe?

You can say it listens to the page pool being destroyed, which is
exactly what it's interested in. Trying to catch the destruction of an
rx-queue is the same thing, just jumping through more hops and indirectly
deriving that the page pool is being killed.
Mina Almasry Oct. 10, 2024, 5:54 p.m. UTC | #3
On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 10/9/24 22:00, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
> >>
> >> From: Pavel Begunkov <asml.silence@gmail.com>
> >>
> >> page pool is now waiting for all ppiovs to return before destroying
> >> itself, and for that to happen the memory provider might need to push
> >> some buffers, flush caches and so on.
> >>
> >> todo: we'll try to get by without it before the final release
> >>
> >
> > Is the intention to drop this todo and stick with this patch, or to
> > move ahead with this patch?
>
> Heh, I overlooked this todo. The plan is to actually leave it
> as is, it's by far the simplest way and doesn't really gets
> into anyone's way as it's a slow path.
>
> > To be honest, I think I read in a follow up patch that you want to
> > unref all the memory on page_pool_destory, which is not how the
> > page_pool is used today. Tdoay page_pool_destroy does not reclaim
> > memory. Changing that may be OK.
>
> It doesn't because it can't (not breaking anything), which is a
> problem as the page pool might never get destroyed. io_uring
> doesn't change that, a buffer can't be reclaimed while anything
> in the kernel stack holds it. It's only when it's given to the
> user we can force it back out of there.
>
> And it has to happen one way or another, we can't trust the
> user to put buffers back, it's just devmem does that by temporarily
> attaching the lifetime of such buffers to a socket.
>

(noob question) does io_uring not have a socket equivalent that you
can tie the lifetime of the buffers to? I'm thinking there must be
one, because in your patches IIRC you have the fill queues and the
memory you bind from userspace; there should be something that
tells you that userspace has exited/crashed and it's time to
destroy the fill queue and unbind the memory, right?

I'm thinking you may want to bind the lifetime of the buffers to that,
instead of the lifetime of the pool. The pool will not be destroyed
until the next driver reset/reconfiguration happens, right? That could
be long, long after userspace has stopped using the memory.
David Wei Oct. 13, 2024, 5:25 p.m. UTC | #4
On 2024-10-10 10:54, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 10/9/24 22:00, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>>>
>>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>>
>>>> page pool is now waiting for all ppiovs to return before destroying
>>>> itself, and for that to happen the memory provider might need to push
>>>> some buffers, flush caches and so on.
>>>>
>>>> todo: we'll try to get by without it before the final release
>>>>
>>>
>>> Is the intention to drop this todo and stick with this patch, or to
>>> move ahead with this patch?
>>
>> Heh, I overlooked this todo. The plan is to actually leave it
>> as is, it's by far the simplest way and doesn't really gets
>> into anyone's way as it's a slow path.
>>
>>> To be honest, I think I read in a follow up patch that you want to
>>> unref all the memory on page_pool_destory, which is not how the
>>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
>>> memory. Changing that may be OK.
>>
>> It doesn't because it can't (not breaking anything), which is a
>> problem as the page pool might never get destroyed. io_uring
>> doesn't change that, a buffer can't be reclaimed while anything
>> in the kernel stack holds it. It's only when it's given to the
>> user we can force it back out of there.

The page pool will definitely be destroyed; the call to
netdev_rx_queue_restart() with mp_ops/mp_priv set to null, together with
netdev core, will ensure that.

>>
>> And it has to happen one way or another, we can't trust the
>> user to put buffers back, it's just devmem does that by temporarily
>> attaching the lifetime of such buffers to a socket.
>>
> 
> (noob question) does io_uring not have a socket equivalent that you
> can tie the lifetime of the buffers to? I'm thinking there must be
> one, because in your patches IIRC you have the fill queues and the
> memory you bind from the userspace, there should be something that
> tells you that the userspace has exited/crashed and it's time to now
> destroy the fill queue and unbind the memory, right?
> 
> I'm thinking you may want to bind the lifetime of the buffers to that,
> instead of the lifetime of the pool. The pool will not be destroyed
> until the next driver/reset reconfiguration happens, right? That could
> be long long after the userspace has stopped using the memory.
> 

Yes, there are io_uring objects, e.g. the interface queue, that hold
everything together. IIRC page pool destroy doesn't unref pages, but it
waits for all pages that are handed out to skbs to be returned. So for
us, the below might work:

1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
   queue and tries to free the old pp
2. At this point we're guaranteed that any packets hitting this rx queue
   will not go to user pages from our memory provider
3. Assume userspace is gone (either crash or gracefully terminating),
   unref the uref for all pages, same as what scrub() is doing today
4. Any pages that are still in skb frags will get freed when the sockets
   etc are closed
5. Rely on the pp delay release to eventually terminate and clean up

Let me know what you think Pavel.
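
A rough sketch of that ordering, purely illustrative: io_zcrx_ifq, its
fields and io_zcrx_drop_urefs() are assumed names, only
netdev_rx_queue_restart() is real, and error handling plus rtnl locking
are omitted.

/* Illustrative only: one way steps 1-5 above could be sequenced from the
 * io_uring side. io_zcrx_ifq, its fields and io_zcrx_drop_urefs() are
 * invented names; netdev_rx_queue_restart() exists in the tree.
 */
#include <net/netdev_rx_queue.h>

static void io_zcrx_shutdown_ifq(struct io_zcrx_ifq *ifq)
{
	/* steps 1-2: restart the rx queue with mp_ops/mp_priv cleared so
	 * that no new packets can land in this provider's buffers
	 */
	netdev_rx_queue_restart(ifq->netdev, ifq->if_rxq_id);

	/* step 3: userspace is gone, drop the urefs it was holding */
	io_zcrx_drop_urefs(ifq);

	/* steps 4-5: buffers still sitting in skb frags come back as the
	 * sockets are closed, and the page pool's delayed release then
	 * finishes the teardown once everything has been returned
	 */
}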
Pavel Begunkov Oct. 14, 2024, 1:37 p.m. UTC | #5
On 10/13/24 18:25, David Wei wrote:
> On 2024-10-10 10:54, Mina Almasry wrote:
>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>
>>> On 10/9/24 22:00, Mina Almasry wrote:
>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>>>>
>>>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>>>
>>>>> page pool is now waiting for all ppiovs to return before destroying
>>>>> itself, and for that to happen the memory provider might need to push
>>>>> some buffers, flush caches and so on.
>>>>>
>>>>> todo: we'll try to get by without it before the final release
>>>>>
>>>>
>>>> Is the intention to drop this todo and stick with this patch, or to
>>>> move ahead with this patch?
>>>
>>> Heh, I overlooked this todo. The plan is to actually leave it
>>> as is, it's by far the simplest way and doesn't really gets
>>> into anyone's way as it's a slow path.
>>>
>>>> To be honest, I think I read in a follow up patch that you want to
>>>> unref all the memory on page_pool_destory, which is not how the
>>>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
>>>> memory. Changing that may be OK.
>>>
>>> It doesn't because it can't (not breaking anything), which is a
>>> problem as the page pool might never get destroyed. io_uring
>>> doesn't change that, a buffer can't be reclaimed while anything
>>> in the kernel stack holds it. It's only when it's given to the
>>> user we can force it back out of there.
> 
> The page pool will definitely be destroyed, the call to
> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
> core will ensure that.
> 
>>>
>>> And it has to happen one way or another, we can't trust the
>>> user to put buffers back, it's just devmem does that by temporarily
>>> attaching the lifetime of such buffers to a socket.
>>>
>>
>> (noob question) does io_uring not have a socket equivalent that you
>> can tie the lifetime of the buffers to? I'm thinking there must be
>> one, because in your patches IIRC you have the fill queues and the
>> memory you bind from the userspace, there should be something that
>> tells you that the userspace has exited/crashed and it's time to now
>> destroy the fill queue and unbind the memory, right?
>>
>> I'm thinking you may want to bind the lifetime of the buffers to that,
>> instead of the lifetime of the pool. The pool will not be destroyed
>> until the next driver/reset reconfiguration happens, right? That could
>> be long long after the userspace has stopped using the memory.
>>
> 
> Yes, there are io_uring objects e.g. interface queue that hold
> everything together. IIRC page pool destroy doesn't unref but it waits
> for all pages that are handed out to skbs to be returned. So for us,
> below might work:
> 
> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
>     queue and tries to free the old pp
> 2. At this point we're guaranteed that any packets hitting this rx queue
>     will not go to user pages from our memory provider
> 3. Assume userspace is gone (either crash or gracefully terminating),
>     unref the uref for all pages, same as what scrub() is doing today
> 4. Any pages that are still in skb frags will get freed when the sockets
>     etc are closed
> 5. Rely on the pp delay release to eventually terminate and clean up
> 
> Let me know what you think Pavel.

I'll get to this comment a bit later when I get some time to
remember what races we have to deal with without the callback.
Mina Almasry Oct. 14, 2024, 10:58 p.m. UTC | #6
On Sun, Oct 13, 2024 at 8:25 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2024-10-10 10:54, Mina Almasry wrote:
> > On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> On 10/9/24 22:00, Mina Almasry wrote:
> >>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
> >>>>
> >>>> From: Pavel Begunkov <asml.silence@gmail.com>
> >>>>
> >>>> page pool is now waiting for all ppiovs to return before destroying
> >>>> itself, and for that to happen the memory provider might need to push
> >>>> some buffers, flush caches and so on.
> >>>>
> >>>> todo: we'll try to get by without it before the final release
> >>>>
> >>>
> >>> Is the intention to drop this todo and stick with this patch, or to
> >>> move ahead with this patch?
> >>
> >> Heh, I overlooked this todo. The plan is to actually leave it
> >> as is, it's by far the simplest way and doesn't really gets
> >> into anyone's way as it's a slow path.
> >>
> >>> To be honest, I think I read in a follow up patch that you want to
> >>> unref all the memory on page_pool_destory, which is not how the
> >>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
> >>> memory. Changing that may be OK.
> >>
> >> It doesn't because it can't (not breaking anything), which is a
> >> problem as the page pool might never get destroyed. io_uring
> >> doesn't change that, a buffer can't be reclaimed while anything
> >> in the kernel stack holds it. It's only when it's given to the
> >> user we can force it back out of there.
>
> The page pool will definitely be destroyed, the call to
> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
> core will ensure that.
>
> >>
> >> And it has to happen one way or another, we can't trust the
> >> user to put buffers back, it's just devmem does that by temporarily
> >> attaching the lifetime of such buffers to a socket.
> >>
> >
> > (noob question) does io_uring not have a socket equivalent that you
> > can tie the lifetime of the buffers to? I'm thinking there must be
> > one, because in your patches IIRC you have the fill queues and the
> > memory you bind from the userspace, there should be something that
> > tells you that the userspace has exited/crashed and it's time to now
> > destroy the fill queue and unbind the memory, right?
> >
> > I'm thinking you may want to bind the lifetime of the buffers to that,
> > instead of the lifetime of the pool. The pool will not be destroyed
> > until the next driver/reset reconfiguration happens, right? That could
> > be long long after the userspace has stopped using the memory.
> >
>
> Yes, there are io_uring objects e.g. interface queue that hold
> everything together. IIRC page pool destroy doesn't unref but it waits
> for all pages that are handed out to skbs to be returned. So for us,
> below might work:
>
> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
>    queue and tries to free the old pp
> 2. At this point we're guaranteed that any packets hitting this rx queue
>    will not go to user pages from our memory provider
> 3. Assume userspace is gone (either crash or gracefully terminating),
>    unref the uref for all pages, same as what scrub() is doing today
> 4. Any pages that are still in skb frags will get freed when the sockets
>    etc are closed
> 5. Rely on the pp delay release to eventually terminate and clean up
>
> Let me know what you think Pavel.

Something roughly along those lines sounds more reasonable to me.

The critical point is, as I said above: if you free the memory only
when the pp is destroyed, then the memory lives on from one io_uring ZC
instance to the next. The next instance will see a reduced address
space because the previously destroyed io_uring ZC connection did not
free the memory. You could have users in production opening thousands
of io_uring ZC connections between rxq resets, and not cleaning up
those connections. In that case I think eventually they'll run out of
memory as the memory leaks until it's cleaned up with a pp destroy
(driver reset?).
Pavel Begunkov Oct. 16, 2024, 5:42 p.m. UTC | #7
On 10/14/24 23:58, Mina Almasry wrote:
> On Sun, Oct 13, 2024 at 8:25 PM David Wei <dw@davidwei.uk> wrote:
>>
>> On 2024-10-10 10:54, Mina Almasry wrote:
>>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>
>>>> On 10/9/24 22:00, Mina Almasry wrote:
>>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>>>>>
>>>>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>>>>
>>>>>> page pool is now waiting for all ppiovs to return before destroying
>>>>>> itself, and for that to happen the memory provider might need to push
>>>>>> some buffers, flush caches and so on.
>>>>>>
>>>>>> todo: we'll try to get by without it before the final release
>>>>>>
>>>>>
>>>>> Is the intention to drop this todo and stick with this patch, or to
>>>>> move ahead with this patch?
>>>>
>>>> Heh, I overlooked this todo. The plan is to actually leave it
>>>> as is, it's by far the simplest way and doesn't really gets
>>>> into anyone's way as it's a slow path.
>>>>
>>>>> To be honest, I think I read in a follow up patch that you want to
>>>>> unref all the memory on page_pool_destory, which is not how the
>>>>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
>>>>> memory. Changing that may be OK.
>>>>
>>>> It doesn't because it can't (not breaking anything), which is a
>>>> problem as the page pool might never get destroyed. io_uring
>>>> doesn't change that, a buffer can't be reclaimed while anything
>>>> in the kernel stack holds it. It's only when it's given to the
>>>> user we can force it back out of there.
>>
>> The page pool will definitely be destroyed, the call to
>> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
>> core will ensure that.
>>
>>>>
>>>> And it has to happen one way or another, we can't trust the
>>>> user to put buffers back, it's just devmem does that by temporarily
>>>> attaching the lifetime of such buffers to a socket.
>>>>
>>>
>>> (noob question) does io_uring not have a socket equivalent that you
>>> can tie the lifetime of the buffers to? I'm thinking there must be

You can say it is bound to io_uring / io_uring's object
representing the queue.

>>> one, because in your patches IIRC you have the fill queues and the
>>> memory you bind from the userspace, there should be something that
>>> tells you that the userspace has exited/crashed and it's time to now
>>> destroy the fill queue and unbind the memory, right?
>>>
>>> I'm thinking you may want to bind the lifetime of the buffers to that,
>>> instead of the lifetime of the pool. The pool will not be destroyed
>>> until the next driver/reset reconfiguration happens, right? That could
>>> be long long after the userspace has stopped using the memory.

io_uring will reset the queue if it dies or is requested to release
the queue.

>> Yes, there are io_uring objects e.g. interface queue that hold
>> everything together. IIRC page pool destroy doesn't unref but it waits
>> for all pages that are handed out to skbs to be returned. So for us,
>> below might work:
>>
>> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
>>     queue and tries to free the old pp
>> 2. At this point we're guaranteed that any packets hitting this rx queue
>>     will not go to user pages from our memory provider

It's reasonable to assume that the driver will start destroying
the page pool, but I wouldn't rely on it when it comes to
kernel state correctness, i.e. not crashing the kernel. It's a bit
fragile; drivers always tend to do all kinds of interesting stuff, and
I'd rather deal with a loud io_uring / page pool leak in case of
some weirdness. And that means we can't really guarantee the above
and need to care about not racing with allocations.

>> 3. Assume userspace is gone (either crash or gracefully terminating),
>>     unref the uref for all pages, same as what scrub() is doing today
>> 4. Any pages that are still in skb frags will get freed when the sockets
>>     etc are closed

And we need to prevent requests from receiving netmems that have
already been pushed to sockets.

>> 5. Rely on the pp delay release to eventually terminate and clean up
>>
>> Let me know what you think Pavel.

I think it's reasonable to leave it as is for now; I don't believe
anyone cares much about a simple slow-path, memory-provider-only
callback. And we can always kill it later on if we find a good way
to synchronise the pieces, which will be more apparent when we add some
more registration dynamism on top, when/if this patchset is merged.

In short, let's resend the series with the callback, see if the
maintainers have a strong opinion, and otherwise I'd say it
should be fine as is.

> Something roughly along those lines sounds more reasonable to me.
> 
> The critical point is as I said above, if you free the memory only
> when the pp is destroyed, then the memory lives from 1 io_uring ZC
> instance to the next. The next instance will see a reduced address
> space because the previously destroyed io_uring ZC connection did not
> free the memory. You could have users in production opening thousands
> of io_uring ZC connections between rxq resets, and not cleaning up
> those connections. In that case I think eventually they'll run out of
> memory as the memory leaks until it's cleaned up with a pp destroy
> (driver reset?).

Not sure what giving memory from one io_uring zc instance to
another means. And it's perfectly valid to receive a buffer, close
the socket and only use the data afterwards; it logically belongs to
the user, not the socket. It's only bound to the io_uring zcrx/queue
object for cleanup purposes in case io_uring goes down; it's different
from devmem TCP.
Mina Almasry Nov. 1, 2024, 5:18 p.m. UTC | #8
On Wed, Oct 16, 2024 at 10:42 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 10/14/24 23:58, Mina Almasry wrote:
> > On Sun, Oct 13, 2024 at 8:25 PM David Wei <dw@davidwei.uk> wrote:
> >>
> >> On 2024-10-10 10:54, Mina Almasry wrote:
> >>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>>>
> >>>> On 10/9/24 22:00, Mina Almasry wrote:
> >>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
> >>>>>>
> >>>>>> From: Pavel Begunkov <asml.silence@gmail.com>
> >>>>>>
> >>>>>> page pool is now waiting for all ppiovs to return before destroying
> >>>>>> itself, and for that to happen the memory provider might need to push
> >>>>>> some buffers, flush caches and so on.
> >>>>>>
> >>>>>> todo: we'll try to get by without it before the final release
> >>>>>>
> >>>>>
> >>>>> Is the intention to drop this todo and stick with this patch, or to
> >>>>> move ahead with this patch?
> >>>>
> >>>> Heh, I overlooked this todo. The plan is to actually leave it
> >>>> as is, it's by far the simplest way and doesn't really gets
> >>>> into anyone's way as it's a slow path.
> >>>>
> >>>>> To be honest, I think I read in a follow up patch that you want to
> >>>>> unref all the memory on page_pool_destory, which is not how the
> >>>>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
> >>>>> memory. Changing that may be OK.
> >>>>
> >>>> It doesn't because it can't (not breaking anything), which is a
> >>>> problem as the page pool might never get destroyed. io_uring
> >>>> doesn't change that, a buffer can't be reclaimed while anything
> >>>> in the kernel stack holds it. It's only when it's given to the
> >>>> user we can force it back out of there.
> >>
> >> The page pool will definitely be destroyed, the call to
> >> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
> >> core will ensure that.
> >>
> >>>>
> >>>> And it has to happen one way or another, we can't trust the
> >>>> user to put buffers back, it's just devmem does that by temporarily
> >>>> attaching the lifetime of such buffers to a socket.
> >>>>
> >>>
> >>> (noob question) does io_uring not have a socket equivalent that you
> >>> can tie the lifetime of the buffers to? I'm thinking there must be
>
> You can say it is bound to io_uring / io_uring's object
> representing the queue.
>
> >>> one, because in your patches IIRC you have the fill queues and the
> >>> memory you bind from the userspace, there should be something that
> >>> tells you that the userspace has exited/crashed and it's time to now
> >>> destroy the fill queue and unbind the memory, right?
> >>>
> >>> I'm thinking you may want to bind the lifetime of the buffers to that,
> >>> instead of the lifetime of the pool. The pool will not be destroyed
> >>> until the next driver/reset reconfiguration happens, right? That could
> >>> be long long after the userspace has stopped using the memory.
>
> io_uring will reset the queue if it dies / requested to release
> the queue.
>
> >> Yes, there are io_uring objects e.g. interface queue that hold
> >> everything together. IIRC page pool destroy doesn't unref but it waits
> >> for all pages that are handed out to skbs to be returned. So for us,
> >> below might work:
> >>
> >> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
> >>     queue and tries to free the old pp
> >> 2. At this point we're guaranteed that any packets hitting this rx queue
> >>     will not go to user pages from our memory provider
>
> It's reasonable to assume that the driver will start destroying
> the page pool, but I wouldn't rely on it when it comes to the
> kernel state correctness, i.e. not crashing the kernel. It's a bit
> fragile, drivers always tend to do all kinds of interesting stuff,
> I'd rather deal with a loud io_uring / page pool leak in case of
> some weirdness. And that means we can't really guarantee the above
> and need to care about not racing with allocations.
>
> >> 3. Assume userspace is gone (either crash or gracefully terminating),
> >>     unref the uref for all pages, same as what scrub() is doing today
> >> 4. Any pages that are still in skb frags will get freed when the sockets
> >>     etc are closed
>
> And we need to prevent from requests receiving netmem that are
> already pushed to sockets.
>
> >> 5. Rely on the pp delay release to eventually terminate and clean up
> >>
> >> Let me know what you think Pavel.
>
> I think it's reasonable to leave it as is for now, I don't believe
> anyone cares much about a simple slow path memory provider-only
> callback. And we can always kill it later on if we find a good way
> to synchronise pieces, which will be more apparent when we add some
> more registration dynamism on top, when/if this patchset is merged.
>
> In short, let's resend the series with the callback, see if
> maintainers have a strong opinion, and otherwise I'd say it
> should be fine as is.
>
> > Something roughly along those lines sounds more reasonable to me.
> >
> > The critical point is as I said above, if you free the memory only
> > when the pp is destroyed, then the memory lives from 1 io_uring ZC
> > instance to the next. The next instance will see a reduced address
> > space because the previously destroyed io_uring ZC connection did not
> > free the memory. You could have users in production opening thousands
> > of io_uring ZC connections between rxq resets, and not cleaning up
> > those connections. In that case I think eventually they'll run out of
> > memory as the memory leaks until it's cleaned up with a pp destroy
> > (driver reset?).
>
> Not sure what giving memory from one io_uring zc instance to
> another means. And it's perfectly valid to receive a buffer, close
> the socket and only after use the data, it logically belongs to
> the user, not the socket. It's only bound to io_uring zcrx/queue
> object for clean up purposes if io_uring goes down, it's different
> from devmem TCP.
>

(responding here because I'm looking at the latest iteration after
vacation, but the discussion is here)

Huh, interesting. For devmem TCP we bind a region of memory to the
queue once, and after that we can create N connections all reusing the
same memory region. Is that not the case for io_uring? There are no
docs or selftests with the series showing sample code using this, but
the cover letter mentions that RSS + flow steering needs to be
configured for io ZC to work. The configuration of flow steering
implies that the user is responsible for initiating the connection. If
the user is initiating 1 connection then they can initiate many
without reconfiguring the memory binding, right?

When the user initiates the second connection, any pages not cleaned
up from the previous connection (because we're waiting for the scrub
callback to be hit) will be occupied when they should not be, right?

--
Thanks,
Mina
Pavel Begunkov Nov. 1, 2024, 6:35 p.m. UTC | #9
On 11/1/24 17:18, Mina Almasry wrote:
> On Wed, Oct 16, 2024 at 10:42 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
...
>>> The critical point is as I said above, if you free the memory only
>>> when the pp is destroyed, then the memory lives from 1 io_uring ZC
>>> instance to the next. The next instance will see a reduced address
>>> space because the previously destroyed io_uring ZC connection did not
>>> free the memory. You could have users in production opening thousands
>>> of io_uring ZC connections between rxq resets, and not cleaning up
>>> those connections. In that case I think eventually they'll run out of
>>> memory as the memory leaks until it's cleaned up with a pp destroy
>>> (driver reset?).
>>
>> Not sure what giving memory from one io_uring zc instance to
>> another means. And it's perfectly valid to receive a buffer, close
>> the socket and only after use the data, it logically belongs to
>> the user, not the socket. It's only bound to io_uring zcrx/queue
>> object for clean up purposes if io_uring goes down, it's different
>> from devmem TCP.
>>
> 
> (responding here because I'm looking at the latest iteration after
> vacation, but the discussion is here)
> 
> Huh, interesting. For devmem TCP we bind a region of memory to the
> queue once, and after that we can create N connections all reusing the
> same memory region. Is that not the case for io_uring? There are no

Hmm, I think we already discussed the same question before. Yes, it
does indeed support an arbitrary number of connections. For what I was
saying above, the devmem TCP analogy would be attaching buffers to the
netlink socket instead of a tcp socket (that new xarray you added) when
you give it to user space. Then, you can close the connection after a
receive and the buffer you've got would still be alive.

That's pretty intuitive as well: with normal receives, the kernel
doesn't nuke the buffer you got data into from recv(2) just
because the connection got closed.
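
To make the analogy concrete, a trivial plain-sockets sketch (nothing
zcrx-specific in it):

#include <sys/socket.h>
#include <unistd.h>

static ssize_t recv_then_close(int fd, char *buf, size_t len)
{
	ssize_t n = recv(fd, buf, len, 0);

	close(fd);
	/* buf[0..n) is still perfectly valid here: the data belongs to
	 * the caller, not to the (now closed) socket
	 */
	return n;
}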

> docs or selftest with the series to show sample code using this, but

There should be a good bunch of tests in liburing if you follow the
links in the cover letter, as well as added support in some
benchmark tools, kperf and netbench. Also, as mentioned, we need to
add a simpler example to liburing; not sure why it was removed.
There will also be man pages; that's better done after
merging it since things could change.


> the cover letter mentions that RSS + flow steering needs to be
> configured for io ZC to work. The configuration of flow steering
> implies that the user is responsible for initiating the connection. If
> the user is initiating 1 connection then they can initiate many
> without reconfiguring the memory binding, right?

Right

> When the user initiates the second connection, any pages not cleaned
> up from the previous connection (because we're waiting for the scrub
> callback to be hit), will be occupied when they should not be, right?

I'm not sure what you mean, but it seems like the question comes from
the assumption that it supports only one connection at a time,
which is not the case.
Mina Almasry Nov. 1, 2024, 7:24 p.m. UTC | #10
On Fri, Nov 1, 2024 at 11:34 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 11/1/24 17:18, Mina Almasry wrote:
> > On Wed, Oct 16, 2024 at 10:42 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> ...
> >>> The critical point is as I said above, if you free the memory only
> >>> when the pp is destroyed, then the memory lives from 1 io_uring ZC
> >>> instance to the next. The next instance will see a reduced address
> >>> space because the previously destroyed io_uring ZC connection did not
> >>> free the memory. You could have users in production opening thousands
> >>> of io_uring ZC connections between rxq resets, and not cleaning up
> >>> those connections. In that case I think eventually they'll run out of
> >>> memory as the memory leaks until it's cleaned up with a pp destroy
> >>> (driver reset?).
> >>
> >> Not sure what giving memory from one io_uring zc instance to
> >> another means. And it's perfectly valid to receive a buffer, close
> >> the socket and only after use the data, it logically belongs to
> >> the user, not the socket. It's only bound to io_uring zcrx/queue
> >> object for clean up purposes if io_uring goes down, it's different
> >> from devmem TCP.
> >>
> >
> > (responding here because I'm looking at the latest iteration after
> > vacation, but the discussion is here)
> >
> > Huh, interesting. For devmem TCP we bind a region of memory to the
> > queue once, and after that we can create N connections all reusing the
> > same memory region. Is that not the case for io_uring? There are no
>
> Hmm, I think we already discussed the same question before. Yes, it
> does indeed support arbitrary number of connections. For what I was
> saying above, the devmem TCP analogy would be attaching buffers to the
> netlink socket instead of a tcp socket (that new xarray you added) when
> you give it to user space. Then, you can close the connection after a
> receive and the buffer you've got would still be alive.
>

Ah, I see. You're making a tradeoff here. You leave the buffers alive
after each connection so that userspace can still use them if it wishes,
but they are of course unavailable for other connections.

But in our case (and I'm guessing yours) the process that will set up
the io_uring memory provider/RSS/flow steering will be a different
process from the one that sends/receives data, no? Because the former
requires CAP_NET_ADMIN privileges while the latter will not. If they
are 2 different processes, what happens when the latter process doing
the send/receive crashes? Does the memory stay unavailable until the
CAP_NET_ADMIN process exits? Wouldn't it be better to tie the lifetime
of the buffers to the connection? Sure, the buffers will become
unavailable after the connection is closed, but at least you don't
'leak' memory on send/receive process crashes.

Unless of course you're saying that only CAP_NET_ADMIN processes will
run io_uring zcrx connections. Then they can do their own mp setup/RSS/
flow steering and there is no concern when the process crashes because
everything will be cleaned up. But that's a big limitation to put on
the usage of the feature, no?
Pavel Begunkov Nov. 1, 2024, 9:38 p.m. UTC | #11
On 11/1/24 19:24, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 11:34 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
...
>>> Huh, interesting. For devmem TCP we bind a region of memory to the
>>> queue once, and after that we can create N connections all reusing the
>>> same memory region. Is that not the case for io_uring? There are no
>>
>> Hmm, I think we already discussed the same question before. Yes, it
>> does indeed support arbitrary number of connections. For what I was
>> saying above, the devmem TCP analogy would be attaching buffers to the
>> netlink socket instead of a tcp socket (that new xarray you added) when
>> you give it to user space. Then, you can close the connection after a
>> receive and the buffer you've got would still be alive.
>>
> 
> Ah, I see. You're making a tradeoff here. You leave the buffers alive
> after each connection so the userspace can still use them if it wishes
> but they are of course unavailable for other connections.
> 
> But in our case (and I'm guessing yours) the process that will set up
> the io_uring memory provider/RSS/flow steering will be a different
> process from the one that sends/receive data, no? Because the former
> requires CAP_NET_ADMIN privileges while the latter will not. If they
> are 2 different processes, what happens when the latter process doing
> the send/receive crashes? Does the memory stay unavailable until the
> CAP_NET_ADMIN process exits? Wouldn't it be better to tie the lifetime
> of the buffers of the connection? Sure, the buffers will become

That's the tradeoff google is willing to make in the framework,
which is fine, but it's not without cost, e.g. you need to
store into / erase from the xarray, and it's a design choice in other
aspects, like you can't release the page pool if the socket you
got a buffer from is still alive but the net_iov hasn't been
returned.
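
Roughly the pattern in question, as a generic sketch (not the actual
devmem TCP code, just the xarray API it pays for; the function and
parameter names are made up):

#include <linux/xarray.h>
#include <net/netmem.h>

/* store: an allocation + xarray insertion for every buffer handed out */
static int hand_out_niov(struct xarray *user_frags, struct net_iov *niov,
			 u32 *token)
{
	return xa_alloc(user_frags, token, niov, xa_limit_32b, GFP_KERNEL);
}

/* erase: the matching lookup + removal when the user returns the buffer */
static struct net_iov *take_back_niov(struct xarray *user_frags, u32 token)
{
	return xa_erase(user_frags, token);
}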

> unavailable after the connection is closed, but at least you don't
> 'leak' memory on send/receive process crashes.
> 
> Unless of course you're saying that only CAP_NET_ADMIN processes will

The user can pass the io_uring instance itself.
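
For illustration, passing an io_uring instance between processes is plain
fd passing; a minimal SCM_RIGHTS sketch (generic unix-socket code, nothing
zcrx-specific, and it assumes the ring wasn't set up as registered-fd-only):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* The privileged setup process hands the io_uring fd to an unprivileged
 * worker over an already-connected unix socket.
 */
static int send_ring_fd(int unix_sock, int ring_fd)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u = { 0 };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &ring_fd, sizeof(int));

	return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}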

> run io_rcrx connections. Then they can do their own mp setup/RSS/flow
> steering and there is no concern when the process crashes because
> everything will be cleaned up. But that's a big limitation to put on
> the usage of the feature no?
Mina Almasry Nov. 4, 2024, 8:42 p.m. UTC | #12
On Fri, Nov 1, 2024 at 2:38 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 11/1/24 19:24, Mina Almasry wrote:
> > On Fri, Nov 1, 2024 at 11:34 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> ...
> >>> Huh, interesting. For devmem TCP we bind a region of memory to the
> >>> queue once, and after that we can create N connections all reusing the
> >>> same memory region. Is that not the case for io_uring? There are no
> >>
> >> Hmm, I think we already discussed the same question before. Yes, it
> >> does indeed support arbitrary number of connections. For what I was
> >> saying above, the devmem TCP analogy would be attaching buffers to the
> >> netlink socket instead of a tcp socket (that new xarray you added) when
> >> you give it to user space. Then, you can close the connection after a
> >> receive and the buffer you've got would still be alive.
> >>
> >
> > Ah, I see. You're making a tradeoff here. You leave the buffers alive
> > after each connection so the userspace can still use them if it wishes
> > but they are of course unavailable for other connections.
> >
> > But in our case (and I'm guessing yours) the process that will set up
> > the io_uring memory provider/RSS/flow steering will be a different
> > process from the one that sends/receive data, no? Because the former
> > requires CAP_NET_ADMIN privileges while the latter will not. If they
> > are 2 different processes, what happens when the latter process doing
> > the send/receive crashes? Does the memory stay unavailable until the
> > CAP_NET_ADMIN process exits? Wouldn't it be better to tie the lifetime
> > of the buffers of the connection? Sure, the buffers will become
>
> That's the tradeoff google is willing to do in the framework,
> which is fine, but it's not without cost, e.g. you need to
> store/erase into the xarray, and it's a design choice in other
> aspects, like you can't release the page pool if the socket you
> got a buffer from is still alive but the net_iov hasn't been
> returned.
>
> > unavailable after the connection is closed, but at least you don't
> > 'leak' memory on send/receive process crashes.
> >
> > Unless of course you're saying that only CAP_NET_ADMIN processes will
>
> The user can pass io_uring instance itself
>

Thanks, but sorry, my point still stands. If the CAP_NET_ADMIN process
passes the io_uring instance to the process doing the send/receive, and
the latter process then crashes, do the io_uring netmems leak until the
page_pool is destroyed? Or are they cleaned up because the io_uring
instance is destroyed with the process crashing, and io_uring will
destroy the page_pool on exit?
Pavel Begunkov Nov. 4, 2024, 9:27 p.m. UTC | #13
On 11/4/24 20:42, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 2:38 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 11/1/24 19:24, Mina Almasry wrote:
>>> On Fri, Nov 1, 2024 at 11:34 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> ...
>>>>> Huh, interesting. For devmem TCP we bind a region of memory to the
>>>>> queue once, and after that we can create N connections all reusing the
>>>>> same memory region. Is that not the case for io_uring? There are no
>>>>
>>>> Hmm, I think we already discussed the same question before. Yes, it
>>>> does indeed support arbitrary number of connections. For what I was
>>>> saying above, the devmem TCP analogy would be attaching buffers to the
>>>> netlink socket instead of a tcp socket (that new xarray you added) when
>>>> you give it to user space. Then, you can close the connection after a
>>>> receive and the buffer you've got would still be alive.
>>>>
>>>
>>> Ah, I see. You're making a tradeoff here. You leave the buffers alive
>>> after each connection so the userspace can still use them if it wishes
>>> but they are of course unavailable for other connections.
>>>
>>> But in our case (and I'm guessing yours) the process that will set up
>>> the io_uring memory provider/RSS/flow steering will be a different
>>> process from the one that sends/receive data, no? Because the former
>>> requires CAP_NET_ADMIN privileges while the latter will not. If they
>>> are 2 different processes, what happens when the latter process doing
>>> the send/receive crashes? Does the memory stay unavailable until the
>>> CAP_NET_ADMIN process exits? Wouldn't it be better to tie the lifetime
>>> of the buffers of the connection? Sure, the buffers will become
>>
>> That's the tradeoff google is willing to do in the framework,
>> which is fine, but it's not without cost, e.g. you need to
>> store/erase into the xarray, and it's a design choice in other
>> aspects, like you can't release the page pool if the socket you
>> got a buffer from is still alive but the net_iov hasn't been
>> returned.
>>
>>> unavailable after the connection is closed, but at least you don't
>>> 'leak' memory on send/receive process crashes.
>>>
>>> Unless of course you're saying that only CAP_NET_ADMIN processes will
>>
>> The user can pass io_uring instance itself
>>
> 
> Thanks, but sorry, my point still stands. If the CAP_NET_ADMIN passes
> the io_uring instance to the process doing the send/receive, then the
> latter process crashes, do the io_uring netmems leak until the
> page_pool is destroyed? Or are they cleaned up because the io_uring
> instance is destroyed with the process crashing, and the io_uring will
> destroy the page_pool on exit?

It'll be killed when io_uring dies / is closed.

Patch

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 8a35fe474adb..fd0376ad0d26 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -157,6 +157,7 @@  struct memory_provider_ops {
 	bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
 	int (*init)(struct page_pool *pool);
 	void (*destroy)(struct page_pool *pool);
+	void (*scrub)(struct page_pool *pool);
 };
 
 struct pp_memory_provider_params {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index c21c5b9edc68..9a675e16e6a4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1038,6 +1038,9 @@  static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 
 static void page_pool_scrub(struct page_pool *pool)
 {
+	if (pool->mp_ops && pool->mp_ops->scrub)
+		pool->mp_ops->scrub(pool);
+
 	page_pool_empty_alloc_cache_once(pool);
 	pool->destroy_cnt++;
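
For context, a minimal sketch of what a provider-side ->scrub() could look
like against the ops above; every my_* name is invented for illustration,
and only the ops layout and pool->mp_priv come from the series.

#include <net/netmem.h>
#include <net/page_pool/types.h>

/* Illustrative ->scrub(): flush provider-private caches and force out
 * buffers still held on behalf of userspace, so the page pool's
 * wait-for-ppiovs teardown can complete. my_mp_area and the my_mp_*
 * helpers are invented.
 */
static void my_mp_scrub(struct page_pool *pool)
{
	struct my_mp_area *area = pool->mp_priv;
	struct net_iov *niov;

	/* return everything sitting in provider-private caches */
	while ((niov = my_mp_cache_pop(area)))
		my_mp_return_niov(area, niov);

	/* userspace is gone: drop the references it still held */
	my_mp_drop_user_refs(area);
}

static const struct memory_provider_ops my_mp_ops = {
	.init		= my_mp_init,
	.destroy	= my_mp_destroy,
	.release_netmem	= my_mp_release_netmem,
	.scrub		= my_mp_scrub,	/* called from page_pool_scrub() */
};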