Message ID | 20210409223801.104657-1-mcroce@linux.microsoft.com (mailing list archive) |
---|---|
Headers | show |
Series | page_pool: recycle buffers | expand |
On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > This is needed by the page_pool to avoid recycling a page not allocated > via page_pool. Is the PageType mechanism more appropriate to your needs? It wouldn't be if you use page->_mapcount (ie mapping it to userspace). > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > --- > include/linux/mm_types.h | 1 + > include/net/page_pool.h | 2 ++ > net/core/page_pool.c | 4 ++++ > 3 files changed, 7 insertions(+) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 6613b26a8894..ef2d0d5f62e4 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -101,6 +101,7 @@ struct page { > * 32-bit architectures. > */ > dma_addr_t dma_addr; > + unsigned long signature; > }; > struct { /* slab, slob and slub */ > union { > diff --git a/include/net/page_pool.h b/include/net/page_pool.h > index b5b195305346..b30405e84b5e 100644 > --- a/include/net/page_pool.h > +++ b/include/net/page_pool.h > @@ -63,6 +63,8 @@ > */ > #define PP_ALLOC_CACHE_SIZE 128 > #define PP_ALLOC_CACHE_REFILL 64 > +#define PP_SIGNATURE 0x20210303 > + > struct pp_alloc_cache { > u32 count; > void *cache[PP_ALLOC_CACHE_SIZE]; > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index ad8b0707af04..2ae9b554ef98 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, > page_pool_dma_sync_for_device(pool, page, pool->p.max_len); > > skip_dma_map: > + page->signature = PP_SIGNATURE; > + > /* Track how many pages are held 'in-flight' */ > pool->pages_state_hold_cnt++; > > @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) > DMA_ATTR_SKIP_CPU_SYNC); > page->dma_addr = 0; > skip_dma_unmap: > + page->signature = 0; > + > /* This may be the last page returned, releasing the pool, so > * it is not safe to reference pool afterwards. > */ > -- > 2.30.2 >
Hi Matthew On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > This is needed by the page_pool to avoid recycling a page not allocated > > via page_pool. > > Is the PageType mechanism more appropriate to your needs? It wouldn't > be if you use page->_mapcount (ie mapping it to userspace). Interesting! Please keep in mind this was written ~2018 and was stale on my branches for quite some time. So back then I did try to use PageType, but had not free bits. Looking at it again though, it's cleaned up. So yes I think this can be much much cleaner. Should we go and define a new PG_pagepool? Thanks! /Ilias > > > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > > --- > > include/linux/mm_types.h | 1 + > > include/net/page_pool.h | 2 ++ > > net/core/page_pool.c | 4 ++++ > > 3 files changed, 7 insertions(+) > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 6613b26a8894..ef2d0d5f62e4 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -101,6 +101,7 @@ struct page { > > * 32-bit architectures. > > */ > > dma_addr_t dma_addr; > > + unsigned long signature; > > }; > > struct { /* slab, slob and slub */ > > union { > > diff --git a/include/net/page_pool.h b/include/net/page_pool.h > > index b5b195305346..b30405e84b5e 100644 > > --- a/include/net/page_pool.h > > +++ b/include/net/page_pool.h > > @@ -63,6 +63,8 @@ > > */ > > #define PP_ALLOC_CACHE_SIZE 128 > > #define PP_ALLOC_CACHE_REFILL 64 > > +#define PP_SIGNATURE 0x20210303 > > + > > struct pp_alloc_cache { > > u32 count; > > void *cache[PP_ALLOC_CACHE_SIZE]; > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > > index ad8b0707af04..2ae9b554ef98 100644 > > --- a/net/core/page_pool.c > > +++ b/net/core/page_pool.c > > @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, > > page_pool_dma_sync_for_device(pool, page, pool->p.max_len); > > > > skip_dma_map: > > + page->signature = PP_SIGNATURE; > > + > > /* Track how many pages are held 'in-flight' */ > > pool->pages_state_hold_cnt++; > > > > @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) > > DMA_ATTR_SKIP_CPU_SYNC); > > page->dma_addr = 0; > > skip_dma_unmap: > > + page->signature = 0; > > + > > /* This may be the last page returned, releasing the pool, so > > * it is not safe to reference pool afterwards. > > */ > > -- > > 2.30.2 > >
On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > > Hi Matthew > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > > This is needed by the page_pool to avoid recycling a page not allocated > > > via page_pool. > > > > Is the PageType mechanism more appropriate to your needs? It wouldn't > > be if you use page->_mapcount (ie mapping it to userspace). > > Interesting! > Please keep in mind this was written ~2018 and was stale on my branches for > quite some time. So back then I did try to use PageType, but had not free > bits. Looking at it again though, it's cleaned up. So yes I think this can > be much much cleaner. Should we go and define a new PG_pagepool? > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType can not be used. There is a recent discussion [1] on memcg accounting of TCP RX zerocopy and I am wondering if this work can somehow help in that regard. I will take a look at the series. [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/
Hi Shakeel, On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote: > On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas > <ilias.apalodimas@linaro.org> wrote: > > > > Hi Matthew > > > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > > > This is needed by the page_pool to avoid recycling a page not allocated > > > > via page_pool. > > > > > > Is the PageType mechanism more appropriate to your needs? It wouldn't > > > be if you use page->_mapcount (ie mapping it to userspace). > > > > Interesting! > > Please keep in mind this was written ~2018 and was stale on my branches for > > quite some time. So back then I did try to use PageType, but had not free > > bits. Looking at it again though, it's cleaned up. So yes I think this can > > be much much cleaner. Should we go and define a new PG_pagepool? > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > can not be used. Yes it can, since it's going to be used as your default allocator for payloads, which might end up on an SKB. So we have to keep the extra added field on struct page for our mark. Matthew had an intersting idea. He suggested keeping it, but changing the magic number, so it can't be a kernel address, but I'll let him follow up on the details. > > There is a recent discussion [1] on memcg accounting of TCP RX > zerocopy and I am wondering if this work can somehow help in that > regard. I will take a look at the series. > I'll try having a look on this as well. The idea behind the patchset is to allow lower speed NICs that use the API already, gain recycling 'easily'. Using page_pool for the driver comes with a penalty to begin with. Allocating pages instead of SKBs has a measurable difference. By enabling them to recycle they'll get better performance, since you skip the reallocation/remapping and only care for syncing the buffers correctly. > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/
On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote: > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > can not be used. > > Yes it can, since it's going to be used as your default allocator for > payloads, which might end up on an SKB. > So we have to keep the extra added field on struct page for our mark. > Matthew had an intersting idea. He suggested keeping it, but changing the > magic number, so it can't be a kernel address, but I'll let him follow > up on the details. Sure! So, given the misalignment problem I discovered yesterday [1], we probably want a page_pool page to look like: unsigned long flags; unsigned long pp_magic; unsigned long xmi; unsigned long _pp_mapping_pad; dma_addr_t dma_addr; /* might be one or two words */ The only real restriction here is that pp_magic should not be a valid pointer, and it must have the bottom bit clear. I'd recommend something like: #define PP_MAGIC (0x20 + POISON_POINTER_DELTA) This leaves page->mapping as NULL, so you don't have to worry about clearing it before free. [1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/
On Sat, 10 Apr 2021 20:39:55 +0100 Matthew Wilcox <willy@infradead.org> wrote: > On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote: > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > can not be used. > > > > Yes it can, since it's going to be used as your default allocator for > > payloads, which might end up on an SKB. > > So we have to keep the extra added field on struct page for our mark. > > Matthew had an intersting idea. He suggested keeping it, but changing the > > magic number, so it can't be a kernel address, but I'll let him follow > > up on the details. > > Sure! So, given the misalignment problem I discovered yesterday [1], > we probably want a page_pool page to look like: > > unsigned long flags; > unsigned long pp_magic; > unsigned long xmi; > unsigned long _pp_mapping_pad; > dma_addr_t dma_addr; /* might be one or two words */ > > The only real restriction here is that pp_magic should not be a valid > pointer, and it must have the bottom bit clear. I'd recommend something > like: > > #define PP_MAGIC (0x20 + POISON_POINTER_DELTA) > > This leaves page->mapping as NULL, so you don't have to worry about > clearing it before free. > > [1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/ I didn't see this, before asking[2] for explaining your intent. I still worry about page->index, see [2]. [2] https://lore.kernel.org/netdev/20210411114307.5087f958@carbon/
On Sat, 10 Apr 2021 21:27:31 +0300 Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote: > > > On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas > > <ilias.apalodimas@linaro.org> wrote: > > > > > > Hi Matthew > > > > > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > > > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > > > > This is needed by the page_pool to avoid recycling a page not allocated > > > > > via page_pool. > > > > > > > > Is the PageType mechanism more appropriate to your needs? It wouldn't > > > > be if you use page->_mapcount (ie mapping it to userspace). > > > > > > Interesting! > > > Please keep in mind this was written ~2018 and was stale on my branches for > > > quite some time. So back then I did try to use PageType, but had not free > > > bits. Looking at it again though, it's cleaned up. So yes I think this can > > > be much much cleaner. Should we go and define a new PG_pagepool? > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > can not be used. > > Yes it can, since it's going to be used as your default allocator for > payloads, which might end up on an SKB. I'm not sure we want or should "allow" page_pool be used for TCP RX zerocopy. For several reasons. (1) This implies mapping these pages page to userspace, which AFAIK means using page->mapping and page->index members (right?). (2) It feels wrong (security wise) to keep the DMA-mapping (for the device) and also map this page into userspace. (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX zerocopy will bump the refcnt, which means the page_pool will not recycle the page when it see the elevated refcnt (it will instead release its DMA-mapping). (4) I remember vaguely that this code path for (TCP RX zerocopy) uses page->private for tricks. And our patch [3/5] use page->private for storing xdp_mem_info. IMHO when the SKB travel into this TCP RX zerocopy code path, we should call page_pool_release_page() to release its DMA-mapping. > > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/
On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer <brouer@redhat.com> wrote: > [...] > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > can not be used. > > > > Yes it can, since it's going to be used as your default allocator for > > payloads, which might end up on an SKB. > > I'm not sure we want or should "allow" page_pool be used for TCP RX > zerocopy. > For several reasons. > > (1) This implies mapping these pages page to userspace, which AFAIK > means using page->mapping and page->index members (right?). > No, only page->_mapcount is used. > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > device) and also map this page into userspace. > I think this is already the case i.e pages still DMA-mapped and also mapped into userspace. > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > zerocopy will bump the refcnt, which means the page_pool will not > recycle the page when it see the elevated refcnt (it will instead > release its DMA-mapping). Yes this is right but the userspace might have already consumed and unmapped the page before the driver considers to recycle the page. > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > page->private for tricks. And our patch [3/5] use page->private for > storing xdp_mem_info. > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > call page_pool_release_page() to release its DMA-mapping. > I will let TCP RX zerocopy experts respond to this but from my high level code inspection, I didn't see page->private usage.
On Wed, Apr 14, 2021 at 10:09 PM Shakeel Butt <shakeelb@google.com> wrote: > > I will let TCP RX zerocopy experts respond to this but from my high > level code inspection, I didn't see page->private usage. Indeed, we do not use page->private, since we do not own the page(s).
On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote: > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > > [...] > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > can not be used. > > > > > > Yes it can, since it's going to be used as your default allocator for > > > payloads, which might end up on an SKB. > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > zerocopy. > > For several reasons. > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > means using page->mapping and page->index members (right?). > > > > No, only page->_mapcount is used. > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to adopt the recycling mechanism we should try preserving the current functionality of the network stack. The question is how does it work with the current drivers that already have an internal page recycling mechanism. > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > device) and also map this page into userspace. > > > > I think this is already the case i.e pages still DMA-mapped and also > mapped into userspace. > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > zerocopy will bump the refcnt, which means the page_pool will not > > recycle the page when it see the elevated refcnt (it will instead > > release its DMA-mapping). > > Yes this is right but the userspace might have already consumed and > unmapped the page before the driver considers to recycle the page. Same question here. I'll have a closer look in a few days and make sure we are not breaking anything wrt zerocopy. > > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > page->private for tricks. And our patch [3/5] use page->private for > > storing xdp_mem_info. > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > call page_pool_release_page() to release its DMA-mapping. > > > > I will let TCP RX zerocopy experts respond to this but from my high > level code inspection, I didn't see page->private usage. Shakeel are you aware of any 'easy' way I can have rx zerocopy running? Thanks! /Ilias
On Wed, 14 Apr 2021 13:09:47 -0700 Shakeel Butt <shakeelb@google.com> wrote: > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > > [...] > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > can not be used. > > > > > > Yes it can, since it's going to be used as your default allocator for > > > payloads, which might end up on an SKB. > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > zerocopy. > > For several reasons. > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > means using page->mapping and page->index members (right?). > > > > No, only page->_mapcount is used. Good to know. I will admit that I don't fully understand the usage of page->mapping and page->index members. > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > device) and also map this page into userspace. > > > > I think this is already the case i.e pages still DMA-mapped and also > mapped into userspace. True, other drivers are doing the same. > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > zerocopy will bump the refcnt, which means the page_pool will not > > recycle the page when it see the elevated refcnt (it will instead > > release its DMA-mapping). > > Yes this is right but the userspace might have already consumed and > unmapped the page before the driver considers to recycle the page. That is a good point. So, there is a race window where it is possible to gain recycling. It seems my page_pool co-maintainer Ilias is interested in taking up the challenge to get this working with TCP RX zerocopy. So, lets see how this is doable. > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > page->private for tricks. And our patch [3/5] use page->private for > > storing xdp_mem_info. > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > call page_pool_release_page() to release its DMA-mapping. > > > > I will let TCP RX zerocopy experts respond to this but from my high > level code inspection, I didn't see page->private usage. I trust when Eric says page->private isn't used in this code path. So, it might actually be possible :-) I will challenge Ilias and Matteo to pull this off. (But I know that both of them are busy for personal reasons, so be patient with them).
On Mon, Apr 19, 2021 at 01:22:04PM +0200, Jesper Dangaard Brouer wrote: > On Wed, 14 Apr 2021 13:09:47 -0700 > Shakeel Butt <shakeelb@google.com> wrote: > > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > > <brouer@redhat.com> wrote: > > > > > [...] > > > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > > can not be used. > > > > > > > > Yes it can, since it's going to be used as your default allocator for > > > > payloads, which might end up on an SKB. > > > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > > zerocopy. > > > For several reasons. > > > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > > means using page->mapping and page->index members (right?). > > > > > > > No, only page->_mapcount is used. > > Good to know. > I will admit that I don't fully understand the usage of page->mapping > and page->index members. That's fair. It's not well-documented, and it's complicated. For a page mapped into userspace, page->mapping is one of: - NULL - A pointer to a file's address_space - A pointer to an anonymous page's anon_vma If a page isn't mapped into userspace, you can use the space in page->mapping for anything you like (eg slab uses it) page->index is only used for indicating pfmemalloc today (and I want to move that indicator). I think it can also be used to merge VMAs (if some other conditions are also true), but failing to merge VMAs isn't a big deal for this kind of situation. > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > > device) and also map this page into userspace. > > > > I think this is already the case i.e pages still DMA-mapped and also > > mapped into userspace. > > True, other drivers are doing the same. And the contents of this page already came from that device ... if it wanted to write bad data, it could already have done so. > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > zerocopy will bump the refcnt, which means the page_pool will not > > > recycle the page when it see the elevated refcnt (it will instead > > > release its DMA-mapping). > > > > Yes this is right but the userspace might have already consumed and > > unmapped the page before the driver considers to recycle the page. > > That is a good point. So, there is a race window where it is possible > to gain recycling. > > It seems my page_pool co-maintainer Ilias is interested in taking up the > challenge to get this working with TCP RX zerocopy. So, lets see how > this is doable. You could also check page_ref_count() - page_mapcount() instead of just checking page_ref_count(). Assuming mapping/unmapping can't race with recycling?
On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote: > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > > <brouer@redhat.com> wrote: > > > > > [...] > > > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > > can not be used. > > > > > > > > Yes it can, since it's going to be used as your default allocator for > > > > payloads, which might end up on an SKB. > > > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > > zerocopy. > > > For several reasons. > > > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > > means using page->mapping and page->index members (right?). > > > > > > > No, only page->_mapcount is used. > > > > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to > adopt the recycling mechanism we should try preserving the current > functionality of the network stack. > > The question is how does it work with the current drivers that already have an > internal page recycling mechanism. > I think the current drivers check page_ref_count(page) to decide to reuse (or not) the already allocated pages. Some examples from the drivers: drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page() drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page() drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get() > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > > device) and also map this page into userspace. > > > > > > > I think this is already the case i.e pages still DMA-mapped and also > > mapped into userspace. > > > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > zerocopy will bump the refcnt, which means the page_pool will not > > > recycle the page when it see the elevated refcnt (it will instead > > > release its DMA-mapping). > > > > Yes this is right but the userspace might have already consumed and > > unmapped the page before the driver considers to recycle the page. > > Same question here. I'll have a closer look in a few days and make sure we are > not breaking anything wrt zerocopy. > Pages mapped into the userspace have their refcnt elevated, so the page_ref_count() check by the drivers indicates to not reuse such pages. > > > > > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > > page->private for tricks. And our patch [3/5] use page->private for > > > storing xdp_mem_info. > > > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > > call page_pool_release_page() to release its DMA-mapping. > > > > > > > I will let TCP RX zerocopy experts respond to this but from my high > > level code inspection, I didn't see page->private usage. > > Shakeel are you aware of any 'easy' way I can have rx zerocopy running? > I would recommend tools/testing/selftests/net/tcp_mmap.c.
Hi Shakeel, On Mon, Apr 19, 2021 at 07:57:03AM -0700, Shakeel Butt wrote: > On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas > <ilias.apalodimas@linaro.org> wrote: > > > > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote: > > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > > > <brouer@redhat.com> wrote: > > > > > > > [...] > > > > > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > > > can not be used. > > > > > > > > > > Yes it can, since it's going to be used as your default allocator for > > > > > payloads, which might end up on an SKB. > > > > > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > > > zerocopy. > > > > For several reasons. > > > > > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > > > means using page->mapping and page->index members (right?). > > > > > > > > > > No, only page->_mapcount is used. > > > > > > > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to > > adopt the recycling mechanism we should try preserving the current > > functionality of the network stack. > > > > The question is how does it work with the current drivers that already have an > > internal page recycling mechanism. > > > > I think the current drivers check page_ref_count(page) to decide to > reuse (or not) the already allocated pages. > > Some examples from the drivers: > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page() > drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page() > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get() > Yes, that's how internal recycling is done in drivers. As Jesper mentioned the refcnt of the page is 1 for the page_pool owned pages and that's how we decide what to do with the page. > > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > > > device) and also map this page into userspace. > > > > > > > > > > I think this is already the case i.e pages still DMA-mapped and also > > > mapped into userspace. > > > > > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > > zerocopy will bump the refcnt, which means the page_pool will not > > > > recycle the page when it see the elevated refcnt (it will instead > > > > release its DMA-mapping). > > > > > > Yes this is right but the userspace might have already consumed and > > > unmapped the page before the driver considers to recycle the page. > > > > Same question here. I'll have a closer look in a few days and make sure we are > > not breaking anything wrt zerocopy. > > > > Pages mapped into the userspace have their refcnt elevated, so the > page_ref_count() check by the drivers indicates to not reuse such > pages. > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() which will end up doing a get_page(). What you are saying is that once the zerocopy is done though, skb_release_data() won't be called, but instead put_page() will be? If that's the case then we are indeed leaking DMA mappings and memory. That sounds weird though, since the refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who eventually frees the page? If kfree_skb() (or any wrapper that calls skb_release_data()) is called eventually, we'll end up properly recycling the page into our pool. > > > > > > > > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > > > page->private for tricks. And our patch [3/5] use page->private for > > > > storing xdp_mem_info. > > > > > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > > > call page_pool_release_page() to release its DMA-mapping. > > > > > > > > > > I will let TCP RX zerocopy experts respond to this but from my high > > > level code inspection, I didn't see page->private usage. > > > > Shakeel are you aware of any 'easy' way I can have rx zerocopy running? > > > > I would recommend tools/testing/selftests/net/tcp_mmap.c. Ok, thanks I'll have a look. Cheers /Ilias
On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > [...] > > Pages mapped into the userspace have their refcnt elevated, so the > > page_ref_count() check by the drivers indicates to not reuse such > > pages. > > > > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() > which will end up doing a get_page(). > What you are saying is that once the zerocopy is done though, skb_release_data() > won't be called, but instead put_page() will be? If that's the case then we are > indeed leaking DMA mappings and memory. That sounds weird though, since the > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who > eventually frees the page? > If kfree_skb() (or any wrapper that calls skb_release_data()) is called > eventually, we'll end up properly recycling the page into our pool. > From what I understand (Eric, please correct me if I'm wrong) for simple cases there are 3 page references taken. One by the driver, second by skb and third by page table. In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one page ref through insert_page_into_pte_locked(). However before returning from tcp_zerocopy_receive(), the skb references are dropped through tcp_recv_skb(). So, whenever the user unmaps the page and drops the page ref only then that page can be reused by the driver. In my understanding, for zerocopy rx the skb_release_data() is called on the pages while they are still mapped into the userspace. So, skb_release_data() might not be the right place to recycle the page for zerocopy. The email chain at [1] has some discussion on how to bundle the recycling of pages with their lifetime. [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/
On Mon, Apr 19, 2021 at 09:21:55AM -0700, Shakeel Butt wrote: > On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas > <ilias.apalodimas@linaro.org> wrote: > > > [...] > > > Pages mapped into the userspace have their refcnt elevated, so the > > > page_ref_count() check by the drivers indicates to not reuse such > > > pages. > > > > > > > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() > > which will end up doing a get_page(). > > What you are saying is that once the zerocopy is done though, skb_release_data() > > won't be called, but instead put_page() will be? If that's the case then we are > > indeed leaking DMA mappings and memory. That sounds weird though, since the > > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who > > eventually frees the page? > > If kfree_skb() (or any wrapper that calls skb_release_data()) is called > > eventually, we'll end up properly recycling the page into our pool. > > > > From what I understand (Eric, please correct me if I'm wrong) for > simple cases there are 3 page references taken. One by the driver, > second by skb and third by page table. > > In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one > page ref through insert_page_into_pte_locked(). However before > returning from tcp_zerocopy_receive(), the skb references are dropped > through tcp_recv_skb(). So, whenever the user unmaps the page and > drops the page ref only then that page can be reused by the driver. > > In my understanding, for zerocopy rx the skb_release_data() is called > on the pages while they are still mapped into the userspace. So, > skb_release_data() might not be the right place to recycle the page > for zerocopy. The email chain at [1] has some discussion on how to > bundle the recycling of pages with their lifetime. > > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/ Ah right, you mentioned the same email before and I completely forgot about it! In the past we had thoughts of 'stealing' the page on put_page instead of skb_release_data(). We were afraid that this would cause a measurable performance hit, so we tried to limit it within the skb lifecycle. However I don't think this will be a problem. Assuming we are right here and skb_release_data() is called while the userspace holds an extra reference from the mapping here's what will happen: skb_release_data() -> skb_free_head() -> page_pool_return_skb_page() -> set_page_private() -> xdp_return_skb_frame() -> __xdp_return() -> page_pool_put_full_page() -> page_pool_put_page() -> __page_pool_put_page() When we call __page_pool_put_page(), the refcnt will be != 1 (because a user mapping is still active), so we won't try to recycle it. Instead we'll remove the DMA mappings and decrease the refcnt. So although the recycling won't 'work', nothing bad will happen (famous last words). In any case, I'll double check with the test you pointed out before v4. Thanks! /Ilias
Hi Matthew, [...] > > And the contents of this page already came from that device ... if it > wanted to write bad data, it could already have done so. > > > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > > zerocopy will bump the refcnt, which means the page_pool will not > > > > recycle the page when it see the elevated refcnt (it will instead > > > > release its DMA-mapping). > > > > > > Yes this is right but the userspace might have already consumed and > > > unmapped the page before the driver considers to recycle the page. > > > > That is a good point. So, there is a race window where it is possible > > to gain recycling. > > > > It seems my page_pool co-maintainer Ilias is interested in taking up the > > challenge to get this working with TCP RX zerocopy. So, lets see how > > this is doable. > > You could also check page_ref_count() - page_mapcount() instead of > just checking page_ref_count(). Assuming mapping/unmapping can't > race with recycling? > That's not a bad idea. As I explained on my last reply to Shakeel, I don't think the current patch will blow up anywhere. If the page is unmapped prior to kfree_skb() it will be recycled. If it's done in a reverse order, we'll just free the page entirely and will have to re-allocate it. The only thing I need to test is potential races (assuming those can even happen?). Trying to recycle the page outside of kfree_skb() means we'd have to 'steal' the page, during put_page() (or some function that's outside the networking scope). I think this is going to have a measurable performance penalty though not in networking, but in general. In any case, that should be orthogonal to the current patchset. So unless someone feels strongly about it, I'd prefer keeping the current code and trying to enable recycling in the skb zc case, when we have enough users of the API. Thanks /Ilias
On 2021/4/10 6:37, Matteo Croce wrote: > From: Matteo Croce <mcroce@microsoft.com> > > This is a respin of [1] > > This patchset shows the plans for allowing page_pool to handle and > maintain DMA map/unmap of the pages it serves to the driver. For this > to work a return hook in the network core is introduced. > > The overall purpose is to simplify drivers, by providing a page > allocation API that does recycling, such that each driver doesn't have > to reinvent its own recycling scheme. Using page_pool in a driver > does not require implementing XDP support, but it makes it trivially > easy to do so. Instead of allocating buffers specifically for SKBs > we now allocate a generic buffer and either wrap it on an SKB > (via build_skb) or create an XDP frame. > The recycling code leverages the XDP recycle APIs. > > The Marvell mvpp2 and mvneta drivers are used in this patchset to > demonstrate how to use the API, and tested on a MacchiatoBIN > and EspressoBIN boards respectively. > Hi, Matteo I added the skb frag page recycling in hns3 based on this patchset, and it has above 10%~20% performance improvement for one thread iperf TCP flow(IOMMU is off, there may be more performance improvement if considering the DMA map/unmap avoiding for IOMMU), thanks for the job. The skb frag page recycling support in hns3 driver is not so simple as the mvpp2 and mvneta driver, because: 1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info" is added to assist relation binding between the "struct page" and "struct page_pool". 2. the hns3 driver has already a page reusing based on page spliting and page reference count, but it may not work if the upper stack can not handle skb and release the corresponding page fast enough. 3. the hns3 driver support page reference count updating batching, see: aeda9bf87a45 ("net: hns3: batch the page reference count updates") So it would be better if: 1. skb frag page recycling do not need "struct xdp_rxq_info" or "struct xdp_mem_info" to bond the relation between "struct page" and "struct page_pool", which seems uncessary at this point if bonding a "struct page_pool" pointer directly in "struct page" does not cause space increasing. 2. it would be good to do the page reference count updating batching in page pool instead of specific driver. page_pool_atomic_sub_if_positive() is added to decide who can call page_pool_put_full_page(), because the driver and stack may hold reference to the same page, only if last one which hold complete reference to a page can call page_pool_put_full_page() to decide if recycling is possible, if not, the page is released, so I am wondering if a similar page_pool_atomic_sub_if_positive() can added to specific user space address unmapping path to allow skb recycling for RX zerocopy too? diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index c21dd11..8b01a7d 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2566,7 +2566,10 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring, unsigned int order = hns3_page_order(ring); struct page *p; - p = dev_alloc_pages(order); + if (ring->page_pool) + p = page_pool_dev_alloc_pages(ring->page_pool); + else + p = dev_alloc_pages(order); if (!p) return -ENOMEM; @@ -2582,13 +2585,32 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring, return 0; } +static void hns3_page_frag_cache_drain(struct hns3_enet_ring *ring, + struct hns3_desc_cb *cb) +{ + if (ring->page_pool) { + struct page *p = cb->priv; + + if (page_pool_atomic_sub_if_positive(p, cb->pagecnt_bias)) + return; + + if (cb->pagecnt_bias > 1) + page_ref_sub(p, cb->pagecnt_bias - 1); + + page_pool_put_full_page(ring->page_pool, p, false); + return; + } + + __page_frag_cache_drain(cb->priv, cb->pagecnt_bias); +} + static void hns3_free_buffer(struct hns3_enet_ring *ring, struct hns3_desc_cb *cb, int budget) { if (cb->type == DESC_TYPE_SKB) napi_consume_skb(cb->priv, budget); else if (!HNAE3_IS_TX_RING(ring) && cb->pagecnt_bias) - __page_frag_cache_drain(cb->priv, cb->pagecnt_bias); + hns3_page_frag_cache_drain(ring, cb); memset(cb, 0, sizeof(*cb)); } @@ -2892,13 +2914,15 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i, skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len, size - pull_len, truesize); + skb_mark_for_recycle(skb, desc_cb->priv, &ring->rxq_info.mem); + /* Avoid re-using remote and pfmemalloc pages, or the stack is still * using the page when page_offset rollback to zero, flag default * unreuse */ if (!dev_page_is_reusable(desc_cb->priv) || (!desc_cb->page_offset && !hns3_can_reuse_page(desc_cb))) { - __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias); + hns3_page_frag_cache_drain(ring, desc_cb); return; } @@ -2911,7 +2935,7 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i, desc_cb->reuse_flag = 1; desc_cb->page_offset = 0; } else if (desc_cb->pagecnt_bias) { - __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias); + hns3_page_frag_cache_drain(ring, desc_cb); return; } @@ -3156,8 +3180,7 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length, if (dev_page_is_reusable(desc_cb->priv)) desc_cb->reuse_flag = 1; else /* This page cannot be reused so discard it */ - __page_frag_cache_drain(desc_cb->priv, - desc_cb->pagecnt_bias); + hns3_page_frag_cache_drain(ring, desc_cb); hns3_rx_ring_move_fw(ring); return 0; @@ -4028,6 +4051,33 @@ static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring) goto out_with_desc_cb; if (!HNAE3_IS_TX_RING(ring)) { + struct page_pool_params pp_params = { + /* internal DMA mapping in page_pool */ + .flags = 0, + .order = 0, + .pool_size = 1024, + .nid = dev_to_node(ring_to_dev(ring)), + .dev = ring_to_dev(ring), + .dma_dir = DMA_FROM_DEVICE, + .offset = 0, + .max_len = 0, + }; + + ring->page_pool = page_pool_create(&pp_params); + if (IS_ERR(ring->page_pool)) { + dev_err(ring_to_dev(ring), "page pool creation failed\n"); + ring->page_pool = NULL; + } + + ret = xdp_rxq_info_reg(&ring->rxq_info, ring_to_netdev(ring), ring->queue_index, 0); + if (ret) + dev_err(ring_to_dev(ring), "xdp_rxq_info_reg failed\n"); + + ret = xdp_rxq_info_reg_mem_model(&ring->rxq_info, MEM_TYPE_PAGE_POOL, + ring->page_pool); + if (ret) + dev_err(ring_to_dev(ring), "xdp_rxq_info_reg_mem_model failed\n"); + ret = hns3_alloc_ring_buffers(ring); if (ret) goto out_with_desc; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index daa04ae..fd53fcc 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -6,6 +6,9 @@ #include <linux/if_vlan.h> +#include <net/page_pool.h> +#include <net/xdp.h> + #include "hnae3.h" enum hns3_nic_state { @@ -408,6 +411,8 @@ struct hns3_enet_ring { struct hnae3_queue *tqp; int queue_index; struct device *dev; /* will be used for DMA mapping of descriptors */ + struct page_pool *page_pool; struct hnae3_queue *tqp; int queue_index; struct device *dev; /* will be used for DMA mapping of descriptors */ + struct page_pool *page_pool; + struct xdp_rxq_info rxq_info; /* statistic */ struct ring_stats stats; diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 75fffc1..70c310e 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -195,6 +195,8 @@ static inline void page_pool_put_full_page(struct page_pool *pool, #endif } +bool page_pool_atomic_sub_if_positive(struct page *page, int i); + /* Same as above but the caller must guarantee safe context. e.g NAPI */ static inline void page_pool_recycle_direct(struct page_pool *pool, struct page *page) diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 43bfd2e..8bc8b7e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -596,6 +596,26 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid) } EXPORT_SYMBOL(page_pool_update_nid); +bool page_pool_atomic_sub_if_positive(struct page *page, int i) +{ + atomic_t *v = &page->_refcount; + int dec, c; + + do { + c = atomic_read(v); + + dec = c - i; + if (unlikely(dec == 0)) + return false; + else if (unlikely(dec < 0)) { + pr_err("c: %d, dec: %d, i: %d\n", c, dec, i); + return false; + } + } while (!atomic_try_cmpxchg(v, &c, dec)); + + return true; +} + bool page_pool_return_skb_page(void *data) { struct xdp_mem_info mem_info; @@ -606,6 +626,9 @@ bool page_pool_return_skb_page(void *data) if (unlikely(page->signature != PP_SIGNATURE)) return false; + if (page_pool_atomic_sub_if_positive(page, 1)) + return true; + info.raw = page_private(page); mem_info = info.mem_info;
Hi Yunsheng, On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote: > On 2021/4/10 6:37, Matteo Croce wrote: > > From: Matteo Croce <mcroce@microsoft.com> > > > > This is a respin of [1] > > > > This patchset shows the plans for allowing page_pool to handle and > > maintain DMA map/unmap of the pages it serves to the driver. For this > > to work a return hook in the network core is introduced. > > > > The overall purpose is to simplify drivers, by providing a page > > allocation API that does recycling, such that each driver doesn't have > > to reinvent its own recycling scheme. Using page_pool in a driver > > does not require implementing XDP support, but it makes it trivially > > easy to do so. Instead of allocating buffers specifically for SKBs > > we now allocate a generic buffer and either wrap it on an SKB > > (via build_skb) or create an XDP frame. > > The recycling code leverages the XDP recycle APIs. > > > > The Marvell mvpp2 and mvneta drivers are used in this patchset to > > demonstrate how to use the API, and tested on a MacchiatoBIN > > and EspressoBIN boards respectively. > > > > Hi, Matteo > I added the skb frag page recycling in hns3 based on this patchset, > and it has above 10%~20% performance improvement for one thread iperf > TCP flow(IOMMU is off, there may be more performance improvement if > considering the DMA map/unmap avoiding for IOMMU), thanks for the job. > > The skb frag page recycling support in hns3 driver is not so simple > as the mvpp2 and mvneta driver, because: > > 1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info" > is added to assist relation binding between the "struct page" and > "struct page_pool". > > 2. the hns3 driver has already a page reusing based on page spliting and > page reference count, but it may not work if the upper stack can not > handle skb and release the corresponding page fast enough. > > 3. the hns3 driver support page reference count updating batching, see: > aeda9bf87a45 ("net: hns3: batch the page reference count updates") > > So it would be better if: > > 1. skb frag page recycling do not need "struct xdp_rxq_info" or > "struct xdp_mem_info" to bond the relation between "struct page" and > "struct page_pool", which seems uncessary at this point if bonding > a "struct page_pool" pointer directly in "struct page" does not cause > space increasing. We can't do that. The reason we need those structs is that we rely on the existing XDP code, which already recycles it's buffers, to enable recycling. Since we allocate a page per packet when using page_pool for a driver , the same ideas apply to an SKB and XDP frame. We just recycle the payload and we don't really care what's in that. We could rename the functions to something more generic in the future though ? > > 2. it would be good to do the page reference count updating batching > in page pool instead of specific driver. > > > page_pool_atomic_sub_if_positive() is added to decide who can call > page_pool_put_full_page(), because the driver and stack may hold > reference to the same page, only if last one which hold complete > reference to a page can call page_pool_put_full_page() to decide if > recycling is possible, if not, the page is released, so I am wondering > if a similar page_pool_atomic_sub_if_positive() can added to specific > user space address unmapping path to allow skb recycling for RX zerocopy > too? > I would prefer a different page pool type if we wanted to support the split page model. The changes as is are quite intrusive, since they change the entire skb return path. So I would prefer introducing the changes one at a time. The fundamental difference between having the recycling in the driver vs having it in a generic API is pretty straightforward. When a driver holds the extra page references he is free to decide what to reuse, when he is about to refill his Rx descriptors. So TCP zerocopy might work even if the userspace applications hold the buffers for an X amount of time. On this proposal though we *need* to decide what to do with the buffer when we are about to free the skb. [...] Cheers /Ilias
On 2021/4/30 2:51, Ilias Apalodimas wrote: > Hi Yunsheng, > > On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote: >> On 2021/4/10 6:37, Matteo Croce wrote: >>> From: Matteo Croce <mcroce@microsoft.com> [...] >> >> 1. skb frag page recycling do not need "struct xdp_rxq_info" or >> "struct xdp_mem_info" to bond the relation between "struct page" and >> "struct page_pool", which seems uncessary at this point if bonding >> a "struct page_pool" pointer directly in "struct page" does not cause >> space increasing. > > We can't do that. The reason we need those structs is that we rely on the > existing XDP code, which already recycles it's buffers, to enable > recycling. Since we allocate a page per packet when using page_pool for a > driver , the same ideas apply to an SKB and XDP frame. We just recycle the I am not really familar with XDP here, but a packet from hw is either a "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack, a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at the same time, right? What does not really make sense to me is that the page has to be from page pool when a skb's frag page can be recycled, right? If it is ture, the switch case in __xdp_return() does not really make sense for skb recycling, why go all the trouble of checking the mem->type and mem->id to find the page_pool pointer when recyclable page for skb can only be from page pool? > payload and we don't really care what's in that. We could rename the functions > to something more generic in the future though ? > >> >> 2. it would be good to do the page reference count updating batching >> in page pool instead of specific driver. >> >> >> page_pool_atomic_sub_if_positive() is added to decide who can call >> page_pool_put_full_page(), because the driver and stack may hold >> reference to the same page, only if last one which hold complete >> reference to a page can call page_pool_put_full_page() to decide if >> recycling is possible, if not, the page is released, so I am wondering >> if a similar page_pool_atomic_sub_if_positive() can added to specific >> user space address unmapping path to allow skb recycling for RX zerocopy >> too? >> > > I would prefer a different page pool type if we wanted to support the split > page model. The changes as is are quite intrusive, since they change the > entire skb return path. So I would prefer introducing the changes one at a > time. I understand there may be fundamental semantic change when split page model is supported by page pool, but the split page support change mainly affect the skb recycling path and the driver that uses page pool(XDP too) if we are careful enough, not the entire skb return path as my understanding. Anyway, one changes at a time is always prefered if supporting split page is proved to be non-trivel and intrusive. > > The fundamental difference between having the recycling in the driver vs > having it in a generic API is pretty straightforward. When a driver holds > the extra page references he is free to decide what to reuse, when he is about > to refill his Rx descriptors. So TCP zerocopy might work even if the > userspace applications hold the buffers for an X amount of time. > On this proposal though we *need* to decide what to do with the buffer when we > are about to free the skb. I am not sure I understand what you meant by "free the skb", does it mean that kfree_skb() is called to free the skb. As my understanding, if the skb completely own the page(which means page_count() == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive() try to handle it atomically. > > [...] > > > Cheers > /Ilias > > . >
[...] > >> > >> 1. skb frag page recycling do not need "struct xdp_rxq_info" or > >> "struct xdp_mem_info" to bond the relation between "struct page" and > >> "struct page_pool", which seems uncessary at this point if bonding > >> a "struct page_pool" pointer directly in "struct page" does not cause > >> space increasing. > > > > We can't do that. The reason we need those structs is that we rely on the > > existing XDP code, which already recycles it's buffers, to enable > > recycling. Since we allocate a page per packet when using page_pool for a > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the > > I am not really familar with XDP here, but a packet from hw is either a > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack, > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at > the same time, right? > Yes, but the payload is irrelevant in both cases and that's what we use page_pool for. You can't use this patchset unless your driver usues build_skb(). So in both cases you just allocate memory for the payload and decide what the wrap the buffer with (XDP or SKB) later. > What does not really make sense to me is that the page has to be from page > pool when a skb's frag page can be recycled, right? If it is ture, the switch > case in __xdp_return() does not really make sense for skb recycling, why go > all the trouble of checking the mem->type and mem->id to find the page_pool > pointer when recyclable page for skb can only be from page pool? In any case you need to find in which pool the buffer you try to recycle belongs. In order to make the whole idea generic and be able to recycle skb fragments instead of just the skb head you need to store some information on struct page. That's the fundamental difference of this patchset compared to the RFC we sent a few years back [1] which was just storing information on the skb. The way this is done on the current patchset is that we store the struct xdp_mem_info in page->private and then look it up on xdp_return(). Now that being said Matthew recently reworked struct page, so we could see if we can store the page pool pointer directly instead of the struct xdp_mem_info. That would allow us to call into page pool functions directly. But we'll have to agree if that makes sense to go into struct page to begin with and make sure the pointer is still valid when we take the recycling path. > > payload and we don't really care what's in that. We could rename the functions > > to something more generic in the future though ? > > > >> > >> 2. it would be good to do the page reference count updating batching > >> in page pool instead of specific driver. > >> > >> > >> page_pool_atomic_sub_if_positive() is added to decide who can call > >> page_pool_put_full_page(), because the driver and stack may hold > >> reference to the same page, only if last one which hold complete > >> reference to a page can call page_pool_put_full_page() to decide if > >> recycling is possible, if not, the page is released, so I am wondering > >> if a similar page_pool_atomic_sub_if_positive() can added to specific > >> user space address unmapping path to allow skb recycling for RX zerocopy > >> too? > >> > > > > I would prefer a different page pool type if we wanted to support the split > > page model. The changes as is are quite intrusive, since they change the > > entire skb return path. So I would prefer introducing the changes one at a > > time. > > I understand there may be fundamental semantic change when split page model > is supported by page pool, but the split page support change mainly affect the > skb recycling path and the driver that uses page pool(XDP too) if we are careful > enough, not the entire skb return path as my understanding. It affects those drivers only, but in order to do so is intercepts the packet in skb_free_head(), which pretty much affects the entire network path. > > Anyway, one changes at a time is always prefered if supporting split page is > proved to be non-trivel and intrusive. > > > > > The fundamental difference between having the recycling in the driver vs > > having it in a generic API is pretty straightforward. When a driver holds > > the extra page references he is free to decide what to reuse, when he is about > > to refill his Rx descriptors. So TCP zerocopy might work even if the > > userspace applications hold the buffers for an X amount of time. > > On this proposal though we *need* to decide what to do with the buffer when we > > are about to free the skb. > > I am not sure I understand what you meant by "free the skb", does it mean > that kfree_skb() is called to free the skb. Yes > > As my understanding, if the skb completely own the page(which means page_count() > == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise > page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive() > try to handle it atomically. > Not really, the opposite is happening here. If the pp_recycle bit is set we will always call page_pool_return_skb_page(). If the page signature matches the 'magic' set by page pool we will always call xdp_return_skb_frame() will end up calling __page_pool_put_page(). If the refcnt is 1 we'll try to recycle the page. If it's not we'll release it from page_pool (releasing some internal references we keep) unmap the buffer and decrement the refcnt. [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/ Cheers /Ilias
(-cc invalid emails) Replying to my self here but.... [...] > > > > > > We can't do that. The reason we need those structs is that we rely on the > > > existing XDP code, which already recycles it's buffers, to enable > > > recycling. Since we allocate a page per packet when using page_pool for a > > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the > > > > I am not really familar with XDP here, but a packet from hw is either a > > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack, > > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at > > the same time, right? > > > > Yes, but the payload is irrelevant in both cases and that's what we use > page_pool for. You can't use this patchset unless your driver usues > build_skb(). So in both cases you just allocate memory for the payload and > decide what the wrap the buffer with (XDP or SKB) later. > > > What does not really make sense to me is that the page has to be from page > > pool when a skb's frag page can be recycled, right? If it is ture, the switch > > case in __xdp_return() does not really make sense for skb recycling, why go > > all the trouble of checking the mem->type and mem->id to find the page_pool > > pointer when recyclable page for skb can only be from page pool? > > In any case you need to find in which pool the buffer you try to recycle > belongs. In order to make the whole idea generic and be able to recycle skb > fragments instead of just the skb head you need to store some information on > struct page. That's the fundamental difference of this patchset compared to > the RFC we sent a few years back [1] which was just storing information on the > skb. The way this is done on the current patchset is that we store the > struct xdp_mem_info in page->private and then look it up on xdp_return(). > > Now that being said Matthew recently reworked struct page, so we could see if > we can store the page pool pointer directly instead of the struct > xdp_mem_info. That would allow us to call into page pool functions directly. > But we'll have to agree if that makes sense to go into struct page to begin > with and make sure the pointer is still valid when we take the recycling path. > Thinking more about it the reason that prevented us from storing a page pool pointer directly is not there anymore. Jesper fixed that already a while back. So we might as well store the page_pool ptr in page->private and call into the functions directly. I'll have a look before v4. [...] Thanks /Ilias
On Fri, 30 Apr 2021 20:32:07 +0300 Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > (-cc invalid emails) > Replying to my self here but.... > > [...] > > > > > > > > We can't do that. The reason we need those structs is that we rely on the > > > > existing XDP code, which already recycles it's buffers, to enable > > > > recycling. Since we allocate a page per packet when using page_pool for a > > > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the > > > > > > I am not really familar with XDP here, but a packet from hw is either a > > > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack, > > > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at > > > the same time, right? > > > > > > > Yes, but the payload is irrelevant in both cases and that's what we use > > page_pool for. You can't use this patchset unless your driver usues > > build_skb(). So in both cases you just allocate memory for the payload and > > decide what the wrap the buffer with (XDP or SKB) later. > > > > > What does not really make sense to me is that the page has to be from page > > > pool when a skb's frag page can be recycled, right? If it is ture, the switch > > > case in __xdp_return() does not really make sense for skb recycling, why go > > > all the trouble of checking the mem->type and mem->id to find the page_pool > > > pointer when recyclable page for skb can only be from page pool? > > > > In any case you need to find in which pool the buffer you try to recycle > > belongs. In order to make the whole idea generic and be able to recycle skb > > fragments instead of just the skb head you need to store some information on > > struct page. That's the fundamental difference of this patchset compared to > > the RFC we sent a few years back [1] which was just storing information on the > > skb. The way this is done on the current patchset is that we store the > > struct xdp_mem_info in page->private and then look it up on xdp_return(). > > > > Now that being said Matthew recently reworked struct page, so we could see if > > we can store the page pool pointer directly instead of the struct > > xdp_mem_info. That would allow us to call into page pool functions directly. > > But we'll have to agree if that makes sense to go into struct page to begin > > with and make sure the pointer is still valid when we take the recycling path. > > > > Thinking more about it the reason that prevented us from storing a > page pool pointer directly is not there anymore. Jesper fixed that > already a while back. So we might as well store the page_pool ptr in > page->private and call into the functions directly. I'll have a look > before v4. I want to give credit to Jonathan Lemon whom came up with the idea of storing the page_pool object that "owns" the page directly in struct page. I see this as an optimization that we can add later, so it doesn't block this patchset. As Ilias mention, it required some work/changes[1]+[2] to guarantee that the page_pool object life-time were longer than all the outstanding in-flight page-objects, but that have been stable for some/many kernel releases now. This is already need/used for making sure the DMA-mappings can be safely released[1], but I on-purpose enabled the same in-flight tracking for page_pool users that doesn't use the DMA-mapping feature (making sure the code is exercised). [1] 99c07c43c4ea ("xdp: tracking page_pool resources and safe removal") [2] c3f812cea0d7 ("page_pool: do not release pool until inflight == 0.")
On 2021/5/1 0:24, Ilias Apalodimas wrote: > [...] >>>> >>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or >>>> "struct xdp_mem_info" to bond the relation between "struct page" and >>>> "struct page_pool", which seems uncessary at this point if bonding >>>> a "struct page_pool" pointer directly in "struct page" does not cause >>>> space increasing. >>> >>> We can't do that. The reason we need those structs is that we rely on the >>> existing XDP code, which already recycles it's buffers, to enable >>> recycling. Since we allocate a page per packet when using page_pool for a >>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the >> >> I am not really familar with XDP here, but a packet from hw is either a >> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack, >> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at >> the same time, right? >> > > Yes, but the payload is irrelevant in both cases and that's what we use > page_pool for. You can't use this patchset unless your driver usues > build_skb(). So in both cases you just allocate memory for the payload and I am not sure I understood why build_skb() matters here. If the head data of a skb is a page frag and is from page pool, then it's page->signature should be PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does not require it's head data being from a page pool, right? > decide what the wrap the buffer with (XDP or SKB) later. [...] >> >> I am not sure I understand what you meant by "free the skb", does it mean >> that kfree_skb() is called to free the skb. > > Yes > >> >> As my understanding, if the skb completely own the page(which means page_count() >> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise >> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive() >> try to handle it atomically. >> > > Not really, the opposite is happening here. If the pp_recycle bit is set we > will always call page_pool_return_skb_page(). If the page signature matches > the 'magic' set by page pool we will always call xdp_return_skb_frame() will > end up calling __page_pool_put_page(). If the refcnt is 1 we'll try > to recycle the page. If it's not we'll release it from page_pool (releasing > some internal references we keep) unmap the buffer and decrement the refcnt. Yes, I understood the above is what the page pool do now. But the question is who is still holding an extral reference to the page when kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral reference to the same page? So why not just do a page_ref_dec() if the orginal skb is freed first, and call __page_pool_put_page() when the cloned skb is freed later? So that we can always reuse the recyclable page from a recyclable skb. This may make the page_pool_destroy() process delays longer than before, I am supposed the page_pool_destroy() delaying for cloned skb case does not really matters here. If the above works, I think the samiliar handling can be added to RX zerocopy if the RX zerocopy also hold extral references to the recyclable page from a recyclable skb too? > > [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/ > > Cheers > /Ilias > > . >
On Thu, May 06, 2021 at 08:34:48PM +0800, Yunsheng Lin wrote: > On 2021/5/1 0:24, Ilias Apalodimas wrote: > > [...] > >>>> > >>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or > >>>> "struct xdp_mem_info" to bond the relation between "struct page" and > >>>> "struct page_pool", which seems uncessary at this point if bonding > >>>> a "struct page_pool" pointer directly in "struct page" does not cause > >>>> space increasing. > >>> > >>> We can't do that. The reason we need those structs is that we rely on the > >>> existing XDP code, which already recycles it's buffers, to enable > >>> recycling. Since we allocate a page per packet when using page_pool for a > >>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the > >> > >> I am not really familar with XDP here, but a packet from hw is either a > >> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack, > >> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at > >> the same time, right? > >> > > > > Yes, but the payload is irrelevant in both cases and that's what we use > > page_pool for. You can't use this patchset unless your driver usues > > build_skb(). So in both cases you just allocate memory for the payload and > > I am not sure I understood why build_skb() matters here. If the head data of > a skb is a page frag and is from page pool, then it's page->signature should be > PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does > not require it's head data being from a page pool, right? > Correct, and that's the big improvement compared to the original RFC. The wording was a bit off in my initial response. I was trying to point out you can recycle *any* buffer coming from page_pool and one of the ways you can do that in your driver, is use build_skb() while the payload is allocated by page_pool. > > decide what the wrap the buffer with (XDP or SKB) later. > > [...] > > >> > >> I am not sure I understand what you meant by "free the skb", does it mean > >> that kfree_skb() is called to free the skb. > > > > Yes > > > >> > >> As my understanding, if the skb completely own the page(which means page_count() > >> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise > >> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive() > >> try to handle it atomically. > >> > > > > Not really, the opposite is happening here. If the pp_recycle bit is set we > > will always call page_pool_return_skb_page(). If the page signature matches > > the 'magic' set by page pool we will always call xdp_return_skb_frame() will > > end up calling __page_pool_put_page(). If the refcnt is 1 we'll try > > to recycle the page. If it's not we'll release it from page_pool (releasing > > some internal references we keep) unmap the buffer and decrement the refcnt. > > Yes, I understood the above is what the page pool do now. > > But the question is who is still holding an extral reference to the page when > kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral > reference to the same page? So why not just do a page_ref_dec() if the orginal skb > is freed first, and call __page_pool_put_page() when the cloned skb is freed later? > So that we can always reuse the recyclable page from a recyclable skb. This may > make the page_pool_destroy() process delays longer than before, I am supposed the > page_pool_destroy() delaying for cloned skb case does not really matters here. > > If the above works, I think the samiliar handling can be added to RX zerocopy if > the RX zerocopy also hold extral references to the recyclable page from a recyclable > skb too? > Right, this sounds doable, but I'll have to go back code it and see if it really makes sense. However I'd still prefer the support to go in as-is (including the struct xdp_mem_info in struct page, instead of a page_pool pointer). There's a couple of reasons for that. If we keep the struct xdp_mem_info we can in the future recycle different kind of buffers using __xdp_return(). And this is a non intrusive change if we choose to store the page pool address directly in the future. It just affects the internal contract between the page_pool code and struct page. So it won't affect any drivers that already use the feature. Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer playing it safe for now and getting rid of the buffers that somehow ended up holding an extra reference. Once this gets approved we can go back and try to save the extra space. I hope I am not wrong but the changes required to support a few extra refcounts should not change the current patches much. Thanks for taking the time on this! /Ilias > > > > [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/ > > > > Cheers > > /Ilias > > > > . > > >
On 2021/5/6 20:58, Ilias Apalodimas wrote: >>>> >>> >>> Not really, the opposite is happening here. If the pp_recycle bit is set we >>> will always call page_pool_return_skb_page(). If the page signature matches >>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will >>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try >>> to recycle the page. If it's not we'll release it from page_pool (releasing >>> some internal references we keep) unmap the buffer and decrement the refcnt. >> >> Yes, I understood the above is what the page pool do now. >> >> But the question is who is still holding an extral reference to the page when >> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral >> reference to the same page? So why not just do a page_ref_dec() if the orginal skb >> is freed first, and call __page_pool_put_page() when the cloned skb is freed later? >> So that we can always reuse the recyclable page from a recyclable skb. This may >> make the page_pool_destroy() process delays longer than before, I am supposed the >> page_pool_destroy() delaying for cloned skb case does not really matters here. >> >> If the above works, I think the samiliar handling can be added to RX zerocopy if >> the RX zerocopy also hold extral references to the recyclable page from a recyclable >> skb too? >> > > Right, this sounds doable, but I'll have to go back code it and see if it > really makes sense. However I'd still prefer the support to go in as-is > (including the struct xdp_mem_info in struct page, instead of a page_pool > pointer). > > There's a couple of reasons for that. If we keep the struct xdp_mem_info we > can in the future recycle different kind of buffers using __xdp_return(). > And this is a non intrusive change if we choose to store the page pool address > directly in the future. It just affects the internal contract between the > page_pool code and struct page. So it won't affect any drivers that already > use the feature. This patchset has embeded a signature field in "struct page", and xdp_mem_info is stored in page_private(), which seems not considering the case for associating the page pool with "struct page" directly yet? Is the page pool also stored in page_private() and a different signature is used to indicate that? I am not saying we have to do it in this patchset, but we have to consider it while we are adding new signature field to "struct page", right? > Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer > playing it safe for now and getting rid of the buffers that somehow ended up > holding an extra reference. Once this gets approved we can go back and try to > save the extra space. I hope I am not wrong but the changes required to > support a few extra refcounts should not change the current patches much. > > Thanks for taking the time on this! Thanks all invovled in the effort improving page pool too:) > /Ilias > >>> >>> [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/ >>> >>> Cheers >>> /Ilias >>> >>> . >>> >> > > . >
On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote: > On 2021/5/6 20:58, Ilias Apalodimas wrote: > >>>> > >>> > >>> Not really, the opposite is happening here. If the pp_recycle bit is set we > >>> will always call page_pool_return_skb_page(). If the page signature matches > >>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will > >>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try > >>> to recycle the page. If it's not we'll release it from page_pool (releasing > >>> some internal references we keep) unmap the buffer and decrement the refcnt. > >> > >> Yes, I understood the above is what the page pool do now. > >> > >> But the question is who is still holding an extral reference to the page when > >> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral > >> reference to the same page? So why not just do a page_ref_dec() if the orginal skb > >> is freed first, and call __page_pool_put_page() when the cloned skb is freed later? > >> So that we can always reuse the recyclable page from a recyclable skb. This may > >> make the page_pool_destroy() process delays longer than before, I am supposed the > >> page_pool_destroy() delaying for cloned skb case does not really matters here. > >> > >> If the above works, I think the samiliar handling can be added to RX zerocopy if > >> the RX zerocopy also hold extral references to the recyclable page from a recyclable > >> skb too? > >> > > > > Right, this sounds doable, but I'll have to go back code it and see if it > > really makes sense. However I'd still prefer the support to go in as-is > > (including the struct xdp_mem_info in struct page, instead of a page_pool > > pointer). > > > > There's a couple of reasons for that. If we keep the struct xdp_mem_info we > > can in the future recycle different kind of buffers using __xdp_return(). > > And this is a non intrusive change if we choose to store the page pool address > > directly in the future. It just affects the internal contract between the > > page_pool code and struct page. So it won't affect any drivers that already > > use the feature. > > This patchset has embeded a signature field in "struct page", and xdp_mem_info > is stored in page_private(), which seems not considering the case for associating > the page pool with "struct page" directly yet? Correct > Is the page pool also stored in > page_private() and a different signature is used to indicate that? No only struct xdp_mem_info as you mentioned before > > I am not saying we have to do it in this patchset, but we have to consider it > while we are adding new signature field to "struct page", right? We won't need a new signature. The signature in both cases is there to guarantee the page you are trying to recycle was indeed allocated by page_pool. Basically we got two design choices here: - We store the page_pool ptr address directly in page->private and then, we call into page_pool APIs directly to do the recycling. That would eliminate the lookup through xdp_mem_info and the XDP helpers to locate page pool pointer (through __xdp_return). - You store the xdp_mem_info on page_private. In that case you need to go through __xdp_return() to locate the page_pool pointer. Although we might loose some performance that would allow us to recycle additional memory types and not only MEM_TYPE_PAGE_POOL (in case we ever need it). I think both choices are sane. What I am trying to explain here, is regardless of what we choose now, we can change it in the future without affecting the API consumers at all. What will change internally is the way we lookup the page pool pointer we are trying to recycle. [...] Cheers /Ilias
On 2021/5/7 15:06, Ilias Apalodimas wrote: > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote: >> On 2021/5/6 20:58, Ilias Apalodimas wrote: >>>>>> >>>>> >>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we >>>>> will always call page_pool_return_skb_page(). If the page signature matches >>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will >>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try >>>>> to recycle the page. If it's not we'll release it from page_pool (releasing >>>>> some internal references we keep) unmap the buffer and decrement the refcnt. >>>> >>>> Yes, I understood the above is what the page pool do now. >>>> >>>> But the question is who is still holding an extral reference to the page when >>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral >>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb >>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later? >>>> So that we can always reuse the recyclable page from a recyclable skb. This may >>>> make the page_pool_destroy() process delays longer than before, I am supposed the >>>> page_pool_destroy() delaying for cloned skb case does not really matters here. >>>> >>>> If the above works, I think the samiliar handling can be added to RX zerocopy if >>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable >>>> skb too? >>>> >>> >>> Right, this sounds doable, but I'll have to go back code it and see if it >>> really makes sense. However I'd still prefer the support to go in as-is >>> (including the struct xdp_mem_info in struct page, instead of a page_pool >>> pointer). >>> >>> There's a couple of reasons for that. If we keep the struct xdp_mem_info we >>> can in the future recycle different kind of buffers using __xdp_return(). >>> And this is a non intrusive change if we choose to store the page pool address >>> directly in the future. It just affects the internal contract between the >>> page_pool code and struct page. So it won't affect any drivers that already >>> use the feature. >> >> This patchset has embeded a signature field in "struct page", and xdp_mem_info >> is stored in page_private(), which seems not considering the case for associating >> the page pool with "struct page" directly yet? > > Correct > >> Is the page pool also stored in >> page_private() and a different signature is used to indicate that? > > No only struct xdp_mem_info as you mentioned before > >> >> I am not saying we have to do it in this patchset, but we have to consider it >> while we are adding new signature field to "struct page", right? > > We won't need a new signature. The signature in both cases is there to > guarantee the page you are trying to recycle was indeed allocated by page_pool. > > Basically we got two design choices here: > - We store the page_pool ptr address directly in page->private and then, > we call into page_pool APIs directly to do the recycling. > That would eliminate the lookup through xdp_mem_info and the > XDP helpers to locate page pool pointer (through __xdp_return). > - You store the xdp_mem_info on page_private. In that case you need to go > through __xdp_return() to locate the page_pool pointer. Although we might > loose some performance that would allow us to recycle additional memory types > and not only MEM_TYPE_PAGE_POOL (in case we ever need it). So the signature field in "struct page" is used to only indicate a page is from a page pool, then how do we tell the content of page_private() if both of the above choices are needed, we might still need an extra indicator to tell page_private() is page_pool ptr or xdp_mem_info. It seems storing the page pool ptr in page_private() is clear for recyclable page from a recyclable skb use case, and the use case for storing xdp_mem_info in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the "struct xdp_frame", so it does not need the xdp_mem_info from page_private(). If the above is true, what does not really makes sense to me here is that: why do we first implement a unclear use case for storing xdp_mem_info in page_private(), why not implement the clear use case for storing page pool ptr in page_private() first? > > > I think both choices are sane. What I am trying to explain here, is > regardless of what we choose now, we can change it in the future without > affecting the API consumers at all. What will change internally is the way we > lookup the page pool pointer we are trying to recycle. It seems the below API need changing? +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page, + struct xdp_mem_info *mem) > > [...] > > > Cheers > /Ilias > > . >
On Fri, 7 May 2021 16:28:30 +0800 Yunsheng Lin <linyunsheng@huawei.com> wrote: > On 2021/5/7 15:06, Ilias Apalodimas wrote: > > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote: > >> On 2021/5/6 20:58, Ilias Apalodimas wrote: > >>>>>> > >>>>> > >>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we > >>>>> will always call page_pool_return_skb_page(). If the page signature matches > >>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will > >>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try > >>>>> to recycle the page. If it's not we'll release it from page_pool (releasing > >>>>> some internal references we keep) unmap the buffer and decrement the refcnt. > >>>> > >>>> Yes, I understood the above is what the page pool do now. > >>>> > >>>> But the question is who is still holding an extral reference to the page when > >>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral > >>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb > >>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later? > >>>> So that we can always reuse the recyclable page from a recyclable skb. This may > >>>> make the page_pool_destroy() process delays longer than before, I am supposed the > >>>> page_pool_destroy() delaying for cloned skb case does not really matters here. > >>>> > >>>> If the above works, I think the samiliar handling can be added to RX zerocopy if > >>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable > >>>> skb too? > >>>> > >>> > >>> Right, this sounds doable, but I'll have to go back code it and see if it > >>> really makes sense. However I'd still prefer the support to go in as-is > >>> (including the struct xdp_mem_info in struct page, instead of a page_pool > >>> pointer). > >>> > >>> There's a couple of reasons for that. If we keep the struct xdp_mem_info we > >>> can in the future recycle different kind of buffers using __xdp_return(). > >>> And this is a non intrusive change if we choose to store the page pool address > >>> directly in the future. It just affects the internal contract between the > >>> page_pool code and struct page. So it won't affect any drivers that already > >>> use the feature. > >> > >> This patchset has embeded a signature field in "struct page", and xdp_mem_info > >> is stored in page_private(), which seems not considering the case for associating > >> the page pool with "struct page" directly yet? > > > > Correct > > > >> Is the page pool also stored in > >> page_private() and a different signature is used to indicate that? > > > > No only struct xdp_mem_info as you mentioned before > > > >> > >> I am not saying we have to do it in this patchset, but we have to consider it > >> while we are adding new signature field to "struct page", right? > > > > We won't need a new signature. The signature in both cases is there to > > guarantee the page you are trying to recycle was indeed allocated by page_pool. > > > > Basically we got two design choices here: > > - We store the page_pool ptr address directly in page->private and then, > > we call into page_pool APIs directly to do the recycling. > > That would eliminate the lookup through xdp_mem_info and the > > XDP helpers to locate page pool pointer (through __xdp_return). > > - You store the xdp_mem_info on page_private. In that case you need to go > > through __xdp_return() to locate the page_pool pointer. Although we might > > loose some performance that would allow us to recycle additional memory types > > and not only MEM_TYPE_PAGE_POOL (in case we ever need it). > > So the signature field in "struct page" is used to only indicate a page is > from a page pool, then how do we tell the content of page_private() if both of > the above choices are needed, we might still need an extra indicator to tell > page_private() is page_pool ptr or xdp_mem_info. The signature field in "struct page" and "xdp_mem_info" is a double construct that was introduced in this patchset. AFAIK Matteo took the idea from Jonathan's patchset. I'm not convinced we need both, maybe later we do (when someone introduce a new mem model ala NetGPU). I think Jonathan's use-case was NetGPU[1] (which got shutdown due to Nvidia[2] being involved which I think was unfair). The general idea behind NetGPU makes sense to me, to allow packet headers to live in first page, and second page belongs to hardware. This implies that an SKB can can point to two different pages with different memory types, which need to be handled correctly when freeing the SKB and the pages it points to. Thus, placing (xdp_)mem_info in SKB is wrong as it implies all pages belong the same mem_info.type. The point is, when designing this I want us to think about how our design can handle other memory models than just page_pool. In this patchset design, we use a single bit in SKB to indicate that the pages pointed comes from another memory model, in this case page_pool is the only user of this bit. The remaining info about the memory model (page_pool) is stored in struct-page, which we look at when freeing the pages that the SKB points to (that is at the layer above the MM-calls that would free the page for real). [1] https://linuxplumbersconf.org/event/7/contributions/670/ [2] https://lwn.net/Articles/827596/ > It seems storing the page pool ptr in page_private() is clear for recyclable > page from a recyclable skb use case, and the use case for storing xdp_mem_info > in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the > "struct xdp_frame", so it does not need the xdp_mem_info from page_private(). > > If the above is true, what does not really makes sense to me here is that: > why do we first implement a unclear use case for storing xdp_mem_info in > page_private(), why not implement the clear use case for storing page pool ptr > in page_private() first? I'm actually not against storing page_pool object ptr directly in struct-page. It is a nice optimization. Perhaps we should implement this optimization outside this patchset first, and let __xdp_return() for XDP-redirected packets also take advantage to this optimization? Then it would feel more natural to also used this optimization in this patchset, right? > > > > > > I think both choices are sane. What I am trying to explain here, is > > regardless of what we choose now, we can change it in the future without > > affecting the API consumers at all. What will change internally is the way we > > lookup the page pool pointer we are trying to recycle. > > It seems the below API need changing? > +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page, > + struct xdp_mem_info *mem) I don't think we need to change this API, to support future memory models. Notice that xdp_mem_info have a 'type' member. Naming in Computer Science is a hard problem ;-). Something that seems to confuse a lot of people is the naming of the struct "xdp_mem_info". Maybe we should have named it "mem_info" instead or "net_mem_info", as it doesn't indicate that the device is running XDP. I see XDP as the RX-layer before the network stack, that helps drivers to support different memory models, also for handling normal packets that doesn't get process by XDP, and the drivers doesn't even need to support XDP to use the "xdp_mem_info" type.
On Fri, May 07, 2021 at 12:19:53PM +0200, Jesper Dangaard Brouer wrote: > Nvidia[2] being involved which I think was unfair). The general idea > behind NetGPU makes sense to me, Sorry, but that is utter bullshit. It was rejected because it did nothing but injecting hooks for an out of tree module while ignoring the existing kernel infrastructure for much of what it tries to archive.
Jesper Dangaard Brouer <brouer@redhat.com> writes: > On Fri, 7 May 2021 16:28:30 +0800 > Yunsheng Lin <linyunsheng@huawei.com> wrote: > >> On 2021/5/7 15:06, Ilias Apalodimas wrote: >> > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote: >> >> On 2021/5/6 20:58, Ilias Apalodimas wrote: >> >>>>>> >> >>>>> > ... >> > >> > >> > I think both choices are sane. What I am trying to explain >> > here, is >> > regardless of what we choose now, we can change it in the >> > future without >> > affecting the API consumers at all. What will change >> > internally is the way we >> > lookup the page pool pointer we are trying to recycle. >> >> It seems the below API need changing? >> +static inline void skb_mark_for_recycle(struct sk_buff *skb, >> struct page *page, >> + struct xdp_mem_info *mem) > > I don't think we need to change this API, to support future > memory > models. Notice that xdp_mem_info have a 'type' member. Hi, Providing that we will (possibly as a future optimization) store the pointer to the page pool in struct page instead of strcut xdp_mem_info, passing xdp_mem_info * instead of struct page_pool * would mean that for every packet we'll need to call xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params); xa->page_pool; which might pressure the Dcache to fetch a pointer that might be present already in cache as part of driver's data-structures. I tend to agree with Yunsheng that it makes more sense to adjust the API for the clear use-case now rather than using xdp_mem_info indirection. It seems to me like the page signature provides the same information anyway and allows to support different memory types. Shay > > Naming in Computer Science is a hard problem ;-). Something that > seems > to confuse a lot of people is the naming of the struct > "xdp_mem_info". > Maybe we should have named it "mem_info" instead or > "net_mem_info", as > it doesn't indicate that the device is running XDP. > > I see XDP as the RX-layer before the network stack, that helps > drivers > to support different memory models, also for handling normal > packets > that doesn't get process by XDP, and the drivers doesn't even > need to > support XDP to use the "xdp_mem_info" type.
On 2021/5/7 18:19, Jesper Dangaard Brouer wrote: > On Fri, 7 May 2021 16:28:30 +0800 > Yunsheng Lin <linyunsheng@huawei.com> wrote: > [...] > > I'm actually not against storing page_pool object ptr directly in > struct-page. It is a nice optimization. Perhaps we should implement > this optimization outside this patchset first, and let __xdp_return() > for XDP-redirected packets also take advantage to this optimization? > > Then it would feel more natural to also used this optimization in this > patchset, right? Yes, it would be good if XDP can take advantage of this optimization too. Then it seems we can remove the "mem_id_ht"? if we want to support different type of page pool in the future, the type field is in the page pool too when page_pool object ptr directly in struct-page. >
Hi Shay, On Sun, May 09, 2021 at 08:11:35AM +0300, Shay Agroskin wrote: > > Jesper Dangaard Brouer <brouer@redhat.com> writes: > > > On Fri, 7 May 2021 16:28:30 +0800 > > Yunsheng Lin <linyunsheng@huawei.com> wrote: > > > > > On 2021/5/7 15:06, Ilias Apalodimas wrote: > > > > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote: >> > > > On 2021/5/6 20:58, Ilias Apalodimas wrote: >>>>>> >>>>> > > ... > > > > > > I think both choices are sane. What I am trying to explain > > > > here, is > > > > regardless of what we choose now, we can change it in the > future > > > without > > > > affecting the API consumers at all. What will change > internally > > > is the way we > > > > lookup the page pool pointer we are trying to recycle. > > > > > > It seems the below API need changing? > > > +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct > > > page *page, > > > + struct xdp_mem_info *mem) > > > > I don't think we need to change this API, to support future memory > > models. Notice that xdp_mem_info have a 'type' member. > > Hi, > Providing that we will (possibly as a future optimization) store the pointer > to the page pool in struct page instead of strcut xdp_mem_info, passing > xdp_mem_info * instead of struct page_pool * would mean that for every > packet we'll need to call > xa = rhashtable_lookup(mem_id_ht, &mem->id, > mem_id_rht_params); > xa->page_pool; > > which might pressure the Dcache to fetch a pointer that might be present > already in cache as part of driver's data-structures. > > I tend to agree with Yunsheng that it makes more sense to adjust the API for > the clear use-case now rather than using xdp_mem_info indirection. It seems > to me like > the page signature provides the same information anyway and allows to > support different memory types. We've switched the patches already. We didn't notice any performance boost by doing so (tested on a machiattobin), but I agree as well. As I explained the only thing that will change if we ever the need the struct xdp_mem_info in struct page is the internal contract between struct page and the recycling function, so let's start clean and see if we ever need that. Cheers /Ilias > > Shay > > > > > Naming in Computer Science is a hard problem ;-). Something that seems > > to confuse a lot of people is the naming of the struct "xdp_mem_info". > > Maybe we should have named it "mem_info" instead or "net_mem_info", as > > it doesn't indicate that the device is running XDP. > > > > I see XDP as the RX-layer before the network stack, that helps drivers > > to support different memory models, also for handling normal packets > > that doesn't get process by XDP, and the drivers doesn't even need to > > support XDP to use the "xdp_mem_info" type. >
From: Matteo Croce <mcroce@microsoft.com> This is a respin of [1] This patchset shows the plans for allowing page_pool to handle and maintain DMA map/unmap of the pages it serves to the driver. For this to work a return hook in the network core is introduced. The overall purpose is to simplify drivers, by providing a page allocation API that does recycling, such that each driver doesn't have to reinvent its own recycling scheme. Using page_pool in a driver does not require implementing XDP support, but it makes it trivially easy to do so. Instead of allocating buffers specifically for SKBs we now allocate a generic buffer and either wrap it on an SKB (via build_skb) or create an XDP frame. The recycling code leverages the XDP recycle APIs. The Marvell mvpp2 and mvneta drivers are used in this patchset to demonstrate how to use the API, and tested on a MacchiatoBIN and EspressoBIN boards respectively. Please let this going in on a future -rc1 so to allow enough time to have wider tests. [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/ v2 -> v3: - added missing SOBs - CCed the MM people v1 -> v2: - fix a commit message - avoid setting pp_recycle multiple times on mvneta - squash two patches to avoid breaking bisect Ilias Apalodimas (1): page_pool: Allow drivers to hint on SKB recycling Jesper Dangaard Brouer (1): xdp: reduce size of struct xdp_mem_info Matteo Croce (3): mm: add a signature in struct page mvpp2: recycle buffers mvneta: recycle buffers .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +- drivers/net/ethernet/marvell/mvneta.c | 7 ++- .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++---- drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- include/linux/mm_types.h | 1 + include/linux/skbuff.h | 35 ++++++++++++-- include/net/page_pool.h | 15 ++++++ include/net/xdp.h | 5 +- net/core/page_pool.c | 47 +++++++++++++++++++ net/core/skbuff.c | 20 +++++++- net/core/xdp.c | 14 ++++-- net/tls/tls_device.c | 2 +- 13 files changed, 142 insertions(+), 27 deletions(-)