[net-next,v3,0/5] page_pool: recycle buffers

Message ID	20210409223801.104657-1-mcroce@linux.microsoft.com (mailing list archive)
Headers	show Return-Path: <SRS0=QLeX=JG=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2F7F261178 From: Matteo Croce <mcroce@linux.microsoft.com> To: netdev@vger.kernel.org, linux-mm@kvack.org Cc: Ayush Sawal <ayush.sawal@chelsio.com>, Vinay Kumar Yadav <vinay.yadav@chelsio.com>, Rohit Maheshwari <rohitm@chelsio.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Thomas Petazzoni <thomas.petazzoni@bootlin.com>, Marcin Wojtas <mw@semihalf.com>, Russell King <linux@armlinux.org.uk>, Mirko Lindner <mlindner@marvell.com>, Stephen Hemminger <stephen@networkplumber.org>, Tariq Toukan <tariqt@nvidia.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, John Fastabend <john.fastabend@gmail.com>, Boris Pismenny <borisp@nvidia.com>, Arnd Bergmann <arnd@arndb.de>, Andrew Morton <akpm@linux-foundation.org>, "Peter Zijlstra (Intel)" <peterz@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, Yu Zhao <yuzhao@google.com>, Will Deacon <will@kernel.org>, Michel Lespinasse <walken@google.com>, Fenghua Yu <fenghua.yu@intel.com>, Roman Gushchin <guro@fb.com>, Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>, Jason Gunthorpe <jgg@ziepe.ca>, Guoqing Jiang <guoqing.jiang@cloud.ionos.com>, Jonathan Lemon <jonathan.lemon@gmail.com>, Alexander Lobakin <alobakin@pm.me>, Cong Wang <cong.wang@bytedance.com>, wenxu <wenxu@ucloud.cn>, Kevin Hao <haokexin@gmail.com>, Aleksandr Nogikh <nogikh@google.com>, Jakub Sitnicki <jakub@cloudflare.com>, Marco Elver <elver@google.com>, Willem de Bruijn <willemb@google.com>, Miaohe Lin <linmiaohe@huawei.com>, Yunsheng Lin <linyunsheng@huawei.com>, Guillaume Nault <gnault@redhat.com>, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org, Matthew Wilcox <willy@infradead.org>, Eric Dumazet <edumazet@google.com>, David Ahern <dsahern@gmail.com>, Lorenzo Bianconi <lorenzo@kernel.org>, Saeed Mahameed <saeedm@nvidia.com>, Andrew Lunn <andrew@lunn.ch>, Paolo Abeni <pabeni@redhat.com> Subject: [PATCH net-next v3 0/5] page_pool: recycle buffers Date: Sat, 10 Apr 2021 00:37:56 +0200 Message-Id: <20210409223801.104657-1-mcroce@linux.microsoft.com> MIME-Version: 1.0 Received-SPF: none (gmail.com>: No applicable sender policy available) receiver=imf27; identity=mailfrom; envelope-from="<technoboy85@gmail.com>"; helo=mail-ej1-f52.google.com; client-ip=209.85.218.52 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	page_pool: recycle buffers \| expand [net-next,v3,0/5] page_pool: recycle buffers [net-next,v3,1/5] xdp: reduce size of struct xdp_mem_info [net-next,v3,3/5] page_pool: Allow drivers to hint on SKB recycling [net-next,v3,4/5] mvpp2: recycle buffers [net-next,v3,5/5] mvneta: recycle buffers

Matteo Croce April 9, 2021, 10:37 p.m. UTC

From: Matteo Croce <mcroce@microsoft.com>

This is a respin of [1]

This  patchset shows the plans for allowing page_pool to handle and
maintain DMA map/unmap of the pages it serves to the driver.  For this
to work a return hook in the network core is introduced.

The overall purpose is to simplify drivers, by providing a page
allocation API that does recycling, such that each driver doesn't have
to reinvent its own recycling scheme.  Using page_pool in a driver
does not require implementing XDP support, but it makes it trivially
easy to do so.  Instead of allocating buffers specifically for SKBs
we now allocate a generic buffer and either wrap it on an SKB
(via build_skb) or create an XDP frame.
The recycling code leverages the XDP recycle APIs.

The Marvell mvpp2 and mvneta drivers are used in this patchset to
demonstrate how to use the API, and tested on a MacchiatoBIN
and EspressoBIN boards respectively.

Please let this going in on a future -rc1 so to allow enough time
to have wider tests.

[1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/

v2 -> v3:
- added missing SOBs
- CCed the MM people

v1 -> v2:
- fix a commit message
- avoid setting pp_recycle multiple times on mvneta
- squash two patches to avoid breaking bisect

Ilias Apalodimas (1):
  page_pool: Allow drivers to hint on SKB recycling

Jesper Dangaard Brouer (1):
  xdp: reduce size of struct xdp_mem_info

Matteo Croce (3):
  mm: add a signature in struct page
  mvpp2: recycle buffers
  mvneta: recycle buffers

 .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
 drivers/net/ethernet/marvell/mvneta.c         |  7 ++-
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++----
 drivers/net/ethernet/marvell/sky2.c           |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  2 +-
 include/linux/mm_types.h                      |  1 +
 include/linux/skbuff.h                        | 35 ++++++++++++--
 include/net/page_pool.h                       | 15 ++++++
 include/net/xdp.h                             |  5 +-
 net/core/page_pool.c                          | 47 +++++++++++++++++++
 net/core/skbuff.c                             | 20 +++++++-
 net/core/xdp.c                                | 14 ++++--
 net/tls/tls_device.c                          |  2 +-
 13 files changed, 142 insertions(+), 27 deletions(-)

Matthew Wilcox April 10, 2021, 3:48 p.m. UTC | #1

On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> This is needed by the page_pool to avoid recycling a page not allocated
> via page_pool.

Is the PageType mechanism more appropriate to your needs?  It wouldn't
be if you use page->_mapcount (ie mapping it to userspace).

> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>  include/linux/mm_types.h | 1 +
>  include/net/page_pool.h  | 2 ++
>  net/core/page_pool.c     | 4 ++++
>  3 files changed, 7 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..ef2d0d5f62e4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -101,6 +101,7 @@ struct page {
>  			 * 32-bit architectures.
>  			 */
>  			dma_addr_t dma_addr;
> +			unsigned long signature;
>  		};
>  		struct {	/* slab, slob and slub */
>  			union {
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..b30405e84b5e 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -63,6 +63,8 @@
>   */
>  #define PP_ALLOC_CACHE_SIZE	128
>  #define PP_ALLOC_CACHE_REFILL	64
> +#define PP_SIGNATURE		0x20210303
> +
>  struct pp_alloc_cache {
>  	u32 count;
>  	void *cache[PP_ALLOC_CACHE_SIZE];
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..2ae9b554ef98 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>  
>  skip_dma_map:
> +	page->signature = PP_SIGNATURE;
> +
>  	/* Track how many pages are held 'in-flight' */
>  	pool->pages_state_hold_cnt++;
>  
> @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>  			     DMA_ATTR_SKIP_CPU_SYNC);
>  	page->dma_addr = 0;
>  skip_dma_unmap:
> +	page->signature = 0;
> +
>  	/* This may be the last page returned, releasing the pool, so
>  	 * it is not safe to reference pool afterwards.
>  	 */
> -- 
> 2.30.2
>

Ilias Apalodimas April 10, 2021, 4:16 p.m. UTC | #2

Hi Matthew 

On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > This is needed by the page_pool to avoid recycling a page not allocated
> > via page_pool.
> 
> Is the PageType mechanism more appropriate to your needs?  It wouldn't
> be if you use page->_mapcount (ie mapping it to userspace).

Interesting!
Please keep in mind this was written ~2018 and was stale on my branches for
quite some time.  So back then I did try to use PageType, but had not free
bits.  Looking at it again though, it's cleaned up.  So yes I think this can
be much much cleaner.  Should we go and define a new PG_pagepool?


Thanks!
/Ilias
> 
> > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > ---
> >  include/linux/mm_types.h | 1 +
> >  include/net/page_pool.h  | 2 ++
> >  net/core/page_pool.c     | 4 ++++
> >  3 files changed, 7 insertions(+)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6613b26a8894..ef2d0d5f62e4 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -101,6 +101,7 @@ struct page {
> >  			 * 32-bit architectures.
> >  			 */
> >  			dma_addr_t dma_addr;
> > +			unsigned long signature;
> >  		};
> >  		struct {	/* slab, slob and slub */
> >  			union {
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index b5b195305346..b30405e84b5e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -63,6 +63,8 @@
> >   */
> >  #define PP_ALLOC_CACHE_SIZE	128
> >  #define PP_ALLOC_CACHE_REFILL	64
> > +#define PP_SIGNATURE		0x20210303
> > +
> >  struct pp_alloc_cache {
> >  	u32 count;
> >  	void *cache[PP_ALLOC_CACHE_SIZE];
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index ad8b0707af04..2ae9b554ef98 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> >  		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> >  
> >  skip_dma_map:
> > +	page->signature = PP_SIGNATURE;
> > +
> >  	/* Track how many pages are held 'in-flight' */
> >  	pool->pages_state_hold_cnt++;
> >  
> > @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
> >  			     DMA_ATTR_SKIP_CPU_SYNC);
> >  	page->dma_addr = 0;
> >  skip_dma_unmap:
> > +	page->signature = 0;
> > +
> >  	/* This may be the last page returned, releasing the pool, so
> >  	 * it is not safe to reference pool afterwards.
> >  	 */
> > -- 
> > 2.30.2
> >

Shakeel Butt April 10, 2021, 5:42 p.m. UTC | #3

On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Matthew
>
> On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > This is needed by the page_pool to avoid recycling a page not allocated
> > > via page_pool.
> >
> > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > be if you use page->_mapcount (ie mapping it to userspace).
>
> Interesting!
> Please keep in mind this was written ~2018 and was stale on my branches for
> quite some time.  So back then I did try to use PageType, but had not free
> bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> be much much cleaner.  Should we go and define a new PG_pagepool?
>
>

Can this page_pool be used for TCP RX zerocopy? If yes then PageType
can not be used.

There is a recent discussion [1] on memcg accounting of TCP RX
zerocopy and I am wondering if this work can somehow help in that
regard. I will take a look at the series.

[1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Ilias Apalodimas April 10, 2021, 6:27 p.m. UTC | #4

Hi Shakeel, 

On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote:
> On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> > Hi Matthew
> >
> > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > > This is needed by the page_pool to avoid recycling a page not allocated
> > > > via page_pool.
> > >
> > > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > > be if you use page->_mapcount (ie mapping it to userspace).
> >
> > Interesting!
> > Please keep in mind this was written ~2018 and was stale on my branches for
> > quite some time.  So back then I did try to use PageType, but had not free
> > bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> > be much much cleaner.  Should we go and define a new PG_pagepool?
> >
> >
> 
> Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> can not be used.

Yes it can, since it's going to be used as your default allocator for
payloads, which might end up on an SKB.
So we have to keep the extra added field on struct page for our mark.
Matthew had an intersting idea.  He suggested keeping it, but changing the 
magic number, so it can't be a kernel address, but I'll let him follow 
up on the details.

> 
> There is a recent discussion [1] on memcg accounting of TCP RX
> zerocopy and I am wondering if this work can somehow help in that
> regard. I will take a look at the series.
> 

I'll try having a look on this as well. The idea behind the patchset is to
allow lower speed NICs that use the API already, gain recycling 'easily'.  
Using page_pool for the driver comes with a penalty to begin with.
Allocating pages instead of SKBs has a measurable difference. By enabling them
to recycle they'll get better performance, since you skip the
reallocation/remapping and only care for syncing the buffers correctly.

> [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Matthew Wilcox April 10, 2021, 7:39 p.m. UTC | #5

On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote:
> > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > can not be used.
> 
> Yes it can, since it's going to be used as your default allocator for
> payloads, which might end up on an SKB.
> So we have to keep the extra added field on struct page for our mark.
> Matthew had an intersting idea.  He suggested keeping it, but changing the 
> magic number, so it can't be a kernel address, but I'll let him follow 
> up on the details.

Sure!  So, given the misalignment problem I discovered yesterday [1],
we probably want a page_pool page to look like:

unsigned long	flags;
unsigned long	pp_magic;
unsigned long	xmi;
unsigned long	_pp_mapping_pad;
dma_addr_t	dma_addr;	/* might be one or two words */

The only real restriction here is that pp_magic should not be a valid
pointer, and it must have the bottom bit clear.  I'd recommend something
like:

#define PP_MAGIC	(0x20 + POISON_POINTER_DELTA)

This leaves page->mapping as NULL, so you don't have to worry about
clearing it before free.

[1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/

Jesper Dangaard Brouer April 11, 2021, 10:05 a.m. UTC | #6

On Sat, 10 Apr 2021 20:39:55 +0100
Matthew Wilcox <willy@infradead.org> wrote:

> On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote:
> > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > can not be used.  
> > 
> > Yes it can, since it's going to be used as your default allocator for
> > payloads, which might end up on an SKB.
> > So we have to keep the extra added field on struct page for our mark.
> > Matthew had an intersting idea.  He suggested keeping it, but changing the 
> > magic number, so it can't be a kernel address, but I'll let him follow 
> > up on the details.  
> 
> Sure!  So, given the misalignment problem I discovered yesterday [1],
> we probably want a page_pool page to look like:
> 
> unsigned long	flags;
> unsigned long	pp_magic;
> unsigned long	xmi;
> unsigned long	_pp_mapping_pad;
> dma_addr_t	dma_addr;	/* might be one or two words */
> 
> The only real restriction here is that pp_magic should not be a valid
> pointer, and it must have the bottom bit clear.  I'd recommend something
> like:
> 
> #define PP_MAGIC	(0x20 + POISON_POINTER_DELTA)
> 
> This leaves page->mapping as NULL, so you don't have to worry about
> clearing it before free.
>
> [1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/

I didn't see this, before asking[2] for explaining your intent.
I still worry about page->index, see [2].

[2] https://lore.kernel.org/netdev/20210411114307.5087f958@carbon/

Jesper Dangaard Brouer April 14, 2021, 7:41 p.m. UTC | #7

On Sat, 10 Apr 2021 21:27:31 +0300
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote:
>
> > On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
> > <ilias.apalodimas@linaro.org> wrote:  
> > >
> > > Hi Matthew
> > >
> > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:  
> > > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:  
> > > > > This is needed by the page_pool to avoid recycling a page not allocated
> > > > > via page_pool.  
> > > >
> > > > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > > > be if you use page->_mapcount (ie mapping it to userspace).  
> > >
> > > Interesting!
> > > Please keep in mind this was written ~2018 and was stale on my branches for
> > > quite some time.  So back then I did try to use PageType, but had not free
> > > bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> > > be much much cleaner.  Should we go and define a new PG_pagepool?
> > >
> > 
> > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > can not be used.  
> 
> Yes it can, since it's going to be used as your default allocator for
> payloads, which might end up on an SKB.

I'm not sure we want or should "allow" page_pool be used for TCP RX
zerocopy.
For several reasons.

(1) This implies mapping these pages page to userspace, which AFAIK
means using page->mapping and page->index members (right?).

(2) It feels wrong (security wise) to keep the DMA-mapping (for the
device) and also map this page into userspace.

(3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
zerocopy will bump the refcnt, which means the page_pool will not
recycle the page when it see the elevated refcnt (it will instead
release its DMA-mapping).

(4) I remember vaguely that this code path for (TCP RX zerocopy) uses
page->private for tricks.  And our patch [3/5] use page->private for
storing xdp_mem_info.

IMHO when the SKB travel into this TCP RX zerocopy code path, we should
call page_pool_release_page() to release its DMA-mapping.

> > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Shakeel Butt April 14, 2021, 8:09 p.m. UTC | #8

On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
[...]
> > >
> > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > can not be used.
> >
> > Yes it can, since it's going to be used as your default allocator for
> > payloads, which might end up on an SKB.
>
> I'm not sure we want or should "allow" page_pool be used for TCP RX
> zerocopy.
> For several reasons.
>
> (1) This implies mapping these pages page to userspace, which AFAIK
> means using page->mapping and page->index members (right?).
>

No, only page->_mapcount is used.

> (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> device) and also map this page into userspace.
>

I think this is already the case i.e pages still DMA-mapped and also
mapped into userspace.

> (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> zerocopy will bump the refcnt, which means the page_pool will not
> recycle the page when it see the elevated refcnt (it will instead
> release its DMA-mapping).

Yes this is right but the userspace might have already consumed and
unmapped the page before the driver considers to recycle the page.

>
> (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> page->private for tricks.  And our patch [3/5] use page->private for
> storing xdp_mem_info.
>
> IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> call page_pool_release_page() to release its DMA-mapping.
>

I will let TCP RX zerocopy experts respond to this but from my high
level code inspection, I didn't see page->private usage.

Eric Dumazet April 14, 2021, 8:51 p.m. UTC | #9

On Wed, Apr 14, 2021 at 10:09 PM Shakeel Butt <shakeelb@google.com> wrote:

>
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

Indeed, we do not use page->private, since we do not own the page(s).

Ilias Apalodimas April 19, 2021, 5:12 a.m. UTC | #10

On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> [...]
> > > >
> > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > can not be used.
> > >
> > > Yes it can, since it's going to be used as your default allocator for
> > > payloads, which might end up on an SKB.
> >
> > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > zerocopy.
> > For several reasons.
> >
> > (1) This implies mapping these pages page to userspace, which AFAIK
> > means using page->mapping and page->index members (right?).
> >
> 
> No, only page->_mapcount is used.
> 

I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
adopt the recycling mechanism we should try preserving the current
functionality of the network stack.

The question is how does it work with the current drivers that already have an
internal page recycling mechanism.

> > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > device) and also map this page into userspace.
> >
> 
> I think this is already the case i.e pages still DMA-mapped and also
> mapped into userspace.
> 
> > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > zerocopy will bump the refcnt, which means the page_pool will not
> > recycle the page when it see the elevated refcnt (it will instead
> > release its DMA-mapping).
> 
> Yes this is right but the userspace might have already consumed and
> unmapped the page before the driver considers to recycle the page.

Same question here. I'll have a closer look in a few days and make sure we are
not breaking anything wrt zerocopy.

> 
> >
> > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > page->private for tricks.  And our patch [3/5] use page->private for
> > storing xdp_mem_info.
> >
> > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > call page_pool_release_page() to release its DMA-mapping.
> >
> 
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

Shakeel are you aware of any 'easy' way I can have rx zerocopy running?


Thanks!
/Ilias

Jesper Dangaard Brouer April 19, 2021, 11:22 a.m. UTC | #11

On Wed, 14 Apr 2021 13:09:47 -0700
Shakeel Butt <shakeelb@google.com> wrote:

> On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >  
> [...]
> > > >
> > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > can not be used.  
> > >
> > > Yes it can, since it's going to be used as your default allocator for
> > > payloads, which might end up on an SKB.  
> >
> > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > zerocopy.
> > For several reasons.
> >
> > (1) This implies mapping these pages page to userspace, which AFAIK
> > means using page->mapping and page->index members (right?).
> >  
> 
> No, only page->_mapcount is used.

Good to know.
I will admit that I don't fully understand the usage of page->mapping
and page->index members.

> > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > device) and also map this page into userspace.
> >  
> 
> I think this is already the case i.e pages still DMA-mapped and also
> mapped into userspace.

True, other drivers are doing the same.

> > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > zerocopy will bump the refcnt, which means the page_pool will not
> > recycle the page when it see the elevated refcnt (it will instead
> > release its DMA-mapping).  
> 
> Yes this is right but the userspace might have already consumed and
> unmapped the page before the driver considers to recycle the page.

That is a good point.  So, there is a race window where it is possible
to gain recycling.

It seems my page_pool co-maintainer Ilias is interested in taking up the
challenge to get this working with TCP RX zerocopy.  So, lets see how
this is doable.

 
> >
> > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > page->private for tricks.  And our patch [3/5] use page->private for
> > storing xdp_mem_info.
> >
> > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > call page_pool_release_page() to release its DMA-mapping.
> >  
> 
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

I trust when Eric says page->private isn't used in this code path.
So, it might actually be possible :-)

I will challenge Ilias and Matteo to pull this off. (But I know that
both of them are busy for personal reasons, so be patient with them).

Matthew Wilcox April 19, 2021, 1:01 p.m. UTC | #12

On Mon, Apr 19, 2021 at 01:22:04PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 14 Apr 2021 13:09:47 -0700
> Shakeel Butt <shakeelb@google.com> wrote:
> 
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > >  
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.  
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.  
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >  
> > 
> > No, only page->_mapcount is used.
> 
> Good to know.
> I will admit that I don't fully understand the usage of page->mapping
> and page->index members.

That's fair.  It's not well-documented, and it's complicated.

For a page mapped into userspace, page->mapping is one of:
 - NULL
 - A pointer to a file's address_space
 - A pointer to an anonymous page's anon_vma
If a page isn't mapped into userspace, you can use the space in page->mapping
for anything you like (eg slab uses it)

page->index is only used for indicating pfmemalloc today (and I want to
move that indicator).  I think it can also be used to merge VMAs (if
some other conditions are also true), but failing to merge VMAs isn't
a big deal for this kind of situation.

> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> > 
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
> 
> True, other drivers are doing the same.

And the contents of this page already came from that device ... if it
wanted to write bad data, it could already have done so.

> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).  
> > 
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
> 
> That is a good point.  So, there is a race window where it is possible
> to gain recycling.
> 
> It seems my page_pool co-maintainer Ilias is interested in taking up the
> challenge to get this working with TCP RX zerocopy.  So, lets see how
> this is doable.

You could also check page_ref_count() - page_mapcount() instead of
just checking page_ref_count().  Assuming mapping/unmapping can't
race with recycling?

Shakeel Butt April 19, 2021, 2:57 p.m. UTC | #13

On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > >
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >
> >
> > No, only page->_mapcount is used.
> >
>
> I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> adopt the recycling mechanism we should try preserving the current
> functionality of the network stack.
>
> The question is how does it work with the current drivers that already have an
> internal page recycling mechanism.
>

I think the current drivers check page_ref_count(page) to decide to
reuse (or not) the already allocated pages.

Some examples from the drivers:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()

> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> > >
> >
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
> >
> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).
> >
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
>
> Same question here. I'll have a closer look in a few days and make sure we are
> not breaking anything wrt zerocopy.
>

Pages mapped into the userspace have their refcnt elevated, so the
page_ref_count() check by the drivers indicates to not reuse such
pages.

> >
> > >
> > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > page->private for tricks.  And our patch [3/5] use page->private for
> > > storing xdp_mem_info.
> > >
> > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > call page_pool_release_page() to release its DMA-mapping.
> > >
> >
> > I will let TCP RX zerocopy experts respond to this but from my high
> > level code inspection, I didn't see page->private usage.
>
> Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
>

I would recommend tools/testing/selftests/net/tcp_mmap.c.

Ilias Apalodimas April 19, 2021, 3:43 p.m. UTC | #14

Hi Shakeel,
On Mon, Apr 19, 2021 at 07:57:03AM -0700, Shakeel Butt wrote:
> On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > > <brouer@redhat.com> wrote:
> > > >
> > > [...]
> > > > > >
> > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > > can not be used.
> > > > >
> > > > > Yes it can, since it's going to be used as your default allocator for
> > > > > payloads, which might end up on an SKB.
> > > >
> > > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > > zerocopy.
> > > > For several reasons.
> > > >
> > > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > > means using page->mapping and page->index members (right?).
> > > >
> > >
> > > No, only page->_mapcount is used.
> > >
> >
> > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> > adopt the recycling mechanism we should try preserving the current
> > functionality of the network stack.
> >
> > The question is how does it work with the current drivers that already have an
> > internal page recycling mechanism.
> >
> 
> I think the current drivers check page_ref_count(page) to decide to
> reuse (or not) the already allocated pages.
> 
> Some examples from the drivers:
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
> drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()
> 

Yes, that's how internal recycling is done in drivers. As Jesper mentioned the
refcnt of the page is 1 for the page_pool owned pages and that's how we decide 
what to do with the page.

> > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > > device) and also map this page into userspace.
> > > >
> > >
> > > I think this is already the case i.e pages still DMA-mapped and also
> > > mapped into userspace.
> > >
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).
> > >
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> >
> > Same question here. I'll have a closer look in a few days and make sure we are
> > not breaking anything wrt zerocopy.
> >
> 
> Pages mapped into the userspace have their refcnt elevated, so the
> page_ref_count() check by the drivers indicates to not reuse such
> pages.
> 

When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() 
which will end up doing a get_page().
What you are saying is that once the zerocopy is done though, skb_release_data() 
won't be called, but instead put_page() will be? If that's the case then we are 
indeed leaking DMA mappings and memory. That sounds weird though, since the
refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
eventually frees the page? 
If kfree_skb() (or any wrapper that calls skb_release_data()) is called 
eventually, we'll end up properly recycling the page into our pool.

> > >
> > > >
> > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > > page->private for tricks.  And our patch [3/5] use page->private for
> > > > storing xdp_mem_info.
> > > >
> > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > > call page_pool_release_page() to release its DMA-mapping.
> > > >
> > >
> > > I will let TCP RX zerocopy experts respond to this but from my high
> > > level code inspection, I didn't see page->private usage.
> >
> > Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
> >
> 
> I would recommend tools/testing/selftests/net/tcp_mmap.c.

Ok, thanks I'll have a look.

Cheers
/Ilias

Shakeel Butt April 19, 2021, 4:21 p.m. UTC | #15

On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
[...]
> > Pages mapped into the userspace have their refcnt elevated, so the
> > page_ref_count() check by the drivers indicates to not reuse such
> > pages.
> >
>
> When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch()
> which will end up doing a get_page().
> What you are saying is that once the zerocopy is done though, skb_release_data()
> won't be called, but instead put_page() will be? If that's the case then we are
> indeed leaking DMA mappings and memory. That sounds weird though, since the
> refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
> eventually frees the page?
> If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> eventually, we'll end up properly recycling the page into our pool.
>

From what I understand (Eric, please correct me if I'm wrong) for
simple cases there are 3 page references taken. One by the driver,
second by skb and third by page table.

In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
page ref through insert_page_into_pte_locked(). However before
returning from tcp_zerocopy_receive(), the skb references are dropped
through tcp_recv_skb(). So, whenever the user unmaps the page and
drops the page ref only then that page can be reused by the driver.

In my understanding, for zerocopy rx the skb_release_data() is called
on the pages while they are still mapped into the userspace. So,
skb_release_data() might not be the right place to recycle the page
for zerocopy. The email chain at [1] has some discussion on how to
bundle the recycling of pages with their lifetime.

[1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Ilias Apalodimas April 19, 2021, 6:41 p.m. UTC | #16

On Mon, Apr 19, 2021 at 09:21:55AM -0700, Shakeel Butt wrote:
> On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> [...]
> > > Pages mapped into the userspace have their refcnt elevated, so the
> > > page_ref_count() check by the drivers indicates to not reuse such
> > > pages.
> > >
> >
> > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch()
> > which will end up doing a get_page().
> > What you are saying is that once the zerocopy is done though, skb_release_data()
> > won't be called, but instead put_page() will be? If that's the case then we are
> > indeed leaking DMA mappings and memory. That sounds weird though, since the
> > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
> > eventually frees the page?
> > If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> > eventually, we'll end up properly recycling the page into our pool.
> >
> 
> From what I understand (Eric, please correct me if I'm wrong) for
> simple cases there are 3 page references taken. One by the driver,
> second by skb and third by page table.
> 
> In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
> page ref through insert_page_into_pte_locked(). However before
> returning from tcp_zerocopy_receive(), the skb references are dropped
> through tcp_recv_skb(). So, whenever the user unmaps the page and
> drops the page ref only then that page can be reused by the driver.
> 
> In my understanding, for zerocopy rx the skb_release_data() is called
> on the pages while they are still mapped into the userspace. So,
> skb_release_data() might not be the right place to recycle the page
> for zerocopy. The email chain at [1] has some discussion on how to
> bundle the recycling of pages with their lifetime.
> 
> [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Ah right, you mentioned the same email before and I completely forgot about
it! In the past we had thoughts of 'stealing' the page on put_page instead of 
skb_release_data().  We were afraid that this would cause a measurable 
performance hit, so we tried to limit it within the skb lifecycle.

However I don't think this will be a problem.  Assuming we are right here and 
skb_release_data() is called while the userspace holds an extra reference from
the mapping here's what will happen:

skb_release_data() -> skb_free_head() -> page_pool_return_skb_page() ->
set_page_private() -> xdp_return_skb_frame() -> __xdp_return() -> 
page_pool_put_full_page() -> page_pool_put_page() -> __page_pool_put_page()

When we call __page_pool_put_page(), the refcnt will be != 1 (because a user
mapping is still active), so we won't try to recycle it. Instead we'll remove 
the DMA mappings and decrease the refcnt.

So although the recycling won't 'work', nothing bad will happen (famous last
words).

In any case, I'll double check with the test you pointed out before v4.

Thanks!
/Ilias

Ilias Apalodimas April 20, 2021, 8:10 a.m. UTC | #17

Hi Matthew,

[...]
> 
> And the contents of this page already came from that device ... if it
> wanted to write bad data, it could already have done so.
> 
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).  
> > > 
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> > 
> > That is a good point.  So, there is a race window where it is possible
> > to gain recycling.
> > 
> > It seems my page_pool co-maintainer Ilias is interested in taking up the
> > challenge to get this working with TCP RX zerocopy.  So, lets see how
> > this is doable.
> 
> You could also check page_ref_count() - page_mapcount() instead of
> just checking page_ref_count().  Assuming mapping/unmapping can't
> race with recycling?
> 

That's not a bad idea.  As I explained on my last reply to Shakeel, I don't
think the current patch will blow up anywhere.  If the page is unmapped prior
to kfree_skb() it will be recycled.  If it's done in a reverse order, we'll
just free the page entirely and will have to re-allocate it.
The only thing I need to test is potential races (assuming those can even
happen?).

Trying to recycle the page outside of kfree_skb() means we'd have to 'steal'
the page, during put_page() (or some function that's outside the networking
scope).  I think this is going to have a measurable performance penalty though
not in networking, but in general.

In any case, that should be orthogonal to the current patchset.  So unless
someone feels strongly about it, I'd prefer keeping the current code and
trying to enable recycling in the skb zc case, when we have enough users of
the API.


Thanks
/Ilias

Yunsheng Lin April 29, 2021, 8:27 a.m. UTC | #18

On 2021/4/10 6:37, Matteo Croce wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> This is a respin of [1]
> 
> This  patchset shows the plans for allowing page_pool to handle and
> maintain DMA map/unmap of the pages it serves to the driver.  For this
> to work a return hook in the network core is introduced.
> 
> The overall purpose is to simplify drivers, by providing a page
> allocation API that does recycling, such that each driver doesn't have
> to reinvent its own recycling scheme.  Using page_pool in a driver
> does not require implementing XDP support, but it makes it trivially
> easy to do so.  Instead of allocating buffers specifically for SKBs
> we now allocate a generic buffer and either wrap it on an SKB
> (via build_skb) or create an XDP frame.
> The recycling code leverages the XDP recycle APIs.
> 
> The Marvell mvpp2 and mvneta drivers are used in this patchset to
> demonstrate how to use the API, and tested on a MacchiatoBIN
> and EspressoBIN boards respectively.
> 

Hi, Matteo
     I added the skb frag page recycling in hns3 based on this patchset,
and it has above 10%~20% performance improvement for one thread iperf
TCP flow(IOMMU is off, there may be more performance improvement if
considering the DMA map/unmap avoiding for IOMMU), thanks for the job.

    The skb frag page recycling support in hns3 driver is not so simple
as the mvpp2 and mvneta driver, because:

1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info"
   is added to assist relation binding between the "struct page" and
   "struct page_pool".

2. the hns3 driver has already a page reusing based on page spliting and
   page reference count, but it may not work if the upper stack can not
   handle skb and release the corresponding page fast enough.

3. the hns3 driver support page reference count updating batching, see:
   aeda9bf87a45 ("net: hns3: batch the page reference count updates")

So it would be better if：

1. skb frag page recycling do not need "struct xdp_rxq_info" or
   "struct xdp_mem_info" to bond the relation between "struct page" and
   "struct page_pool", which seems uncessary at this point if bonding
   a "struct page_pool" pointer directly in "struct page" does not cause
   space increasing.

2. it would be good to do the page reference count updating batching
   in page pool instead of specific driver.


page_pool_atomic_sub_if_positive() is added to decide who can call
page_pool_put_full_page(), because the driver and stack may hold
reference to the same page, only if last one which hold complete
reference to a page can call page_pool_put_full_page() to decide if
recycling is possible, if not, the page is released, so I am wondering
if a similar page_pool_atomic_sub_if_positive() can added to specific
user space address unmapping path to allow skb recycling for RX zerocopy
too?

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index c21dd11..8b01a7d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2566,7 +2566,10 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
        unsigned int order = hns3_page_order(ring);
        struct page *p;

-       p = dev_alloc_pages(order);
+       if (ring->page_pool)
+               p = page_pool_dev_alloc_pages(ring->page_pool);
+       else
+               p = dev_alloc_pages(order);
        if (!p)
                return -ENOMEM;

@@ -2582,13 +2585,32 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
        return 0;
 }

+static void hns3_page_frag_cache_drain(struct hns3_enet_ring *ring,
+                                      struct hns3_desc_cb *cb)
+{
+       if (ring->page_pool) {
+               struct page *p = cb->priv;
+
+               if (page_pool_atomic_sub_if_positive(p, cb->pagecnt_bias))
+                       return;
+
+               if (cb->pagecnt_bias > 1)
+                       page_ref_sub(p, cb->pagecnt_bias - 1);
+
+               page_pool_put_full_page(ring->page_pool, p, false);
+               return;
+       }
+
+       __page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+}
+
 static void hns3_free_buffer(struct hns3_enet_ring *ring,
                             struct hns3_desc_cb *cb, int budget)
 {
        if (cb->type == DESC_TYPE_SKB)
                napi_consume_skb(cb->priv, budget);
        else if (!HNAE3_IS_TX_RING(ring) && cb->pagecnt_bias)
-               __page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+               hns3_page_frag_cache_drain(ring, cb);
        memset(cb, 0, sizeof(*cb));
 }

@@ -2892,13 +2914,15 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
        skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len,
                        size - pull_len, truesize);

+       skb_mark_for_recycle(skb, desc_cb->priv, &ring->rxq_info.mem);
+
        /* Avoid re-using remote and pfmemalloc pages, or the stack is still
         * using the page when page_offset rollback to zero, flag default
         * unreuse
         */
        if (!dev_page_is_reusable(desc_cb->priv) ||
            (!desc_cb->page_offset && !hns3_can_reuse_page(desc_cb))) {
-               __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias);
+               hns3_page_frag_cache_drain(ring, desc_cb);
                return;
        }

@@ -2911,7 +2935,7 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
                desc_cb->reuse_flag = 1;
                desc_cb->page_offset = 0;
        } else if (desc_cb->pagecnt_bias) {
-               __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias);
+               hns3_page_frag_cache_drain(ring, desc_cb);
                return;
        }

@@ -3156,8 +3180,7 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length,
                if (dev_page_is_reusable(desc_cb->priv))
                        desc_cb->reuse_flag = 1;
                else /* This page cannot be reused so discard it */
-                       __page_frag_cache_drain(desc_cb->priv,
-                                               desc_cb->pagecnt_bias);
+                       hns3_page_frag_cache_drain(ring, desc_cb);

                hns3_rx_ring_move_fw(ring);
                return 0;
@@ -4028,6 +4051,33 @@ static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
                goto out_with_desc_cb;

        if (!HNAE3_IS_TX_RING(ring)) {
+               struct page_pool_params pp_params = {
+               /* internal DMA mapping in page_pool */
+               .flags = 0,
+               .order = 0,
+               .pool_size = 1024,
+               .nid = dev_to_node(ring_to_dev(ring)),
+               .dev = ring_to_dev(ring),
+               .dma_dir = DMA_FROM_DEVICE,
+               .offset = 0,
+               .max_len = 0,
+               };
+
+               ring->page_pool = page_pool_create(&pp_params);
+               if (IS_ERR(ring->page_pool)) {
+                       dev_err(ring_to_dev(ring), "page pool creation failed\n");
+                       ring->page_pool = NULL;
+               }
+
+               ret = xdp_rxq_info_reg(&ring->rxq_info, ring_to_netdev(ring), ring->queue_index, 0);
+               if (ret)
+                       dev_err(ring_to_dev(ring), "xdp_rxq_info_reg failed\n");
+
+               ret = xdp_rxq_info_reg_mem_model(&ring->rxq_info, MEM_TYPE_PAGE_POOL,
+                                                ring->page_pool);
+               if (ret)
+                       dev_err(ring_to_dev(ring), "xdp_rxq_info_reg_mem_model failed\n");
+
                ret = hns3_alloc_ring_buffers(ring);
                if (ret)
                        goto out_with_desc;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index daa04ae..fd53fcc 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -6,6 +6,9 @@

 #include <linux/if_vlan.h>

+#include <net/page_pool.h>
+#include <net/xdp.h>
+
 #include "hnae3.h"

 enum hns3_nic_state {
@@ -408,6 +411,8 @@ struct hns3_enet_ring {
        struct hnae3_queue *tqp;
        int queue_index;
        struct device *dev; /* will be used for DMA mapping of descriptors */
+       struct page_pool *page_pool;
        struct hnae3_queue *tqp;
        int queue_index;
        struct device *dev; /* will be used for DMA mapping of descriptors */
+       struct page_pool *page_pool;
+       struct xdp_rxq_info rxq_info;

        /* statistic */
        struct ring_stats stats;
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 75fffc1..70c310e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -195,6 +195,8 @@ static inline void page_pool_put_full_page(struct page_pool *pool,
 #endif
 }

+bool page_pool_atomic_sub_if_positive(struct page *page, int i);
+
 /* Same as above but the caller must guarantee safe context. e.g NAPI */
 static inline void page_pool_recycle_direct(struct page_pool *pool,
                                            struct page *page)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 43bfd2e..8bc8b7e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -596,6 +596,26 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);

+bool page_pool_atomic_sub_if_positive(struct page *page, int i)
+{
+       atomic_t *v = &page->_refcount;
+       int dec, c;
+
+       do {
+               c = atomic_read(v);
+
+               dec = c - i;
+               if (unlikely(dec == 0))
+                       return false;
+               else if (unlikely(dec < 0)) {
+                       pr_err("c: %d, dec: %d, i: %d\n", c, dec, i);
+                       return false;
+               }
+       } while (!atomic_try_cmpxchg(v, &c, dec));
+
+       return true;
+}
+
 bool page_pool_return_skb_page(void *data)
 {
        struct xdp_mem_info mem_info;
@@ -606,6 +626,9 @@ bool page_pool_return_skb_page(void *data)
        if (unlikely(page->signature != PP_SIGNATURE))
                return false;

+       if (page_pool_atomic_sub_if_positive(page, 1))
+               return true;
+
        info.raw = page_private(page);
        mem_info = info.mem_info;

Ilias Apalodimas April 29, 2021, 6:51 p.m. UTC | #19

Hi Yunsheng,

On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote:
> On 2021/4/10 6:37, Matteo Croce wrote:
> > From: Matteo Croce <mcroce@microsoft.com>
> > 
> > This is a respin of [1]
> > 
> > This  patchset shows the plans for allowing page_pool to handle and
> > maintain DMA map/unmap of the pages it serves to the driver.  For this
> > to work a return hook in the network core is introduced.
> > 
> > The overall purpose is to simplify drivers, by providing a page
> > allocation API that does recycling, such that each driver doesn't have
> > to reinvent its own recycling scheme.  Using page_pool in a driver
> > does not require implementing XDP support, but it makes it trivially
> > easy to do so.  Instead of allocating buffers specifically for SKBs
> > we now allocate a generic buffer and either wrap it on an SKB
> > (via build_skb) or create an XDP frame.
> > The recycling code leverages the XDP recycle APIs.
> > 
> > The Marvell mvpp2 and mvneta drivers are used in this patchset to
> > demonstrate how to use the API, and tested on a MacchiatoBIN
> > and EspressoBIN boards respectively.
> > 
> 
> Hi, Matteo
>      I added the skb frag page recycling in hns3 based on this patchset,
> and it has above 10%~20% performance improvement for one thread iperf
> TCP flow(IOMMU is off, there may be more performance improvement if
> considering the DMA map/unmap avoiding for IOMMU), thanks for the job.
> 
>     The skb frag page recycling support in hns3 driver is not so simple
> as the mvpp2 and mvneta driver, because:
> 
> 1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info"
>    is added to assist relation binding between the "struct page" and
>    "struct page_pool".
> 
> 2. the hns3 driver has already a page reusing based on page spliting and
>    page reference count, but it may not work if the upper stack can not
>    handle skb and release the corresponding page fast enough.
> 
> 3. the hns3 driver support page reference count updating batching, see:
>    aeda9bf87a45 ("net: hns3: batch the page reference count updates")
> 
> So it would be better if：
> 
> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
>    "struct xdp_mem_info" to bond the relation between "struct page" and
>    "struct page_pool", which seems uncessary at this point if bonding
>    a "struct page_pool" pointer directly in "struct page" does not cause
>    space increasing.

We can't do that. The reason we need those structs is that we rely on the
existing XDP code, which already recycles it's buffers, to enable
recycling.  Since we allocate a page per packet when using page_pool for a
driver , the same ideas apply to an SKB and XDP frame. We just recycle the
payload and we don't really care what's in that.  We could rename the functions
to something more generic in the future though ?

> 
> 2. it would be good to do the page reference count updating batching
>    in page pool instead of specific driver.
> 
> 
> page_pool_atomic_sub_if_positive() is added to decide who can call
> page_pool_put_full_page(), because the driver and stack may hold
> reference to the same page, only if last one which hold complete
> reference to a page can call page_pool_put_full_page() to decide if
> recycling is possible, if not, the page is released, so I am wondering
> if a similar page_pool_atomic_sub_if_positive() can added to specific
> user space address unmapping path to allow skb recycling for RX zerocopy
> too?
> 

I would prefer a different page pool type if we wanted to support the split
page model.  The changes as is are quite intrusive, since they change the 
entire skb return path.  So I would prefer introducing the changes one at a 
time. 

The fundamental difference between having the recycling in the driver vs
having it in a generic API is pretty straightforward.  When a driver holds
the extra page references he is free to decide what to reuse, when he is about
to refill his Rx descriptors.  So TCP zerocopy might work even if the
userspace applications hold the buffers for an X amount of time.
On this proposal though we *need* to decide what to do with the buffer when we
are about to free the skb.

[...]


Cheers
/Ilias

Yunsheng Lin April 30, 2021, 3:01 a.m. UTC | #20

On 2021/4/30 2:51, Ilias Apalodimas wrote:
> Hi Yunsheng,
> 
> On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote:
>> On 2021/4/10 6:37, Matteo Croce wrote:
>>> From: Matteo Croce <mcroce@microsoft.com>

[...]

>>
>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
>>    "struct xdp_mem_info" to bond the relation between "struct page" and
>>    "struct page_pool", which seems uncessary at this point if bonding
>>    a "struct page_pool" pointer directly in "struct page" does not cause
>>    space increasing.
> 
> We can't do that. The reason we need those structs is that we rely on the
> existing XDP code, which already recycles it's buffers, to enable
> recycling.  Since we allocate a page per packet when using page_pool for a
> driver , the same ideas apply to an SKB and XDP frame. We just recycle the

I am not really familar with XDP here, but a packet from hw is either a
"struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
the same time, right?

What does not really make sense to me is that the page has to be from page
pool when a skb's frag page can be recycled, right? If it is ture, the switch
case in __xdp_return() does not really make sense for skb recycling, why go
all the trouble of checking the mem->type and mem->id to find the page_pool
pointer when recyclable page for skb can only be from page pool?

> payload and we don't really care what's in that.  We could rename the functions
> to something more generic in the future though ?
> 
>>
>> 2. it would be good to do the page reference count updating batching
>>    in page pool instead of specific driver.
>>
>>
>> page_pool_atomic_sub_if_positive() is added to decide who can call
>> page_pool_put_full_page(), because the driver and stack may hold
>> reference to the same page, only if last one which hold complete
>> reference to a page can call page_pool_put_full_page() to decide if
>> recycling is possible, if not, the page is released, so I am wondering
>> if a similar page_pool_atomic_sub_if_positive() can added to specific
>> user space address unmapping path to allow skb recycling for RX zerocopy
>> too?
>>
> 
> I would prefer a different page pool type if we wanted to support the split
> page model.  The changes as is are quite intrusive, since they change the 
> entire skb return path.  So I would prefer introducing the changes one at a 
> time. 

I understand there may be fundamental semantic change when split page model
is supported by page pool, but the split page support change mainly affect the
skb recycling path and the driver that uses page pool(XDP too) if we are careful
enough, not the entire skb return path as my understanding.

Anyway, one changes at a time is always prefered if supporting split page is
proved to be non-trivel and intrusive.

> 
> The fundamental difference between having the recycling in the driver vs
> having it in a generic API is pretty straightforward.  When a driver holds
> the extra page references he is free to decide what to reuse, when he is about
> to refill his Rx descriptors.  So TCP zerocopy might work even if the
> userspace applications hold the buffers for an X amount of time.
> On this proposal though we *need* to decide what to do with the buffer when we
> are about to free the skb.

I am not sure I understand what you meant by "free the skb", does it mean
that kfree_skb() is called to free the skb.

As my understanding, if the skb completely own the page(which means page_count()
== 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
try to handle it atomically.

> 
> [...]
> 
> 
> Cheers
> /Ilias
> 
> .
>

Ilias Apalodimas April 30, 2021, 4:24 p.m. UTC | #21

[...]
> >>
> >> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
> >>    "struct xdp_mem_info" to bond the relation between "struct page" and
> >>    "struct page_pool", which seems uncessary at this point if bonding
> >>    a "struct page_pool" pointer directly in "struct page" does not cause
> >>    space increasing.
> > 
> > We can't do that. The reason we need those structs is that we rely on the
> > existing XDP code, which already recycles it's buffers, to enable
> > recycling.  Since we allocate a page per packet when using page_pool for a
> > driver , the same ideas apply to an SKB and XDP frame. We just recycle the
> 
> I am not really familar with XDP here, but a packet from hw is either a
> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> the same time, right?
> 

Yes, but the payload is irrelevant in both cases and that's what we use
page_pool for.  You can't use this patchset unless your driver usues
build_skb().  So in both cases you just allocate memory for the payload and
decide what the wrap the buffer with (XDP or SKB) later.

> What does not really make sense to me is that the page has to be from page
> pool when a skb's frag page can be recycled, right? If it is ture, the switch
> case in __xdp_return() does not really make sense for skb recycling, why go
> all the trouble of checking the mem->type and mem->id to find the page_pool
> pointer when recyclable page for skb can only be from page pool?

In any case you need to find in which pool the buffer you try to recycle
belongs.  In order to make the whole idea generic and be able to recycle skb 
fragments instead of just the skb head you need to store some information on 
struct page.  That's the fundamental difference of this patchset compared to 
the RFC we sent a few years back [1] which was just storing information on the 
skb.  The way this is done on the current patchset is that we store the 
struct xdp_mem_info in page->private and then look it up on xdp_return().

Now that being said Matthew recently reworked struct page, so we could see if
we can store the page pool pointer directly instead of the struct
xdp_mem_info. That would allow us to call into page pool functions directly.
But we'll have to agree if that makes sense to go into struct page to begin
with and make sure the pointer is still valid when we take the recycling path.

> > payload and we don't really care what's in that.  We could rename the functions
> > to something more generic in the future though ?
> > 
> >>
> >> 2. it would be good to do the page reference count updating batching
> >>    in page pool instead of specific driver.
> >>
> >>
> >> page_pool_atomic_sub_if_positive() is added to decide who can call
> >> page_pool_put_full_page(), because the driver and stack may hold
> >> reference to the same page, only if last one which hold complete
> >> reference to a page can call page_pool_put_full_page() to decide if
> >> recycling is possible, if not, the page is released, so I am wondering
> >> if a similar page_pool_atomic_sub_if_positive() can added to specific
> >> user space address unmapping path to allow skb recycling for RX zerocopy
> >> too?
> >>
> > 
> > I would prefer a different page pool type if we wanted to support the split
> > page model.  The changes as is are quite intrusive, since they change the 
> > entire skb return path.  So I would prefer introducing the changes one at a 
> > time. 
> 
> I understand there may be fundamental semantic change when split page model
> is supported by page pool, but the split page support change mainly affect the
> skb recycling path and the driver that uses page pool(XDP too) if we are careful
> enough, not the entire skb return path as my understanding.

It affects those drivers only, but in order to do so is intercepts the
packet in skb_free_head(), which pretty much affects the entire network path.

> 
> Anyway, one changes at a time is always prefered if supporting split page is
> proved to be non-trivel and intrusive.
> 
> > 
> > The fundamental difference between having the recycling in the driver vs
> > having it in a generic API is pretty straightforward.  When a driver holds
> > the extra page references he is free to decide what to reuse, when he is about
> > to refill his Rx descriptors.  So TCP zerocopy might work even if the
> > userspace applications hold the buffers for an X amount of time.
> > On this proposal though we *need* to decide what to do with the buffer when we
> > are about to free the skb.
> 
> I am not sure I understand what you meant by "free the skb", does it mean
> that kfree_skb() is called to free the skb.

Yes

> 
> As my understanding, if the skb completely own the page(which means page_count()
> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
> try to handle it atomically.
> 

Not really, the opposite is happening here. If the pp_recycle bit is set we
will always call page_pool_return_skb_page().  If the page signature matches
the 'magic' set by page pool we will always call xdp_return_skb_frame() will
end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
to recycle the page.  If it's not we'll release it from page_pool (releasing
some internal references we keep) unmap the buffer and decrement the refcnt.

[1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/

Cheers
/Ilias

Ilias Apalodimas April 30, 2021, 5:32 p.m. UTC | #22

(-cc invalid emails)
Replying to my self here but....

[...]
> > >
> > > We can't do that. The reason we need those structs is that we rely on the
> > > existing XDP code, which already recycles it's buffers, to enable
> > > recycling.  Since we allocate a page per packet when using page_pool for a
> > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the
> >
> > I am not really familar with XDP here, but a packet from hw is either a
> > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> > the same time, right?
> >
>
> Yes, but the payload is irrelevant in both cases and that's what we use
> page_pool for.  You can't use this patchset unless your driver usues
> build_skb().  So in both cases you just allocate memory for the payload and
> decide what the wrap the buffer with (XDP or SKB) later.
>
> > What does not really make sense to me is that the page has to be from page
> > pool when a skb's frag page can be recycled, right? If it is ture, the switch
> > case in __xdp_return() does not really make sense for skb recycling, why go
> > all the trouble of checking the mem->type and mem->id to find the page_pool
> > pointer when recyclable page for skb can only be from page pool?
>
> In any case you need to find in which pool the buffer you try to recycle
> belongs.  In order to make the whole idea generic and be able to recycle skb
> fragments instead of just the skb head you need to store some information on
> struct page.  That's the fundamental difference of this patchset compared to
> the RFC we sent a few years back [1] which was just storing information on the
> skb.  The way this is done on the current patchset is that we store the
> struct xdp_mem_info in page->private and then look it up on xdp_return().
>
> Now that being said Matthew recently reworked struct page, so we could see if
> we can store the page pool pointer directly instead of the struct
> xdp_mem_info. That would allow us to call into page pool functions directly.
> But we'll have to agree if that makes sense to go into struct page to begin
> with and make sure the pointer is still valid when we take the recycling path.
>

Thinking more about it the reason that prevented us from storing a
page pool pointer directly is not there anymore. Jesper fixed that
already a while back. So we might as well store the page_pool ptr in
page->private and call into the functions directly.  I'll have a look
before v4.

[...]

Thanks
/Ilias

Jesper Dangaard Brouer May 3, 2021, 7:29 a.m. UTC | #23

On Fri, 30 Apr 2021 20:32:07 +0300
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> (-cc invalid emails)
> Replying to my self here but....
> 
> [...]
> > > >
> > > > We can't do that. The reason we need those structs is that we rely on the
> > > > existing XDP code, which already recycles it's buffers, to enable
> > > > recycling.  Since we allocate a page per packet when using page_pool for a
> > > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the  
> > >
> > > I am not really familar with XDP here, but a packet from hw is either a
> > > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> > > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> > > the same time, right?
> > >  
> >
> > Yes, but the payload is irrelevant in both cases and that's what we use
> > page_pool for.  You can't use this patchset unless your driver usues
> > build_skb().  So in both cases you just allocate memory for the payload and
> > decide what the wrap the buffer with (XDP or SKB) later.
> >  
> > > What does not really make sense to me is that the page has to be from page
> > > pool when a skb's frag page can be recycled, right? If it is ture, the switch
> > > case in __xdp_return() does not really make sense for skb recycling, why go
> > > all the trouble of checking the mem->type and mem->id to find the page_pool
> > > pointer when recyclable page for skb can only be from page pool?  
> >
> > In any case you need to find in which pool the buffer you try to recycle
> > belongs.  In order to make the whole idea generic and be able to recycle skb
> > fragments instead of just the skb head you need to store some information on
> > struct page.  That's the fundamental difference of this patchset compared to
> > the RFC we sent a few years back [1] which was just storing information on the
> > skb.  The way this is done on the current patchset is that we store the
> > struct xdp_mem_info in page->private and then look it up on xdp_return().
> >
> > Now that being said Matthew recently reworked struct page, so we could see if
> > we can store the page pool pointer directly instead of the struct
> > xdp_mem_info. That would allow us to call into page pool functions directly.
> > But we'll have to agree if that makes sense to go into struct page to begin
> > with and make sure the pointer is still valid when we take the recycling path.
> >  
> 
> Thinking more about it the reason that prevented us from storing a
> page pool pointer directly is not there anymore. Jesper fixed that
> already a while back. So we might as well store the page_pool ptr in
> page->private and call into the functions directly.  I'll have a look
> before v4.

I want to give credit to Jonathan Lemon whom came up with the idea of
storing the page_pool object that "owns" the page directly in struct
page.  I see this as an optimization that we can add later, so it
doesn't block this patchset.  As Ilias mention, it required some
work/changes[1]+[2] to guarantee that the page_pool object life-time
were longer than all the outstanding in-flight page-objects, but that
have been stable for some/many kernel releases now.  This is already
need/used for making sure the DMA-mappings can be safely released[1],
but I on-purpose enabled the same in-flight tracking for page_pool
users that doesn't use the DMA-mapping feature (making sure the code is
exercised).


[1] 99c07c43c4ea ("xdp: tracking page_pool resources and safe removal")
[2] c3f812cea0d7 ("page_pool: do not release pool until inflight == 0.")

Yunsheng Lin May 6, 2021, 12:34 p.m. UTC | #24

On 2021/5/1 0:24, Ilias Apalodimas wrote:
> [...]
>>>>
>>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
>>>>    "struct xdp_mem_info" to bond the relation between "struct page" and
>>>>    "struct page_pool", which seems uncessary at this point if bonding
>>>>    a "struct page_pool" pointer directly in "struct page" does not cause
>>>>    space increasing.
>>>
>>> We can't do that. The reason we need those structs is that we rely on the
>>> existing XDP code, which already recycles it's buffers, to enable
>>> recycling.  Since we allocate a page per packet when using page_pool for a
>>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the
>>
>> I am not really familar with XDP here, but a packet from hw is either a
>> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
>> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
>> the same time, right?
>>
> 
> Yes, but the payload is irrelevant in both cases and that's what we use
> page_pool for.  You can't use this patchset unless your driver usues
> build_skb().  So in both cases you just allocate memory for the payload and

I am not sure I understood why build_skb() matters here. If the head data of
a skb is a page frag and is from page pool, then it's page->signature should be
PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does
not require it's head data being from a page pool, right?

> decide what the wrap the buffer with (XDP or SKB) later.

[...]

>>
>> I am not sure I understand what you meant by "free the skb", does it mean
>> that kfree_skb() is called to free the skb.
> 
> Yes
> 
>>
>> As my understanding, if the skb completely own the page(which means page_count()
>> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
>> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
>> try to handle it atomically.
>>
> 
> Not really, the opposite is happening here. If the pp_recycle bit is set we
> will always call page_pool_return_skb_page().  If the page signature matches
> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> to recycle the page.  If it's not we'll release it from page_pool (releasing
> some internal references we keep) unmap the buffer and decrement the refcnt.

Yes, I understood the above is what the page pool do now.

But the question is who is still holding an extral reference to the page when
kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
reference to the same page? So why not just do a page_ref_dec() if the orginal skb
is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
So that we can always reuse the recyclable page from a recyclable skb. This may
make the page_pool_destroy() process delays longer than before, I am supposed the
page_pool_destroy() delaying for cloned skb case does not really matters here.

If the above works, I think the samiliar handling can be added to RX zerocopy if
the RX zerocopy also hold extral references to the recyclable page from a recyclable
skb too?

> 
> [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
> 
> Cheers
> /Ilias
> 
> .
>

Ilias Apalodimas May 6, 2021, 12:58 p.m. UTC | #25

On Thu, May 06, 2021 at 08:34:48PM +0800, Yunsheng Lin wrote:
> On 2021/5/1 0:24, Ilias Apalodimas wrote:
> > [...]
> >>>>
> >>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or
> >>>>    "struct xdp_mem_info" to bond the relation between "struct page" and
> >>>>    "struct page_pool", which seems uncessary at this point if bonding
> >>>>    a "struct page_pool" pointer directly in "struct page" does not cause
> >>>>    space increasing.
> >>>
> >>> We can't do that. The reason we need those structs is that we rely on the
> >>> existing XDP code, which already recycles it's buffers, to enable
> >>> recycling.  Since we allocate a page per packet when using page_pool for a
> >>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the
> >>
> >> I am not really familar with XDP here, but a packet from hw is either a
> >> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
> >> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
> >> the same time, right?
> >>
> > 
> > Yes, but the payload is irrelevant in both cases and that's what we use
> > page_pool for.  You can't use this patchset unless your driver usues
> > build_skb().  So in both cases you just allocate memory for the payload and
> 
> I am not sure I understood why build_skb() matters here. If the head data of
> a skb is a page frag and is from page pool, then it's page->signature should be
> PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does
> not require it's head data being from a page pool, right?
> 

Correct, and that's the big improvement compared to the original RFC.
The wording was a bit off in my initial response.  I was trying to point out
you can recycle *any* buffer coming from page_pool and one of the ways you can
do that in your driver, is use build_skb() while the payload is allocated by
page_pool.

> > decide what the wrap the buffer with (XDP or SKB) later.
> 
> [...]
> 
> >>
> >> I am not sure I understand what you meant by "free the skb", does it mean
> >> that kfree_skb() is called to free the skb.
> > 
> > Yes
> > 
> >>
> >> As my understanding, if the skb completely own the page(which means page_count()
> >> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
> >> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
> >> try to handle it atomically.
> >>
> > 
> > Not really, the opposite is happening here. If the pp_recycle bit is set we
> > will always call page_pool_return_skb_page().  If the page signature matches
> > the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> > end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> > to recycle the page.  If it's not we'll release it from page_pool (releasing
> > some internal references we keep) unmap the buffer and decrement the refcnt.
> 
> Yes, I understood the above is what the page pool do now.
> 
> But the question is who is still holding an extral reference to the page when
> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
> So that we can always reuse the recyclable page from a recyclable skb. This may
> make the page_pool_destroy() process delays longer than before, I am supposed the
> page_pool_destroy() delaying for cloned skb case does not really matters here.
> 
> If the above works, I think the samiliar handling can be added to RX zerocopy if
> the RX zerocopy also hold extral references to the recyclable page from a recyclable
> skb too?
> 

Right, this sounds doable, but I'll have to go back code it and see if it
really makes sense.  However I'd still prefer the support to go in as-is
(including the struct xdp_mem_info in struct page, instead of a page_pool
pointer).

There's a couple of reasons for that.  If we keep the struct xdp_mem_info we
can in the future recycle different kind of buffers using __xdp_return().
And this is a non intrusive change if we choose to store the page pool address
directly in the future.  It just affects the internal contract between the
page_pool code and struct page.  So it won't affect any drivers that already
use the feature.
Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer
playing it safe for now and getting rid of the buffers that somehow ended up
holding an extra reference.  Once this gets approved we can go back and try to
save the extra space.  I hope I am not wrong but the changes required to
support a few extra refcounts should not change the current patches much.

Thanks for taking the time on this!
/Ilias

> > 
> > [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
> > 
> > Cheers
> > /Ilias
> > 
> > .
> > 
>

Yunsheng Lin May 7, 2021, 3:23 a.m. UTC | #26

On 2021/5/6 20:58, Ilias Apalodimas wrote:
>>>>
>>>
>>> Not really, the opposite is happening here. If the pp_recycle bit is set we
>>> will always call page_pool_return_skb_page().  If the page signature matches
>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
>>> to recycle the page.  If it's not we'll release it from page_pool (releasing
>>> some internal references we keep) unmap the buffer and decrement the refcnt.
>>
>> Yes, I understood the above is what the page pool do now.
>>
>> But the question is who is still holding an extral reference to the page when
>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
>> So that we can always reuse the recyclable page from a recyclable skb. This may
>> make the page_pool_destroy() process delays longer than before, I am supposed the
>> page_pool_destroy() delaying for cloned skb case does not really matters here.
>>
>> If the above works, I think the samiliar handling can be added to RX zerocopy if
>> the RX zerocopy also hold extral references to the recyclable page from a recyclable
>> skb too?
>>
> 
> Right, this sounds doable, but I'll have to go back code it and see if it
> really makes sense.  However I'd still prefer the support to go in as-is
> (including the struct xdp_mem_info in struct page, instead of a page_pool
> pointer).
> 
> There's a couple of reasons for that.  If we keep the struct xdp_mem_info we
> can in the future recycle different kind of buffers using __xdp_return().
> And this is a non intrusive change if we choose to store the page pool address
> directly in the future.  It just affects the internal contract between the
> page_pool code and struct page.  So it won't affect any drivers that already
> use the feature.

This patchset has embeded a signature field in "struct page", and xdp_mem_info
is stored in page_private(), which seems not considering the case for associating
the page pool with "struct page" directly yet? Is the page pool also stored in
page_private() and a different signature is used to indicate that?

I am not saying we have to do it in this patchset, but we have to consider it
while we are adding new signature field to "struct page", right?

> Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer
> playing it safe for now and getting rid of the buffers that somehow ended up
> holding an extra reference.  Once this gets approved we can go back and try to
> save the extra space.  I hope I am not wrong but the changes required to
> support a few extra refcounts should not change the current patches much.
> 
> Thanks for taking the time on this!

Thanks all invovled in the effort improving page pool too:)

> /Ilias
> 
>>>
>>> [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/
>>>
>>> Cheers
>>> /Ilias
>>>
>>> .
>>>
>>
> 
> .
>

Ilias Apalodimas May 7, 2021, 7:06 a.m. UTC | #27

On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:
> On 2021/5/6 20:58, Ilias Apalodimas wrote:
> >>>>
> >>>
> >>> Not really, the opposite is happening here. If the pp_recycle bit is set we
> >>> will always call page_pool_return_skb_page().  If the page signature matches
> >>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> >>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> >>> to recycle the page.  If it's not we'll release it from page_pool (releasing
> >>> some internal references we keep) unmap the buffer and decrement the refcnt.
> >>
> >> Yes, I understood the above is what the page pool do now.
> >>
> >> But the question is who is still holding an extral reference to the page when
> >> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
> >> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
> >> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
> >> So that we can always reuse the recyclable page from a recyclable skb. This may
> >> make the page_pool_destroy() process delays longer than before, I am supposed the
> >> page_pool_destroy() delaying for cloned skb case does not really matters here.
> >>
> >> If the above works, I think the samiliar handling can be added to RX zerocopy if
> >> the RX zerocopy also hold extral references to the recyclable page from a recyclable
> >> skb too?
> >>
> > 
> > Right, this sounds doable, but I'll have to go back code it and see if it
> > really makes sense.  However I'd still prefer the support to go in as-is
> > (including the struct xdp_mem_info in struct page, instead of a page_pool
> > pointer).
> > 
> > There's a couple of reasons for that.  If we keep the struct xdp_mem_info we
> > can in the future recycle different kind of buffers using __xdp_return().
> > And this is a non intrusive change if we choose to store the page pool address
> > directly in the future.  It just affects the internal contract between the
> > page_pool code and struct page.  So it won't affect any drivers that already
> > use the feature.
> 
> This patchset has embeded a signature field in "struct page", and xdp_mem_info
> is stored in page_private(), which seems not considering the case for associating
> the page pool with "struct page" directly yet? 

Correct

> Is the page pool also stored in
> page_private() and a different signature is used to indicate that?

No only struct xdp_mem_info as you mentioned before

> 
> I am not saying we have to do it in this patchset, but we have to consider it
> while we are adding new signature field to "struct page", right?

We won't need a new signature.  The signature in both cases is there to 
guarantee the page you are trying to recycle was indeed allocated by page_pool.

Basically we got two design choices here: 
- We store the page_pool ptr address directly in page->private and then,
  we call into page_pool APIs directly to do the recycling.
  That would eliminate the lookup through xdp_mem_info and the
  XDP helpers to locate page pool pointer (through __xdp_return).
- You store the xdp_mem_info on page_private.  In that case you need to go
  through __xdp_return()  to locate the page_pool pointer. Although we might
  loose some performance that would allow us to recycle additional memory types
  and not only MEM_TYPE_PAGE_POOL (in case we ever need it).


I think both choices are sane.  What I am trying to explain here, is
regardless of what we choose now, we can change it in the future without
affecting the API consumers at all.  What will change internally is the way we
lookup the page pool pointer we are trying to recycle.

[...]


Cheers
/Ilias

Yunsheng Lin May 7, 2021, 8:28 a.m. UTC | #28

On 2021/5/7 15:06, Ilias Apalodimas wrote:
> On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:
>> On 2021/5/6 20:58, Ilias Apalodimas wrote:
>>>>>>
>>>>>
>>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we
>>>>> will always call page_pool_return_skb_page().  If the page signature matches
>>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
>>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
>>>>> to recycle the page.  If it's not we'll release it from page_pool (releasing
>>>>> some internal references we keep) unmap the buffer and decrement the refcnt.
>>>>
>>>> Yes, I understood the above is what the page pool do now.
>>>>
>>>> But the question is who is still holding an extral reference to the page when
>>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
>>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
>>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
>>>> So that we can always reuse the recyclable page from a recyclable skb. This may
>>>> make the page_pool_destroy() process delays longer than before, I am supposed the
>>>> page_pool_destroy() delaying for cloned skb case does not really matters here.
>>>>
>>>> If the above works, I think the samiliar handling can be added to RX zerocopy if
>>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable
>>>> skb too?
>>>>
>>>
>>> Right, this sounds doable, but I'll have to go back code it and see if it
>>> really makes sense.  However I'd still prefer the support to go in as-is
>>> (including the struct xdp_mem_info in struct page, instead of a page_pool
>>> pointer).
>>>
>>> There's a couple of reasons for that.  If we keep the struct xdp_mem_info we
>>> can in the future recycle different kind of buffers using __xdp_return().
>>> And this is a non intrusive change if we choose to store the page pool address
>>> directly in the future.  It just affects the internal contract between the
>>> page_pool code and struct page.  So it won't affect any drivers that already
>>> use the feature.
>>
>> This patchset has embeded a signature field in "struct page", and xdp_mem_info
>> is stored in page_private(), which seems not considering the case for associating
>> the page pool with "struct page" directly yet? 
> 
> Correct
> 
>> Is the page pool also stored in
>> page_private() and a different signature is used to indicate that?
> 
> No only struct xdp_mem_info as you mentioned before
> 
>>
>> I am not saying we have to do it in this patchset, but we have to consider it
>> while we are adding new signature field to "struct page", right?
> 
> We won't need a new signature.  The signature in both cases is there to 
> guarantee the page you are trying to recycle was indeed allocated by page_pool.
> 
> Basically we got two design choices here: 
> - We store the page_pool ptr address directly in page->private and then,
>   we call into page_pool APIs directly to do the recycling.
>   That would eliminate the lookup through xdp_mem_info and the
>   XDP helpers to locate page pool pointer (through __xdp_return).
> - You store the xdp_mem_info on page_private.  In that case you need to go
>   through __xdp_return()  to locate the page_pool pointer. Although we might
>   loose some performance that would allow us to recycle additional memory types
>   and not only MEM_TYPE_PAGE_POOL (in case we ever need it).

So the signature field  in "struct page" is used to only indicate a page is
from a page pool, then how do we tell the content of page_private() if both of
the above choices are needed, we might still need an extra indicator to tell
page_private() is page_pool ptr or xdp_mem_info.

It seems storing the page pool ptr in page_private() is clear for recyclable
page from a recyclable skb use case, and the use case for storing xdp_mem_info
in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the
"struct xdp_frame", so it does not need the xdp_mem_info from page_private().

If the above is true, what does not really makes sense to me here is that:
why do we first implement a unclear use case for storing xdp_mem_info in
page_private(), why not implement the clear use case for storing page pool ptr
in page_private() first?

> 
> 
> I think both choices are sane.  What I am trying to explain here, is
> regardless of what we choose now, we can change it in the future without
> affecting the API consumers at all.  What will change internally is the way we
> lookup the page pool pointer we are trying to recycle.

It seems the below API need changing?
+static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
+					struct xdp_mem_info *mem)


> 
> [...]
> 
> 
> Cheers
> /Ilias
> 
> .
>

Jesper Dangaard Brouer May 7, 2021, 10:19 a.m. UTC | #29

On Fri, 7 May 2021 16:28:30 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> On 2021/5/7 15:06, Ilias Apalodimas wrote:
> > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:  
> >> On 2021/5/6 20:58, Ilias Apalodimas wrote:  
> >>>>>>  
> >>>>>
> >>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we
> >>>>> will always call page_pool_return_skb_page().  If the page signature matches
> >>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will
> >>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
> >>>>> to recycle the page.  If it's not we'll release it from page_pool (releasing
> >>>>> some internal references we keep) unmap the buffer and decrement the refcnt.  
> >>>>
> >>>> Yes, I understood the above is what the page pool do now.
> >>>>
> >>>> But the question is who is still holding an extral reference to the page when
> >>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
> >>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb
> >>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
> >>>> So that we can always reuse the recyclable page from a recyclable skb. This may
> >>>> make the page_pool_destroy() process delays longer than before, I am supposed the
> >>>> page_pool_destroy() delaying for cloned skb case does not really matters here.
> >>>>
> >>>> If the above works, I think the samiliar handling can be added to RX zerocopy if
> >>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable
> >>>> skb too?
> >>>>  
> >>>
> >>> Right, this sounds doable, but I'll have to go back code it and see if it
> >>> really makes sense.  However I'd still prefer the support to go in as-is
> >>> (including the struct xdp_mem_info in struct page, instead of a page_pool
> >>> pointer).
> >>>
> >>> There's a couple of reasons for that.  If we keep the struct xdp_mem_info we
> >>> can in the future recycle different kind of buffers using __xdp_return().
> >>> And this is a non intrusive change if we choose to store the page pool address
> >>> directly in the future.  It just affects the internal contract between the
> >>> page_pool code and struct page.  So it won't affect any drivers that already
> >>> use the feature.  
> >>
> >> This patchset has embeded a signature field in "struct page", and xdp_mem_info
> >> is stored in page_private(), which seems not considering the case for associating
> >> the page pool with "struct page" directly yet?   
> > 
> > Correct
> >   
> >> Is the page pool also stored in
> >> page_private() and a different signature is used to indicate that?  
> > 
> > No only struct xdp_mem_info as you mentioned before
> >   
> >>
> >> I am not saying we have to do it in this patchset, but we have to consider it
> >> while we are adding new signature field to "struct page", right?  
> > 
> > We won't need a new signature.  The signature in both cases is there to 
> > guarantee the page you are trying to recycle was indeed allocated by page_pool.
> > 
> > Basically we got two design choices here: 
> > - We store the page_pool ptr address directly in page->private and then,
> >   we call into page_pool APIs directly to do the recycling.
> >   That would eliminate the lookup through xdp_mem_info and the
> >   XDP helpers to locate page pool pointer (through __xdp_return).
> > - You store the xdp_mem_info on page_private.  In that case you need to go
> >   through __xdp_return()  to locate the page_pool pointer. Although we might
> >   loose some performance that would allow us to recycle additional memory types
> >   and not only MEM_TYPE_PAGE_POOL (in case we ever need it).  
>
> So the signature field  in "struct page" is used to only indicate a page is
> from a page pool, then how do we tell the content of page_private() if both of
> the above choices are needed, we might still need an extra indicator to tell
> page_private() is page_pool ptr or xdp_mem_info.

The signature field in "struct page" and "xdp_mem_info" is a double
construct that was introduced in this patchset.  AFAIK Matteo took the
idea from Jonathan's patchset.  I'm not convinced we need both, maybe
later we do (when someone introduce a new mem model ala NetGPU).

I think Jonathan's use-case was NetGPU[1] (which got shutdown due to
Nvidia[2] being involved which I think was unfair).  The general idea
behind NetGPU makes sense to me, to allow packet headers to live in
first page, and second page belongs to hardware.  This implies that an
SKB can can point to two different pages with different memory types,
which need to be handled correctly when freeing the SKB and the pages it
points to.  Thus, placing (xdp_)mem_info in SKB is wrong as it implies
all pages belong the same mem_info.type.

The point is, when designing this I want us to think about how our
design can handle other memory models than just page_pool.

In this patchset design, we use a single bit in SKB to indicate that
the pages pointed comes from another memory model, in this case
page_pool is the only user of this bit.  The remaining info about the
memory model (page_pool) is stored in struct-page, which we look at
when freeing the pages that the SKB points to (that is at the layer
above the MM-calls that would free the page for real).

[1] https://linuxplumbersconf.org/event/7/contributions/670/
[2] https://lwn.net/Articles/827596/

> It seems storing the page pool ptr in page_private() is clear for recyclable
> page from a recyclable skb use case, and the use case for storing xdp_mem_info
> in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the
> "struct xdp_frame", so it does not need the xdp_mem_info from page_private().
> 
> If the above is true, what does not really makes sense to me here is that:
> why do we first implement a unclear use case for storing xdp_mem_info in
> page_private(), why not implement the clear use case for storing page pool ptr
> in page_private() first?

I'm actually not against storing page_pool object ptr directly in
struct-page.  It is a nice optimization.  Perhaps we should implement
this optimization outside this patchset first, and let __xdp_return()
for XDP-redirected packets also take advantage to this optimization?

Then it would feel more natural to also used this optimization in this
patchset, right?

> > 
> > 
> > I think both choices are sane.  What I am trying to explain here, is
> > regardless of what we choose now, we can change it in the future without
> > affecting the API consumers at all.  What will change internally is the way we
> > lookup the page pool pointer we are trying to recycle.  
> 
> It seems the below API need changing?
> +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
> +					struct xdp_mem_info *mem)

I don't think we need to change this API, to support future memory
models.  Notice that xdp_mem_info have a 'type' member.

Naming in Computer Science is a hard problem ;-). Something that seems
to confuse a lot of people is the naming of the struct "xdp_mem_info".  
Maybe we should have named it "mem_info" instead or "net_mem_info", as
it doesn't indicate that the device is running XDP.

I see XDP as the RX-layer before the network stack, that helps drivers
to support different memory models, also for handling normal packets
that doesn't get process by XDP, and the drivers doesn't even need to
support XDP to use the "xdp_mem_info" type.

Christoph Hellwig May 7, 2021, 11:31 a.m. UTC | #30

On Fri, May 07, 2021 at 12:19:53PM +0200, Jesper Dangaard Brouer wrote:
> Nvidia[2] being involved which I think was unfair).  The general idea
> behind NetGPU makes sense to me,

Sorry, but that is utter bullshit.  It was rejected because it did
nothing but injecting hooks for an out of tree module while ignoring the
existing kernel infrastructure for much of what it tries to archive.

Shay Agroskin May 9, 2021, 5:11 a.m. UTC | #31

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Fri, 7 May 2021 16:28:30 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
>> On 2021/5/7 15:06, Ilias Apalodimas wrote:
>> > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:  
>> >> On 2021/5/6 20:58, Ilias Apalodimas wrote:  
>> >>>>>>  
>> >>>>>
> ...
>> > 
>> > 
>> > I think both choices are sane.  What I am trying to explain 
>> > here, is
>> > regardless of what we choose now, we can change it in the 
>> > future without
>> > affecting the API consumers at all.  What will change 
>> > internally is the way we
>> > lookup the page pool pointer we are trying to recycle.  
>> 
>> It seems the below API need changing?
>> +static inline void skb_mark_for_recycle(struct sk_buff *skb, 
>> struct page *page,
>> +					struct xdp_mem_info *mem)
>
> I don't think we need to change this API, to support future 
> memory
> models.  Notice that xdp_mem_info have a 'type' member.

Hi,
Providing that we will (possibly as a future optimization) store 
the pointer to the page pool in struct page instead of strcut 
xdp_mem_info, passing
xdp_mem_info * instead of struct page_pool * would mean that for 
every packet we'll need to call
             xa = rhashtable_lookup(mem_id_ht, &mem->id, 
             mem_id_rht_params);
             xa->page_pool;

which might pressure the Dcache to fetch a pointer that might be 
present already in cache as part of driver's data-structures.

I tend to agree with Yunsheng that it makes more sense to adjust 
the API for the clear use-case now rather than using xdp_mem_info 
indirection. It seems to me like
the page signature provides the same information anyway and allows 
to support different memory types.

Shay

>
> Naming in Computer Science is a hard problem ;-). Something that 
> seems
> to confuse a lot of people is the naming of the struct 
> "xdp_mem_info".  
> Maybe we should have named it "mem_info" instead or 
> "net_mem_info", as
> it doesn't indicate that the device is running XDP.
>
> I see XDP as the RX-layer before the network stack, that helps 
> drivers
> to support different memory models, also for handling normal 
> packets
> that doesn't get process by XDP, and the drivers doesn't even 
> need to
> support XDP to use the "xdp_mem_info" type.

Yunsheng Lin May 10, 2021, 2:20 a.m. UTC | #32

On 2021/5/7 18:19, Jesper Dangaard Brouer wrote:
> On Fri, 7 May 2021 16:28:30 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 

[...]

> 
> I'm actually not against storing page_pool object ptr directly in
> struct-page.  It is a nice optimization.  Perhaps we should implement
> this optimization outside this patchset first, and let __xdp_return()
> for XDP-redirected packets also take advantage to this optimization?
> 
> Then it would feel more natural to also used this optimization in this
> patchset, right?

Yes, it would be good if XDP can take advantage of this optimization too.

Then it seems we can remove the "mem_id_ht"? if we want to support different
type of page pool in the future, the type field is in the page pool too
when page_pool object ptr directly in struct-page.

>

Ilias Apalodimas May 11, 2021, 8:41 a.m. UTC | #33

Hi Shay,

On Sun, May 09, 2021 at 08:11:35AM +0300, Shay Agroskin wrote:
> 
> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> 
> > On Fri, 7 May 2021 16:28:30 +0800
> > Yunsheng Lin <linyunsheng@huawei.com> wrote:
> > 
> > > On 2021/5/7 15:06, Ilias Apalodimas wrote:
> > > > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:  >>
> > > On 2021/5/6 20:58, Ilias Apalodimas wrote:  >>>>>>  >>>>>
> > ...
> > > > > > I think both choices are sane.  What I am trying to explain >
> > > here, is
> > > > regardless of what we choose now, we can change it in the > future
> > > without
> > > > affecting the API consumers at all.  What will change > internally
> > > is the way we
> > > > lookup the page pool pointer we are trying to recycle.
> > > 
> > > It seems the below API need changing?
> > > +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct
> > > page *page,
> > > +					struct xdp_mem_info *mem)
> > 
> > I don't think we need to change this API, to support future memory
> > models.  Notice that xdp_mem_info have a 'type' member.
> 
> Hi,
> Providing that we will (possibly as a future optimization) store the pointer
> to the page pool in struct page instead of strcut xdp_mem_info, passing
> xdp_mem_info * instead of struct page_pool * would mean that for every
> packet we'll need to call
>             xa = rhashtable_lookup(mem_id_ht, &mem->id,
> mem_id_rht_params);
>             xa->page_pool;
> 
> which might pressure the Dcache to fetch a pointer that might be present
> already in cache as part of driver's data-structures.
> 
> I tend to agree with Yunsheng that it makes more sense to adjust the API for
> the clear use-case now rather than using xdp_mem_info indirection. It seems
> to me like
> the page signature provides the same information anyway and allows to
> support different memory types.

We've switched the patches already.  We didn't notice any performance boost
by doing so (tested on a machiattobin), but I agree as well.  As I
explained the only thing that will change if we ever the need the struct
xdp_mem_info in struct page is the internal contract between struct page
and the recycling function, so let's start clean and see if we ever need
that.


Cheers
/Ilias
> 
> Shay
> 
> > 
> > Naming in Computer Science is a hard problem ;-). Something that seems
> > to confuse a lot of people is the naming of the struct "xdp_mem_info".
> > Maybe we should have named it "mem_info" instead or "net_mem_info", as
> > it doesn't indicate that the device is running XDP.
> > 
> > I see XDP as the RX-layer before the network stack, that helps drivers
> > to support different memory models, also for handling normal packets
> > that doesn't get process by XDP, and the drivers doesn't even need to
> > support XDP to use the "xdp_mem_info" type.
>

[net-next,v3,0/5] page_pool: recycle buffers

Message

Comments