[net-next,v3,2/5] mm: add a signature in struct page

Message ID	20210409223801.104657-3-mcroce@linux.microsoft.com (mailing list archive)
State	Changes Requested
Delegated to:	Netdev Maintainers
Headers	show Return-Path: <bpf-owner@kernel.org> From: Matteo Croce <mcroce@linux.microsoft.com> To: netdev@vger.kernel.org, linux-mm@kvack.org Cc: Ayush Sawal <ayush.sawal@chelsio.com>, Vinay Kumar Yadav <vinay.yadav@chelsio.com>, Rohit Maheshwari <rohitm@chelsio.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Thomas Petazzoni <thomas.petazzoni@bootlin.com>, Marcin Wojtas <mw@semihalf.com>, Russell King <linux@armlinux.org.uk>, Mirko Lindner <mlindner@marvell.com>, Stephen Hemminger <stephen@networkplumber.org>, Tariq Toukan <tariqt@nvidia.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, John Fastabend <john.fastabend@gmail.com>, Boris Pismenny <borisp@nvidia.com>, Arnd Bergmann <arnd@arndb.de>, Andrew Morton <akpm@linux-foundation.org>, "Peter Zijlstra (Intel)" <peterz@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, Yu Zhao <yuzhao@google.com>, Will Deacon <will@kernel.org>, Michel Lespinasse <walken@google.com>, Fenghua Yu <fenghua.yu@intel.com>, Roman Gushchin <guro@fb.com>, Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>, Jason Gunthorpe <jgg@ziepe.ca>, Guoqing Jiang <guoqing.jiang@cloud.ionos.com>, Jonathan Lemon <jonathan.lemon@gmail.com>, Alexander Lobakin <alobakin@pm.me>, Cong Wang <cong.wang@bytedance.com>, wenxu <wenxu@ucloud.cn>, Kevin Hao <haokexin@gmail.com>, Aleksandr Nogikh <nogikh@google.com>, Jakub Sitnicki <jakub@cloudflare.com>, Marco Elver <elver@google.com>, Willem de Bruijn <willemb@google.com>, Miaohe Lin <linmiaohe@huawei.com>, Yunsheng Lin <linyunsheng@huawei.com>, Guillaume Nault <gnault@redhat.com>, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org, Matthew Wilcox <willy@infradead.org>, Eric Dumazet <edumazet@google.com>, David Ahern <dsahern@gmail.com>, Lorenzo Bianconi <lorenzo@kernel.org>, Saeed Mahameed <saeedm@nvidia.com>, Andrew Lunn <andrew@lunn.ch>, Paolo Abeni <pabeni@redhat.com> Subject: [PATCH net-next v3 2/5] mm: add a signature in struct page Date: Sat, 10 Apr 2021 00:37:58 +0200 Message-Id: <20210409223801.104657-3-mcroce@linux.microsoft.com> In-Reply-To: <20210409223801.104657-1-mcroce@linux.microsoft.com> References: <20210409223801.104657-1-mcroce@linux.microsoft.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	page_pool: recycle buffers \| expand [net-next,v3,0/5] page_pool: recycle buffers [net-next,v3,1/5] xdp: reduce size of struct xdp_mem_info [net-next,v3,2/5] mm: add a signature in struct page [net-next,v3,3/5] page_pool: Allow drivers to hint on SKB recycling [net-next,v3,4/5] mvpp2: recycle buffers [net-next,v3,5/5] mvneta: recycle buffers

Context	Check	Description
netdev/cover_letter	success	Link
netdev/fixes_present	success	Link
netdev/patch_count	success	Link
netdev/tree_selection	success	Clearly marked for net-next
netdev/subject_prefix	success	Link
netdev/cc_maintainers	warning	5 maintainers not CCed: yhs@fb.com kpsingh@kernel.org andrii@kernel.org kafai@fb.com songliubraving@fb.com
netdev/source_inline	success	Was 0 now: 0
netdev/verify_signedoff	success	Link
netdev/module_param	success	Was 0 now: 0
netdev/build_32bit	success	Errors and warnings before: 23170 this patch: 23170
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/verify_fixes	success	Link
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 31 lines checked
netdev/build_allmodconfig_warn	success	Errors and warnings before: 22776 this patch: 22776
netdev/header_inline	success	Link

Matteo Croce April 9, 2021, 10:37 p.m. UTC

From: Matteo Croce <mcroce@microsoft.com>

This is needed by the page_pool to avoid recycling a page not allocated
via page_pool.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 include/linux/mm_types.h | 1 +
 include/net/page_pool.h  | 2 ++
 net/core/page_pool.c     | 4 ++++
 3 files changed, 7 insertions(+)

Matthew Wilcox (Oracle) April 10, 2021, 3:48 p.m. UTC | #1

On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> This is needed by the page_pool to avoid recycling a page not allocated
> via page_pool.

Is the PageType mechanism more appropriate to your needs?  It wouldn't
be if you use page->_mapcount (ie mapping it to userspace).

> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>  include/linux/mm_types.h | 1 +
>  include/net/page_pool.h  | 2 ++
>  net/core/page_pool.c     | 4 ++++
>  3 files changed, 7 insertions(+)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..ef2d0d5f62e4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -101,6 +101,7 @@ struct page {
>  			 * 32-bit architectures.
>  			 */
>  			dma_addr_t dma_addr;
> +			unsigned long signature;
>  		};
>  		struct {	/* slab, slob and slub */
>  			union {
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..b30405e84b5e 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -63,6 +63,8 @@
>   */
>  #define PP_ALLOC_CACHE_SIZE	128
>  #define PP_ALLOC_CACHE_REFILL	64
> +#define PP_SIGNATURE		0x20210303
> +
>  struct pp_alloc_cache {
>  	u32 count;
>  	void *cache[PP_ALLOC_CACHE_SIZE];
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..2ae9b554ef98 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>  
>  skip_dma_map:
> +	page->signature = PP_SIGNATURE;
> +
>  	/* Track how many pages are held 'in-flight' */
>  	pool->pages_state_hold_cnt++;
>  
> @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>  			     DMA_ATTR_SKIP_CPU_SYNC);
>  	page->dma_addr = 0;
>  skip_dma_unmap:
> +	page->signature = 0;
> +
>  	/* This may be the last page returned, releasing the pool, so
>  	 * it is not safe to reference pool afterwards.
>  	 */
> -- 
> 2.30.2
>

Ilias Apalodimas April 10, 2021, 4:16 p.m. UTC | #2

Hi Matthew 

On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > This is needed by the page_pool to avoid recycling a page not allocated
> > via page_pool.
> 
> Is the PageType mechanism more appropriate to your needs?  It wouldn't
> be if you use page->_mapcount (ie mapping it to userspace).

Interesting!
Please keep in mind this was written ~2018 and was stale on my branches for
quite some time.  So back then I did try to use PageType, but had not free
bits.  Looking at it again though, it's cleaned up.  So yes I think this can
be much much cleaner.  Should we go and define a new PG_pagepool?


Thanks!
/Ilias
> 
> > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > ---
> >  include/linux/mm_types.h | 1 +
> >  include/net/page_pool.h  | 2 ++
> >  net/core/page_pool.c     | 4 ++++
> >  3 files changed, 7 insertions(+)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6613b26a8894..ef2d0d5f62e4 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -101,6 +101,7 @@ struct page {
> >  			 * 32-bit architectures.
> >  			 */
> >  			dma_addr_t dma_addr;
> > +			unsigned long signature;
> >  		};
> >  		struct {	/* slab, slob and slub */
> >  			union {
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index b5b195305346..b30405e84b5e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -63,6 +63,8 @@
> >   */
> >  #define PP_ALLOC_CACHE_SIZE	128
> >  #define PP_ALLOC_CACHE_REFILL	64
> > +#define PP_SIGNATURE		0x20210303
> > +
> >  struct pp_alloc_cache {
> >  	u32 count;
> >  	void *cache[PP_ALLOC_CACHE_SIZE];
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index ad8b0707af04..2ae9b554ef98 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> >  		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> >  
> >  skip_dma_map:
> > +	page->signature = PP_SIGNATURE;
> > +
> >  	/* Track how many pages are held 'in-flight' */
> >  	pool->pages_state_hold_cnt++;
> >  
> > @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
> >  			     DMA_ATTR_SKIP_CPU_SYNC);
> >  	page->dma_addr = 0;
> >  skip_dma_unmap:
> > +	page->signature = 0;
> > +
> >  	/* This may be the last page returned, releasing the pool, so
> >  	 * it is not safe to reference pool afterwards.
> >  	 */
> > -- 
> > 2.30.2
> >

Shakeel Butt April 10, 2021, 5:42 p.m. UTC | #3

On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Matthew
>
> On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > This is needed by the page_pool to avoid recycling a page not allocated
> > > via page_pool.
> >
> > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > be if you use page->_mapcount (ie mapping it to userspace).
>
> Interesting!
> Please keep in mind this was written ~2018 and was stale on my branches for
> quite some time.  So back then I did try to use PageType, but had not free
> bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> be much much cleaner.  Should we go and define a new PG_pagepool?
>
>

Can this page_pool be used for TCP RX zerocopy? If yes then PageType
can not be used.

There is a recent discussion [1] on memcg accounting of TCP RX
zerocopy and I am wondering if this work can somehow help in that
regard. I will take a look at the series.

[1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Ilias Apalodimas April 10, 2021, 6:27 p.m. UTC | #4

Hi Shakeel, 

On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote:
> On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> > Hi Matthew
> >
> > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > > This is needed by the page_pool to avoid recycling a page not allocated
> > > > via page_pool.
> > >
> > > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > > be if you use page->_mapcount (ie mapping it to userspace).
> >
> > Interesting!
> > Please keep in mind this was written ~2018 and was stale on my branches for
> > quite some time.  So back then I did try to use PageType, but had not free
> > bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> > be much much cleaner.  Should we go and define a new PG_pagepool?
> >
> >
> 
> Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> can not be used.

Yes it can, since it's going to be used as your default allocator for
payloads, which might end up on an SKB.
So we have to keep the extra added field on struct page for our mark.
Matthew had an intersting idea.  He suggested keeping it, but changing the 
magic number, so it can't be a kernel address, but I'll let him follow 
up on the details.

> 
> There is a recent discussion [1] on memcg accounting of TCP RX
> zerocopy and I am wondering if this work can somehow help in that
> regard. I will take a look at the series.
> 

I'll try having a look on this as well. The idea behind the patchset is to
allow lower speed NICs that use the API already, gain recycling 'easily'.  
Using page_pool for the driver comes with a penalty to begin with.
Allocating pages instead of SKBs has a measurable difference. By enabling them
to recycle they'll get better performance, since you skip the
reallocation/remapping and only care for syncing the buffers correctly.

> [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Matthew Wilcox (Oracle) April 10, 2021, 7:39 p.m. UTC | #5

On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote:
> > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > can not be used.
> 
> Yes it can, since it's going to be used as your default allocator for
> payloads, which might end up on an SKB.
> So we have to keep the extra added field on struct page for our mark.
> Matthew had an intersting idea.  He suggested keeping it, but changing the 
> magic number, so it can't be a kernel address, but I'll let him follow 
> up on the details.

Sure!  So, given the misalignment problem I discovered yesterday [1],
we probably want a page_pool page to look like:

unsigned long	flags;
unsigned long	pp_magic;
unsigned long	xmi;
unsigned long	_pp_mapping_pad;
dma_addr_t	dma_addr;	/* might be one or two words */

The only real restriction here is that pp_magic should not be a valid
pointer, and it must have the bottom bit clear.  I'd recommend something
like:

#define PP_MAGIC	(0x20 + POISON_POINTER_DELTA)

This leaves page->mapping as NULL, so you don't have to worry about
clearing it before free.

[1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/

Jesper Dangaard Brouer April 11, 2021, 10:05 a.m. UTC | #6

On Sat, 10 Apr 2021 20:39:55 +0100
Matthew Wilcox <willy@infradead.org> wrote:

> On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote:
> > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > can not be used.  
> > 
> > Yes it can, since it's going to be used as your default allocator for
> > payloads, which might end up on an SKB.
> > So we have to keep the extra added field on struct page for our mark.
> > Matthew had an intersting idea.  He suggested keeping it, but changing the 
> > magic number, so it can't be a kernel address, but I'll let him follow 
> > up on the details.  
> 
> Sure!  So, given the misalignment problem I discovered yesterday [1],
> we probably want a page_pool page to look like:
> 
> unsigned long	flags;
> unsigned long	pp_magic;
> unsigned long	xmi;
> unsigned long	_pp_mapping_pad;
> dma_addr_t	dma_addr;	/* might be one or two words */
> 
> The only real restriction here is that pp_magic should not be a valid
> pointer, and it must have the bottom bit clear.  I'd recommend something
> like:
> 
> #define PP_MAGIC	(0x20 + POISON_POINTER_DELTA)
> 
> This leaves page->mapping as NULL, so you don't have to worry about
> clearing it before free.
>
> [1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/

I didn't see this, before asking[2] for explaining your intent.
I still worry about page->index, see [2].

[2] https://lore.kernel.org/netdev/20210411114307.5087f958@carbon/

Jesper Dangaard Brouer April 14, 2021, 7:41 p.m. UTC | #7

On Sat, 10 Apr 2021 21:27:31 +0300
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote:
>
> > On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
> > <ilias.apalodimas@linaro.org> wrote:  
> > >
> > > Hi Matthew
> > >
> > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:  
> > > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:  
> > > > > This is needed by the page_pool to avoid recycling a page not allocated
> > > > > via page_pool.  
> > > >
> > > > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > > > be if you use page->_mapcount (ie mapping it to userspace).  
> > >
> > > Interesting!
> > > Please keep in mind this was written ~2018 and was stale on my branches for
> > > quite some time.  So back then I did try to use PageType, but had not free
> > > bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> > > be much much cleaner.  Should we go and define a new PG_pagepool?
> > >
> > 
> > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > can not be used.  
> 
> Yes it can, since it's going to be used as your default allocator for
> payloads, which might end up on an SKB.

I'm not sure we want or should "allow" page_pool be used for TCP RX
zerocopy.
For several reasons.

(1) This implies mapping these pages page to userspace, which AFAIK
means using page->mapping and page->index members (right?).

(2) It feels wrong (security wise) to keep the DMA-mapping (for the
device) and also map this page into userspace.

(3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
zerocopy will bump the refcnt, which means the page_pool will not
recycle the page when it see the elevated refcnt (it will instead
release its DMA-mapping).

(4) I remember vaguely that this code path for (TCP RX zerocopy) uses
page->private for tricks.  And our patch [3/5] use page->private for
storing xdp_mem_info.

IMHO when the SKB travel into this TCP RX zerocopy code path, we should
call page_pool_release_page() to release its DMA-mapping.

> > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Shakeel Butt April 14, 2021, 8:09 p.m. UTC | #8

On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
[...]
> > >
> > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > can not be used.
> >
> > Yes it can, since it's going to be used as your default allocator for
> > payloads, which might end up on an SKB.
>
> I'm not sure we want or should "allow" page_pool be used for TCP RX
> zerocopy.
> For several reasons.
>
> (1) This implies mapping these pages page to userspace, which AFAIK
> means using page->mapping and page->index members (right?).
>

No, only page->_mapcount is used.

> (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> device) and also map this page into userspace.
>

I think this is already the case i.e pages still DMA-mapped and also
mapped into userspace.

> (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> zerocopy will bump the refcnt, which means the page_pool will not
> recycle the page when it see the elevated refcnt (it will instead
> release its DMA-mapping).

Yes this is right but the userspace might have already consumed and
unmapped the page before the driver considers to recycle the page.

>
> (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> page->private for tricks.  And our patch [3/5] use page->private for
> storing xdp_mem_info.
>
> IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> call page_pool_release_page() to release its DMA-mapping.
>

I will let TCP RX zerocopy experts respond to this but from my high
level code inspection, I didn't see page->private usage.

Eric Dumazet April 14, 2021, 8:51 p.m. UTC | #9

On Wed, Apr 14, 2021 at 10:09 PM Shakeel Butt <shakeelb@google.com> wrote:

>
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

Indeed, we do not use page->private, since we do not own the page(s).

Ilias Apalodimas April 19, 2021, 5:12 a.m. UTC | #10

On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> [...]
> > > >
> > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > can not be used.
> > >
> > > Yes it can, since it's going to be used as your default allocator for
> > > payloads, which might end up on an SKB.
> >
> > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > zerocopy.
> > For several reasons.
> >
> > (1) This implies mapping these pages page to userspace, which AFAIK
> > means using page->mapping and page->index members (right?).
> >
> 
> No, only page->_mapcount is used.
> 

I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
adopt the recycling mechanism we should try preserving the current
functionality of the network stack.

The question is how does it work with the current drivers that already have an
internal page recycling mechanism.

> > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > device) and also map this page into userspace.
> >
> 
> I think this is already the case i.e pages still DMA-mapped and also
> mapped into userspace.
> 
> > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > zerocopy will bump the refcnt, which means the page_pool will not
> > recycle the page when it see the elevated refcnt (it will instead
> > release its DMA-mapping).
> 
> Yes this is right but the userspace might have already consumed and
> unmapped the page before the driver considers to recycle the page.

Same question here. I'll have a closer look in a few days and make sure we are
not breaking anything wrt zerocopy.

> 
> >
> > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > page->private for tricks.  And our patch [3/5] use page->private for
> > storing xdp_mem_info.
> >
> > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > call page_pool_release_page() to release its DMA-mapping.
> >
> 
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

Shakeel are you aware of any 'easy' way I can have rx zerocopy running?


Thanks!
/Ilias

Jesper Dangaard Brouer April 19, 2021, 11:22 a.m. UTC | #11

On Wed, 14 Apr 2021 13:09:47 -0700
Shakeel Butt <shakeelb@google.com> wrote:

> On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >  
> [...]
> > > >
> > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > can not be used.  
> > >
> > > Yes it can, since it's going to be used as your default allocator for
> > > payloads, which might end up on an SKB.  
> >
> > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > zerocopy.
> > For several reasons.
> >
> > (1) This implies mapping these pages page to userspace, which AFAIK
> > means using page->mapping and page->index members (right?).
> >  
> 
> No, only page->_mapcount is used.

Good to know.
I will admit that I don't fully understand the usage of page->mapping
and page->index members.

> > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > device) and also map this page into userspace.
> >  
> 
> I think this is already the case i.e pages still DMA-mapped and also
> mapped into userspace.

True, other drivers are doing the same.

> > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > zerocopy will bump the refcnt, which means the page_pool will not
> > recycle the page when it see the elevated refcnt (it will instead
> > release its DMA-mapping).  
> 
> Yes this is right but the userspace might have already consumed and
> unmapped the page before the driver considers to recycle the page.

That is a good point.  So, there is a race window where it is possible
to gain recycling.

It seems my page_pool co-maintainer Ilias is interested in taking up the
challenge to get this working with TCP RX zerocopy.  So, lets see how
this is doable.

 
> >
> > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > page->private for tricks.  And our patch [3/5] use page->private for
> > storing xdp_mem_info.
> >
> > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > call page_pool_release_page() to release its DMA-mapping.
> >  
> 
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

I trust when Eric says page->private isn't used in this code path.
So, it might actually be possible :-)

I will challenge Ilias and Matteo to pull this off. (But I know that
both of them are busy for personal reasons, so be patient with them).

Matthew Wilcox (Oracle) April 19, 2021, 1:01 p.m. UTC | #12

On Mon, Apr 19, 2021 at 01:22:04PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 14 Apr 2021 13:09:47 -0700
> Shakeel Butt <shakeelb@google.com> wrote:
> 
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > >  
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.  
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.  
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >  
> > 
> > No, only page->_mapcount is used.
> 
> Good to know.
> I will admit that I don't fully understand the usage of page->mapping
> and page->index members.

That's fair.  It's not well-documented, and it's complicated.

For a page mapped into userspace, page->mapping is one of:
 - NULL
 - A pointer to a file's address_space
 - A pointer to an anonymous page's anon_vma
If a page isn't mapped into userspace, you can use the space in page->mapping
for anything you like (eg slab uses it)

page->index is only used for indicating pfmemalloc today (and I want to
move that indicator).  I think it can also be used to merge VMAs (if
some other conditions are also true), but failing to merge VMAs isn't
a big deal for this kind of situation.

> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> > 
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
> 
> True, other drivers are doing the same.

And the contents of this page already came from that device ... if it
wanted to write bad data, it could already have done so.

> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).  
> > 
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
> 
> That is a good point.  So, there is a race window where it is possible
> to gain recycling.
> 
> It seems my page_pool co-maintainer Ilias is interested in taking up the
> challenge to get this working with TCP RX zerocopy.  So, lets see how
> this is doable.

You could also check page_ref_count() - page_mapcount() instead of
just checking page_ref_count().  Assuming mapping/unmapping can't
race with recycling?

Shakeel Butt April 19, 2021, 2:57 p.m. UTC | #13

On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > >
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >
> >
> > No, only page->_mapcount is used.
> >
>
> I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> adopt the recycling mechanism we should try preserving the current
> functionality of the network stack.
>
> The question is how does it work with the current drivers that already have an
> internal page recycling mechanism.
>

I think the current drivers check page_ref_count(page) to decide to
reuse (or not) the already allocated pages.

Some examples from the drivers:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()

> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> > >
> >
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
> >
> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).
> >
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
>
> Same question here. I'll have a closer look in a few days and make sure we are
> not breaking anything wrt zerocopy.
>

Pages mapped into the userspace have their refcnt elevated, so the
page_ref_count() check by the drivers indicates to not reuse such
pages.

> >
> > >
> > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > page->private for tricks.  And our patch [3/5] use page->private for
> > > storing xdp_mem_info.
> > >
> > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > call page_pool_release_page() to release its DMA-mapping.
> > >
> >
> > I will let TCP RX zerocopy experts respond to this but from my high
> > level code inspection, I didn't see page->private usage.
>
> Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
>

I would recommend tools/testing/selftests/net/tcp_mmap.c.

Ilias Apalodimas April 19, 2021, 3:43 p.m. UTC | #14

Hi Shakeel,
On Mon, Apr 19, 2021 at 07:57:03AM -0700, Shakeel Butt wrote:
> On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > > <brouer@redhat.com> wrote:
> > > >
> > > [...]
> > > > > >
> > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > > can not be used.
> > > > >
> > > > > Yes it can, since it's going to be used as your default allocator for
> > > > > payloads, which might end up on an SKB.
> > > >
> > > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > > zerocopy.
> > > > For several reasons.
> > > >
> > > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > > means using page->mapping and page->index members (right?).
> > > >
> > >
> > > No, only page->_mapcount is used.
> > >
> >
> > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> > adopt the recycling mechanism we should try preserving the current
> > functionality of the network stack.
> >
> > The question is how does it work with the current drivers that already have an
> > internal page recycling mechanism.
> >
> 
> I think the current drivers check page_ref_count(page) to decide to
> reuse (or not) the already allocated pages.
> 
> Some examples from the drivers:
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
> drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()
> 

Yes, that's how internal recycling is done in drivers. As Jesper mentioned the
refcnt of the page is 1 for the page_pool owned pages and that's how we decide 
what to do with the page.

> > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > > device) and also map this page into userspace.
> > > >
> > >
> > > I think this is already the case i.e pages still DMA-mapped and also
> > > mapped into userspace.
> > >
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).
> > >
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> >
> > Same question here. I'll have a closer look in a few days and make sure we are
> > not breaking anything wrt zerocopy.
> >
> 
> Pages mapped into the userspace have their refcnt elevated, so the
> page_ref_count() check by the drivers indicates to not reuse such
> pages.
> 

When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() 
which will end up doing a get_page().
What you are saying is that once the zerocopy is done though, skb_release_data() 
won't be called, but instead put_page() will be? If that's the case then we are 
indeed leaking DMA mappings and memory. That sounds weird though, since the
refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
eventually frees the page? 
If kfree_skb() (or any wrapper that calls skb_release_data()) is called 
eventually, we'll end up properly recycling the page into our pool.

> > >
> > > >
> > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > > page->private for tricks.  And our patch [3/5] use page->private for
> > > > storing xdp_mem_info.
> > > >
> > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > > call page_pool_release_page() to release its DMA-mapping.
> > > >
> > >
> > > I will let TCP RX zerocopy experts respond to this but from my high
> > > level code inspection, I didn't see page->private usage.
> >
> > Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
> >
> 
> I would recommend tools/testing/selftests/net/tcp_mmap.c.

Ok, thanks I'll have a look.

Cheers
/Ilias

Shakeel Butt April 19, 2021, 4:21 p.m. UTC | #15

On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
[...]
> > Pages mapped into the userspace have their refcnt elevated, so the
> > page_ref_count() check by the drivers indicates to not reuse such
> > pages.
> >
>
> When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch()
> which will end up doing a get_page().
> What you are saying is that once the zerocopy is done though, skb_release_data()
> won't be called, but instead put_page() will be? If that's the case then we are
> indeed leaking DMA mappings and memory. That sounds weird though, since the
> refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
> eventually frees the page?
> If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> eventually, we'll end up properly recycling the page into our pool.
>

From what I understand (Eric, please correct me if I'm wrong) for
simple cases there are 3 page references taken. One by the driver,
second by skb and third by page table.

In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
page ref through insert_page_into_pte_locked(). However before
returning from tcp_zerocopy_receive(), the skb references are dropped
through tcp_recv_skb(). So, whenever the user unmaps the page and
drops the page ref only then that page can be reused by the driver.

In my understanding, for zerocopy rx the skb_release_data() is called
on the pages while they are still mapped into the userspace. So,
skb_release_data() might not be the right place to recycle the page
for zerocopy. The email chain at [1] has some discussion on how to
bundle the recycling of pages with their lifetime.

[1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Ilias Apalodimas April 19, 2021, 6:41 p.m. UTC | #16

On Mon, Apr 19, 2021 at 09:21:55AM -0700, Shakeel Butt wrote:
> On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> [...]
> > > Pages mapped into the userspace have their refcnt elevated, so the
> > > page_ref_count() check by the drivers indicates to not reuse such
> > > pages.
> > >
> >
> > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch()
> > which will end up doing a get_page().
> > What you are saying is that once the zerocopy is done though, skb_release_data()
> > won't be called, but instead put_page() will be? If that's the case then we are
> > indeed leaking DMA mappings and memory. That sounds weird though, since the
> > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
> > eventually frees the page?
> > If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> > eventually, we'll end up properly recycling the page into our pool.
> >
> 
> From what I understand (Eric, please correct me if I'm wrong) for
> simple cases there are 3 page references taken. One by the driver,
> second by skb and third by page table.
> 
> In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
> page ref through insert_page_into_pte_locked(). However before
> returning from tcp_zerocopy_receive(), the skb references are dropped
> through tcp_recv_skb(). So, whenever the user unmaps the page and
> drops the page ref only then that page can be reused by the driver.
> 
> In my understanding, for zerocopy rx the skb_release_data() is called
> on the pages while they are still mapped into the userspace. So,
> skb_release_data() might not be the right place to recycle the page
> for zerocopy. The email chain at [1] has some discussion on how to
> bundle the recycling of pages with their lifetime.
> 
> [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/

Ah right, you mentioned the same email before and I completely forgot about
it! In the past we had thoughts of 'stealing' the page on put_page instead of 
skb_release_data().  We were afraid that this would cause a measurable 
performance hit, so we tried to limit it within the skb lifecycle.

However I don't think this will be a problem.  Assuming we are right here and 
skb_release_data() is called while the userspace holds an extra reference from
the mapping here's what will happen:

skb_release_data() -> skb_free_head() -> page_pool_return_skb_page() ->
set_page_private() -> xdp_return_skb_frame() -> __xdp_return() -> 
page_pool_put_full_page() -> page_pool_put_page() -> __page_pool_put_page()

When we call __page_pool_put_page(), the refcnt will be != 1 (because a user
mapping is still active), so we won't try to recycle it. Instead we'll remove 
the DMA mappings and decrease the refcnt.

So although the recycling won't 'work', nothing bad will happen (famous last
words).

In any case, I'll double check with the test you pointed out before v4.

Thanks!
/Ilias

Ilias Apalodimas April 20, 2021, 8:10 a.m. UTC | #17

Hi Matthew,

[...]
> 
> And the contents of this page already came from that device ... if it
> wanted to write bad data, it could already have done so.
> 
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).  
> > > 
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> > 
> > That is a good point.  So, there is a race window where it is possible
> > to gain recycling.
> > 
> > It seems my page_pool co-maintainer Ilias is interested in taking up the
> > challenge to get this working with TCP RX zerocopy.  So, lets see how
> > this is doable.
> 
> You could also check page_ref_count() - page_mapcount() instead of
> just checking page_ref_count().  Assuming mapping/unmapping can't
> race with recycling?
> 

That's not a bad idea.  As I explained on my last reply to Shakeel, I don't
think the current patch will blow up anywhere.  If the page is unmapped prior
to kfree_skb() it will be recycled.  If it's done in a reverse order, we'll
just free the page entirely and will have to re-allocate it.
The only thing I need to test is potential races (assuming those can even
happen?).

Trying to recycle the page outside of kfree_skb() means we'd have to 'steal'
the page, during put_page() (or some function that's outside the networking
scope).  I think this is going to have a measurable performance penalty though
not in networking, but in general.

In any case, that should be orthogonal to the current patchset.  So unless
someone feels strongly about it, I'd prefer keeping the current code and
trying to enable recycling in the skb zc case, when we have enough users of
the API.


Thanks
/Ilias

[net-next,v3,2/5] mm: add a signature in struct page

Checks

Commit Message

Comments

Patch