diff mbox series

[1/1] mm: Fix struct page layout on 32-bit systems

Message ID 20210410205246.507048-2-willy@infradead.org (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Headers show
Series Fix struct page layout on 32-bit systems | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Matthew Wilcox (Oracle) April 10, 2021, 8:52 p.m. UTC
32-bit architectures which expect 8-byte alignment for 8-byte integers
and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
page inadvertently expanded in 2019.  When the dma_addr_t was added,
it forced the alignment of the union to 8 bytes, which inserted a 4 byte
gap between 'flags' and the union.

We could fix this by telling the compiler to use a smaller alignment
for the dma_addr, but that seems a little fragile.  Instead, move the
'flags' into the union.  That causes dma_addr to shift into the same
bits as 'mapping', so it would have to be cleared on free.  To avoid
this, insert three words of padding and use the same bits as ->index
and ->private, neither of which have to be cleared on free.

Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm_types.h | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

Comments

Jesper Dangaard Brouer April 11, 2021, 9:43 a.m. UTC | #1
On Sat, 10 Apr 2021 21:52:45 +0100
"Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> 32-bit architectures which expect 8-byte alignment for 8-byte integers
> and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> page inadvertently expanded in 2019.  When the dma_addr_t was added,
> it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> gap between 'flags' and the union.
> 
> We could fix this by telling the compiler to use a smaller alignment
> for the dma_addr, but that seems a little fragile.  Instead, move the
> 'flags' into the union.  That causes dma_addr to shift into the same
> bits as 'mapping', so it would have to be cleared on free.  To avoid
> this, insert three words of padding and use the same bits as ->index
> and ->private, neither of which have to be cleared on free.
> 
> Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/mm_types.h | 38 ++++++++++++++++++++++++++------------
>  1 file changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..45c563e9b50e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -68,16 +68,22 @@ struct mem_cgroup;
>  #endif
>  
>  struct page {
> -	unsigned long flags;		/* Atomic flags, some possibly
> -					 * updated asynchronously */
>  	/*
> -	 * Five words (20/40 bytes) are available in this union.
> -	 * WARNING: bit 0 of the first word is used for PageTail(). That
> -	 * means the other users of this union MUST NOT use the bit to
> +	 * This union is six words (24 / 48 bytes) in size.
> +	 * The first word is reserved for atomic flags, often updated
> +	 * asynchronously.  Use the PageFoo() macros to access it.  Some
> +	 * of the flags can be reused for your own purposes, but the
> +	 * word as a whole often contains other information and overwriting
> +	 * it will cause functions like page_zone() and page_node() to stop
> +	 * working correctly.
> +	 *
> +	 * Bit 0 of the second word is used for PageTail(). That
> +	 * means the other users of this union MUST leave the bit zero to
>  	 * avoid collision and false-positive PageTail().
>  	 */
>  	union {
>  		struct {	/* Page cache and anonymous pages */
> +			unsigned long flags;
>  			/**
>  			 * @lru: Pageout list, eg. active_list protected by
>  			 * lruvec->lru_lock.  Sometimes used as a generic list
> @@ -96,13 +102,14 @@ struct page {
>  			unsigned long private;
>  		};
>  		struct {	/* page_pool used by netstack */
> -			/**
> -			 * @dma_addr: might require a 64-bit value even on
> -			 * 32-bit architectures.
> -			 */
> -			dma_addr_t dma_addr;

The original intend of placing member @dma_addr here is that it overlap
with @LRU (type struct list_head) which contains two pointers.  Thus, in
case of CONFIG_ARCH_DMA_ADDR_T_64BIT=y on 32-bit architectures it would
use both pointers.

Thinking more about this, this design is flawed as bit 0 of the first
word is used for compound pages (see PageTail and @compound_head), is
reserved.  We knew DMA addresses were aligned, thus we though this
satisfied that need.  BUT for DMA_ADDR_T_64BIT=y on 32-bit arch the
first word will contain the "upper" part of the DMA addr, which I don't
think gives this guarantee.

I guess, nobody are using this combination?!?  I though we added this
to satisfy TI (Texas Instrument) driver cpsw (code in
drivers/net/ethernet/ti/cpsw*).  Thus, I assumed it was in use?


> +			unsigned long _pp_flags;
> +			unsigned long pp_magic;
> +			unsigned long xmi;

Matteo notice, I think intent is we can store xdp_mem_info in @xmi.

> +			unsigned long _pp_mapping_pad;
> +			dma_addr_t dma_addr;	/* might be one or two words */
>  		};

Could you explain your intent here?
I worry about @index.

As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
(code copy-pasted below signature) which imply that the member @index
have to be kept intact. In above, I'm unsure @index is untouched.

[1] https://lore.kernel.org/lkml/20210410082158.79ad09a6@carbon/
Matthew Wilcox (Oracle) April 11, 2021, 10:33 a.m. UTC | #2
On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 10 Apr 2021 21:52:45 +0100
> "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:
> 
> > 32-bit architectures which expect 8-byte alignment for 8-byte integers
> > and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> > page inadvertently expanded in 2019.  When the dma_addr_t was added,
> > it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> > gap between 'flags' and the union.
> > 
> > We could fix this by telling the compiler to use a smaller alignment
> > for the dma_addr, but that seems a little fragile.  Instead, move the
> > 'flags' into the union.  That causes dma_addr to shift into the same
> > bits as 'mapping', so it would have to be cleared on free.  To avoid
> > this, insert three words of padding and use the same bits as ->index
> > and ->private, neither of which have to be cleared on free.
> > 
> > Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  include/linux/mm_types.h | 38 ++++++++++++++++++++++++++------------
> >  1 file changed, 26 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6613b26a8894..45c563e9b50e 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -68,16 +68,22 @@ struct mem_cgroup;
> >  #endif
> >  
> >  struct page {
> > -	unsigned long flags;		/* Atomic flags, some possibly
> > -					 * updated asynchronously */
> >  	/*
> > -	 * Five words (20/40 bytes) are available in this union.
> > -	 * WARNING: bit 0 of the first word is used for PageTail(). That
> > -	 * means the other users of this union MUST NOT use the bit to
> > +	 * This union is six words (24 / 48 bytes) in size.
> > +	 * The first word is reserved for atomic flags, often updated
> > +	 * asynchronously.  Use the PageFoo() macros to access it.  Some
> > +	 * of the flags can be reused for your own purposes, but the
> > +	 * word as a whole often contains other information and overwriting
> > +	 * it will cause functions like page_zone() and page_node() to stop
> > +	 * working correctly.
> > +	 *
> > +	 * Bit 0 of the second word is used for PageTail(). That
> > +	 * means the other users of this union MUST leave the bit zero to
> >  	 * avoid collision and false-positive PageTail().
> >  	 */
> >  	union {
> >  		struct {	/* Page cache and anonymous pages */
> > +			unsigned long flags;
> >  			/**
> >  			 * @lru: Pageout list, eg. active_list protected by
> >  			 * lruvec->lru_lock.  Sometimes used as a generic list
> > @@ -96,13 +102,14 @@ struct page {
> >  			unsigned long private;
> >  		};
> >  		struct {	/* page_pool used by netstack */
> > -			/**
> > -			 * @dma_addr: might require a 64-bit value even on
> > -			 * 32-bit architectures.
> > -			 */
> > -			dma_addr_t dma_addr;
> 
> The original intend of placing member @dma_addr here is that it overlap
> with @LRU (type struct list_head) which contains two pointers.  Thus, in
> case of CONFIG_ARCH_DMA_ADDR_T_64BIT=y on 32-bit architectures it would
> use both pointers.
> 
> Thinking more about this, this design is flawed as bit 0 of the first
> word is used for compound pages (see PageTail and @compound_head), is
> reserved.  We knew DMA addresses were aligned, thus we though this
> satisfied that need.  BUT for DMA_ADDR_T_64BIT=y on 32-bit arch the
> first word will contain the "upper" part of the DMA addr, which I don't
> think gives this guarantee.
> 
> I guess, nobody are using this combination?!?  I though we added this
> to satisfy TI (Texas Instrument) driver cpsw (code in
> drivers/net/ethernet/ti/cpsw*).  Thus, I assumed it was in use?

It may be in use, but we've got away with it?  It's relatively rare
that this is going to bite us.  I think what has to happen is:

page is mapped to userspace
task calls get_user_page_fast(), loads the PTE
<preempted>
page is unmapped & freed
page is reallocated to the page_pool
page is DMA mapped to an address that happens to have that bit set
<first task resumes>
task looks for the compound_head() of that PTE, and attempts to bump
the refcount.  *oops*

If it has happened, would it have turned into a bug report?
If we had seen such a bug report, would we have noticed it?

> > +			unsigned long _pp_flags;
> > +			unsigned long pp_magic;
> > +			unsigned long xmi;
> 
> Matteo notice, I think intent is we can store xdp_mem_info in @xmi.

Yep.

> > +			unsigned long _pp_mapping_pad;
> > +			dma_addr_t dma_addr;	/* might be one or two words */
> >  		};
> 
> Could you explain your intent here?
> I worry about @index.
> 
> As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
> (code copy-pasted below signature) which imply that the member @index
> have to be kept intact. In above, I'm unsure @index is untouched.

Argh, I read that piece of your message, and then promptly forgot about
it.  I really don't like page_is_pfmemalloc() using the entirety of
page->index for this.  How about we just do what slab does anyway
and use PageActive for page_is_pfmemalloc()?

Basically, we have three aligned dwords here.  We can either alias with
@flags and the first word of @lru, or the second word of @lru and @mapping,
or @index and @private.  @flags is a non-starter.  If we use @mapping,
then you have to set it to NULL before you free it, and I'm not sure
how easy that will be for you.  If that's trivial, then we could use
the layout:

	unsigned long _pp_flags;
	unsigned long pp_magic;
	union {
		dma_addr_t dma_addr;    /* might be one or two words */
		unsigned long _pp_align[2];
	};
	unsigned long pp_pfmemalloc;
	unsigned long xmi;
Matthew Wilcox (Oracle) April 12, 2021, 1:15 a.m. UTC | #3
On Sun, Apr 11, 2021 at 11:33:18AM +0100, Matthew Wilcox wrote:
> Basically, we have three aligned dwords here.  We can either alias with
> @flags and the first word of @lru, or the second word of @lru and @mapping,
> or @index and @private.  @flags is a non-starter.  If we use @mapping,
> then you have to set it to NULL before you free it, and I'm not sure
> how easy that will be for you.  If that's trivial, then we could use
> the layout:
> 
> 	unsigned long _pp_flags;
> 	unsigned long pp_magic;
> 	union {
> 		dma_addr_t dma_addr;    /* might be one or two words */
> 		unsigned long _pp_align[2];
> 	};
> 	unsigned long pp_pfmemalloc;
> 	unsigned long xmi;

I forgot about the munmap path.  That calls zap_page_range() which calls
set_page_dirty() which calls page_mapping().  If we use page->mapping,
that's going to get interpreted as an address_space pointer.

*sigh*.  Foiled at every turn.

I'm kind of inclined towards using two (or more) bits for PageSlab as
we discussed here:

https://lore.kernel.org/linux-mm/01000163efe179fe-d6270c58-eaba-482f-a6bd-334667250ef7-000000@email.amazonses.com/

so we have PageKAlloc that's true for PageSlab, PagePool, PageDMAPool,
PageVMalloc, PageFrag and maybe a few other kernel-internal allocations.

(see also here:)
https://lore.kernel.org/linux-mm/20180518194519.3820-18-willy@infradead.org/
Matthew Wilcox (Oracle) April 12, 2021, 6:23 p.m. UTC | #4
On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> Could you explain your intent here?
> I worry about @index.
> 
> As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
> (code copy-pasted below signature) which imply that the member @index
> have to be kept intact. In above, I'm unsure @index is untouched.

Well, I tried three different approaches.  Here's the one I hated the least.

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Sat, 10 Apr 2021 16:12:06 -0400
Subject: [PATCH] mm: Fix struct page layout on 32-bit systems

32-bit architectures which expect 8-byte alignment for 8-byte integers
and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
page inadvertently expanded in 2019.  When the dma_addr_t was added,
it forced the alignment of the union to 8 bytes, which inserted a 4 byte
gap between 'flags' and the union.

We could fix this by telling the compiler to use a smaller alignment
for the dma_addr, but that seems a little fragile.  Instead, move the
'flags' into the union.  That causes dma_addr to shift into the same
bits as 'mapping', which causes problems with page_mapping() called from
set_page_dirty() in the munmap path.  To avoid this, insert three words
of padding and use the same bits as ->index and ->private, neither of
which have to be cleared on free.

However, page->index is currently used to indicate page_is_pfmemalloc.
Move that information to bit 1 of page->lru (aka compound_head).  This
has the same properties; it will be overwritten by callers who do
not care about pfmemalloc (as opposed to using a bit in page->flags).

Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h       | 12 +++++++-----
 include/linux/mm_types.h | 38 ++++++++++++++++++++++++++------------
 2 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b58c73e50da0..23cca0eaa9da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1668,10 +1668,12 @@ struct address_space *page_mapping(struct page *page);
 static inline bool page_is_pfmemalloc(const struct page *page)
 {
 	/*
-	 * Page index cannot be this large so this must be
-	 * a pfmemalloc page.
+	 * This is not a tail page; compound_head of a head page is unused
+	 * at return from the page allocator, and will be overwritten
+	 * by callers who do not care whether the page came from the
+	 * reserves.
 	 */
-	return page->index == -1UL;
+	return page->compound_head & 2;
 }
 
 /*
@@ -1680,12 +1682,12 @@ static inline bool page_is_pfmemalloc(const struct page *page)
  */
 static inline void set_page_pfmemalloc(struct page *page)
 {
-	page->index = -1UL;
+	page->compound_head = 2;
 }
 
 static inline void clear_page_pfmemalloc(struct page *page)
 {
-	page->index = 0;
+	page->compound_head = 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..45c563e9b50e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -68,16 +68,22 @@ struct mem_cgroup;
 #endif
 
 struct page {
-	unsigned long flags;		/* Atomic flags, some possibly
-					 * updated asynchronously */
 	/*
-	 * Five words (20/40 bytes) are available in this union.
-	 * WARNING: bit 0 of the first word is used for PageTail(). That
-	 * means the other users of this union MUST NOT use the bit to
+	 * This union is six words (24 / 48 bytes) in size.
+	 * The first word is reserved for atomic flags, often updated
+	 * asynchronously.  Use the PageFoo() macros to access it.  Some
+	 * of the flags can be reused for your own purposes, but the
+	 * word as a whole often contains other information and overwriting
+	 * it will cause functions like page_zone() and page_node() to stop
+	 * working correctly.
+	 *
+	 * Bit 0 of the second word is used for PageTail(). That
+	 * means the other users of this union MUST leave the bit zero to
 	 * avoid collision and false-positive PageTail().
 	 */
 	union {
 		struct {	/* Page cache and anonymous pages */
+			unsigned long flags;
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
 			 * lruvec->lru_lock.  Sometimes used as a generic list
@@ -96,13 +102,14 @@ struct page {
 			unsigned long private;
 		};
 		struct {	/* page_pool used by netstack */
-			/**
-			 * @dma_addr: might require a 64-bit value even on
-			 * 32-bit architectures.
-			 */
-			dma_addr_t dma_addr;
+			unsigned long _pp_flags;
+			unsigned long pp_magic;
+			unsigned long xmi;
+			unsigned long _pp_mapping_pad;
+			dma_addr_t dma_addr;	/* might be one or two words */
 		};
 		struct {	/* slab, slob and slub */
+			unsigned long _slab_flags;
 			union {
 				struct list_head slab_list;
 				struct {	/* Partial pages */
@@ -130,6 +137,7 @@ struct page {
 			};
 		};
 		struct {	/* Tail pages of compound page */
+			unsigned long _t1_flags;
 			unsigned long compound_head;	/* Bit zero is set */
 
 			/* First tail page only */
@@ -139,12 +147,14 @@ struct page {
 			unsigned int compound_nr; /* 1 << compound_order */
 		};
 		struct {	/* Second tail page of compound page */
+			unsigned long _t2_flags;
 			unsigned long _compound_pad_1;	/* compound_head */
 			atomic_t hpage_pinned_refcount;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
 		struct {	/* Page table pages */
+			unsigned long _pt_flags;
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
@@ -159,6 +169,7 @@ struct page {
 #endif
 		};
 		struct {	/* ZONE_DEVICE pages */
+			unsigned long _zd_flags;
 			/** @pgmap: Points to the hosting device page map. */
 			struct dev_pagemap *pgmap;
 			void *zone_device_data;
@@ -174,8 +185,11 @@ struct page {
 			 */
 		};
 
-		/** @rcu_head: You can use this to free a page by RCU. */
-		struct rcu_head rcu_head;
+		struct {
+			unsigned long _rcu_flags;
+			/** @rcu_head: You can use this to free a page by RCU. */
+			struct rcu_head rcu_head;
+		};
 	};
 
 	union {		/* This union is 4 bytes in size. */
David Laight April 13, 2021, 8:21 a.m. UTC | #5
From: Matthew Wilcox <willy@infradead.org>
> Sent: 12 April 2021 19:24
> 
> On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> > Could you explain your intent here?
> > I worry about @index.
> >
> > As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
> > (code copy-pasted below signature) which imply that the member @index
> > have to be kept intact. In above, I'm unsure @index is untouched.
> 
> Well, I tried three different approaches.  Here's the one I hated the least.
> 
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Sat, 10 Apr 2021 16:12:06 -0400
> Subject: [PATCH] mm: Fix struct page layout on 32-bit systems
> 
> 32-bit architectures which expect 8-byte alignment for 8-byte integers
> and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> page inadvertently expanded in 2019.  When the dma_addr_t was added,
> it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> gap between 'flags' and the union.
> 
> We could fix this by telling the compiler to use a smaller alignment
> for the dma_addr, but that seems a little fragile.  Instead, move the
> 'flags' into the union.  That causes dma_addr to shift into the same
> bits as 'mapping', which causes problems with page_mapping() called from
> set_page_dirty() in the munmap path.  To avoid this, insert three words
> of padding and use the same bits as ->index and ->private, neither of
> which have to be cleared on free.

This all looks horribly fragile and is bound to get broken again.
Are there two problems?
1) The 'folio' structure needs to match 'rcu' part of the page
   so that it can use the same rcu list to free items.
2) Various uses of 'struct page' need to overlay fields to save space.

For (1) the rcu bit should probably be a named structure in an
anonymous union - probably in both structures.

For (2) is it worth explicitly defining the word number for each field?
So you end up with something like:
#define F(offset, member) struct { long _pad_##offset[offset]; member; }
struct page [
	union {
		struct page_rcu;
		unsigned long flags;
		F(1, unsigned long xxx);
		F(2, unsigned long yyy);
	etc.


		
...
>  		struct {	/* page_pool used by netstack */
> -			/**
> -			 * @dma_addr: might require a 64-bit value even on
> -			 * 32-bit architectures.
> -			 */
> -			dma_addr_t dma_addr;
> +			unsigned long _pp_flags;
> +			unsigned long pp_magic;
> +			unsigned long xmi;
> +			unsigned long _pp_mapping_pad;
> +			dma_addr_t dma_addr;	/* might be one or two words */
>  		};

Isn't that 6 words?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jesper Dangaard Brouer April 14, 2021, 8:10 a.m. UTC | #6
On Mon, 12 Apr 2021 02:15:32 +0100
Matthew Wilcox <willy@infradead.org> wrote:

> On Sun, Apr 11, 2021 at 11:33:18AM +0100, Matthew Wilcox wrote:
> > Basically, we have three aligned dwords here.  We can either alias with
> > @flags and the first word of @lru, or the second word of @lru and @mapping,
> > or @index and @private.  @flags is a non-starter.  If we use @mapping,
> > then you have to set it to NULL before you free it, and I'm not sure
> > how easy that will be for you.  If that's trivial, then we could use
> > the layout:
> > 
> > 	unsigned long _pp_flags;
> > 	unsigned long pp_magic;
> > 	union {
> > 		dma_addr_t dma_addr;    /* might be one or two words */
> > 		unsigned long _pp_align[2];
> > 	};
> > 	unsigned long pp_pfmemalloc;
> > 	unsigned long xmi;  
> 
> I forgot about the munmap path.  That calls zap_page_range() which calls
> set_page_dirty() which calls page_mapping().  If we use page->mapping,
> that's going to get interpreted as an address_space pointer.
> 
> *sigh*.  Foiled at every turn.

Yes, indeed! - And very frustrating.  It's keeping me up at night.
I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
me that I don't sleep well with these kind of dreams ;-)

> I'm kind of inclined towards using two (or more) bits for PageSlab as
> we discussed here:
> 
> https://lore.kernel.org/linux-mm/01000163efe179fe-d6270c58-eaba-482f-a6bd-334667250ef7-000000@email.amazonses.com/
> 
> so we have PageKAlloc that's true for PageSlab, PagePool, PageDMAPool,
> PageVMalloc, PageFrag and maybe a few other kernel-internal allocations.

I actually like this idea a lot.  I also think it will solve or remove
Matteo/Ilias'es[2] need to introduce the pp_magic signature.  Ilias do
say[1] that page_pool pages could be used for TCP RX zerocopy, but I
don't think we should "allow" that (meaning page_pool should drop the
DMA-mapping and give up recycling).  I should argue why in that thread.

That said, I think we need to have a quicker fix for the immediate
issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
it leaves[3] in struct page.  In[3] you mention ppc32, does it only
happens on certain 32-bit archs?

I'm seriously considering removing page_pool's support for doing/keeping
DMA-mappings on 32-bit arch's.  AFAIK only a single driver use this.

> (see also here:)
> https://lore.kernel.org/linux-mm/20180518194519.3820-18-willy@infradead.org/

[1] https://lore.kernel.org/netdev/YHHuE7g73mZNrMV4@enceladus/
[2] https://patchwork.kernel.org/project/netdevbpf/patch/20210409223801.104657-3-mcroce@linux.microsoft.com/
[3] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/
Matthew Wilcox (Oracle) April 14, 2021, 11:50 a.m. UTC | #7
On Wed, Apr 14, 2021 at 10:10:44AM +0200, Jesper Dangaard Brouer wrote:
> Yes, indeed! - And very frustrating.  It's keeping me up at night.
> I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
> me that I don't sleep well with these kind of dreams ;-)

Then you're going to love this ... even with the latest patch, there's
still a problem.  Because dma_addr_t is still 64-bit aligned _as a type_,
that forces the union to be 64-bit aligned (as we already knew and worked
around), but what I'd forgotten is that forces the entirety of struct
page to be 64-bit aligned.  Which means ...

        /* size: 40, cachelines: 1, members: 4 */
        /* padding: 4 */
        /* forced alignments: 1 */
        /* last cacheline: 40 bytes */
} __attribute__((__aligned__(8)));

.. that we still have a hole!  It's just moved from being at offset 4
to being at offset 36.

> That said, I think we need to have a quicker fix for the immediate
> issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
> it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> happens on certain 32-bit archs?

AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
on x86-32 because dma_addr_t is 32-bit aligned.

Doing this fixes it:

+++ b/include/linux/types.h
@@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
  * so they don't care about the size of the actual bus addresses.
  */
 #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
-typedef u64 dma_addr_t;
+typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
 #else
 typedef u32 dma_addr_t;
 #endif

> I'm seriously considering removing page_pool's support for doing/keeping
> DMA-mappings on 32-bit arch's.  AFAIK only a single driver use this.

... if you're going to do that, then we don't need to do this.
Ilias Apalodimas April 14, 2021, 11:56 a.m. UTC | #8
On Wed, Apr 14, 2021 at 12:50:52PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 14, 2021 at 10:10:44AM +0200, Jesper Dangaard Brouer wrote:
> > Yes, indeed! - And very frustrating.  It's keeping me up at night.
> > I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
> > me that I don't sleep well with these kind of dreams ;-)
> 
> Then you're going to love this ... even with the latest patch, there's
> still a problem.  Because dma_addr_t is still 64-bit aligned _as a type_,
> that forces the union to be 64-bit aligned (as we already knew and worked
> around), but what I'd forgotten is that forces the entirety of struct
> page to be 64-bit aligned.  Which means ...
> 
>         /* size: 40, cachelines: 1, members: 4 */
>         /* padding: 4 */
>         /* forced alignments: 1 */
>         /* last cacheline: 40 bytes */
> } __attribute__((__aligned__(8)));
> 
> .. that we still have a hole!  It's just moved from being at offset 4
> to being at offset 36.
> 
> > That said, I think we need to have a quicker fix for the immediate
> > issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
> > it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> > happens on certain 32-bit archs?
> 
> AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
> on x86-32 because dma_addr_t is 32-bit aligned.
> 
> Doing this fixes it:
> 
> +++ b/include/linux/types.h
> @@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
>   * so they don't care about the size of the actual bus addresses.
>   */
>  #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> -typedef u64 dma_addr_t;
> +typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
>  #else
>  typedef u32 dma_addr_t;
>  #endif
> 
> > I'm seriously considering removing page_pool's support for doing/keeping
> > DMA-mappings on 32-bit arch's.  AFAIK only a single driver use this.
> 
> ... if you're going to do that, then we don't need to do this.

FWIW I already proposed that to Matthew in private a few days ago...
II am not even sure the AM572x has that support.  I'd much prefer getting rid
of it as well, instead of overcomplicating the struct for a device noone is
going to need.

Cheers
/Ilias
David Laight April 14, 2021, 3:52 p.m. UTC | #9
> Doing this fixes it:
> 
> +++ b/include/linux/types.h
> @@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
>   * so they don't care about the size of the actual bus addresses.
>   */
>  #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> -typedef u64 dma_addr_t;
> +typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
>  #else
>  typedef u32 dma_addr_t;
>  #endif

I hate __packed so much I've been checking what it does!

If you add __packed to the dma_addr_t field inside the union
then gcc (at least) removes the pad from before it, but also
'remembers' the alignment that is enforced by other members
of the structure.

So you don't need the extra aligned(sizeof (void *)) since
that is implicit.

So in this case __packed probably has no side effects.
(Unless a 32bit arch has instructions for a 64bit read
that must not be on an 8n+4 boundary and the address is taken).

It also doesn't affect 64bit - since the previous field
forces 64bit alignment.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jesper Dangaard Brouer April 14, 2021, 7:13 p.m. UTC | #10
On Wed, 14 Apr 2021 12:50:52 +0100
Matthew Wilcox <willy@infradead.org> wrote:

> > That said, I think we need to have a quicker fix for the immediate
> > issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
> > it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> > happens on certain 32-bit archs?  
> 
> AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
> on x86-32 because dma_addr_t is 32-bit aligned.

(If others want to reproduce).  First I could not reproduce on ARM32.
Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.

Details below signature.
Matthew Wilcox (Oracle) April 14, 2021, 9:35 p.m. UTC | #11
On Wed, Apr 14, 2021 at 09:13:22PM +0200, Jesper Dangaard Brouer wrote:
> (If others want to reproduce).  First I could not reproduce on ARM32.
> Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
> cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.

hmmm ... you should be able to provoke it by enabling ARM_LPAE,
which selects PHYS_ADDR_T_64BIT, and

config ARCH_DMA_ADDR_T_64BIT
        def_bool 64BIT || PHYS_ADDR_T_64BIT

>  struct page {
>         long unsigned int          flags;                /*     0     4 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         union {
>                 struct {
>                         struct list_head lru;            /*     8     8 */
>                         struct address_space * mapping;  /*    16     4 */
>                         long unsigned int index;         /*    20     4 */
>                         long unsigned int private;       /*    24     4 */
>                 };                                       /*     8    20 */
>                 struct {
>                         dma_addr_t dma_addr;             /*     8     8 */
>                 };                                       /*     8     8 */
[...]
>         } __attribute__((__aligned__(8)));               /*     8    24 */
>         union {
>                 atomic_t           _mapcount;            /*    32     4 */
>                 unsigned int       page_type;            /*    32     4 */
>                 unsigned int       active;               /*    32     4 */
>                 int                units;                /*    32     4 */
>         };                                               /*    32     4 */
>         atomic_t                   _refcount;            /*    36     4 */
> 
>         /* size: 40, cachelines: 1, members: 4 */
>         /* sum members: 36, holes: 1, sum holes: 4 */
>         /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
>         /* last cacheline: 40 bytes */
> } __attribute__((__aligned__(8)));

If you also enable CONFIG_MEMCG or enough options to make
LAST_CPUPID_NOT_IN_PAGE_FLAGS true, you'll end up with another 4-byte
hole at the end.
David Laight April 14, 2021, 9:56 p.m. UTC | #12
From: Matthew Wilcox
> Sent: 14 April 2021 22:36
> 
> On Wed, Apr 14, 2021 at 09:13:22PM +0200, Jesper Dangaard Brouer wrote:
> > (If others want to reproduce).  First I could not reproduce on ARM32.
> > Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
> > cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.
> 
> hmmm ... you should be able to provoke it by enabling ARM_LPAE,
> which selects PHYS_ADDR_T_64BIT, and
> 
> config ARCH_DMA_ADDR_T_64BIT
>         def_bool 64BIT || PHYS_ADDR_T_64BIT
> 
> >  struct page {
> >         long unsigned int          flags;                /*     0     4 */
> >
> >         /* XXX 4 bytes hole, try to pack */
> >
> >         union {
> >                 struct {
> >                         struct list_head lru;            /*     8     8 */
> >                         struct address_space * mapping;  /*    16     4 */
> >                         long unsigned int index;         /*    20     4 */
> >                         long unsigned int private;       /*    24     4 */
> >                 };                                       /*     8    20 */
> >                 struct {
> >                         dma_addr_t dma_addr

Adding __packed here will remove the 4 byte hole before the union
and the compiler seems clever enough to know that anything following
a 'long' must also be 'long' aligned.
So you don't get anything horrid like byte accesses.
On 64bit dma_addr will remain 64bit aligned.
On arm32 dma_addr will be 32bit aligned - but forcing two 32bit access
won't make any difference.

So definitely the only simple fix.

	David

> >                                            ;             /*     8     8 */
> >                 };                                       /*     8     8 */
> [...]
> >         } __attribute__((__aligned__(8)));               /*     8    24 */
> >         union {
> >                 atomic_t           _mapcount;            /*    32     4 */
> >                 unsigned int       page_type;            /*    32     4 */
> >                 unsigned int       active;               /*    32     4 */
> >                 int                units;                /*    32     4 */
> >         };                                               /*    32     4 */
> >         atomic_t                   _refcount;            /*    36     4 */
> >
> >         /* size: 40, cachelines: 1, members: 4 */
> >         /* sum members: 36, holes: 1, sum holes: 4 */
> >         /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
> >         /* last cacheline: 40 bytes */
> > } __attribute__((__aligned__(8)));
> 
> If you also enable CONFIG_MEMCG or enough options to make
> LAST_CPUPID_NOT_IN_PAGE_FLAGS true, you'll end up with another 4-byte
> hole at the end.

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jesper Dangaard Brouer April 15, 2021, 6:08 p.m. UTC | #13
On Wed, 14 Apr 2021 21:56:39 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Matthew Wilcox
> > Sent: 14 April 2021 22:36
> > 
> > On Wed, Apr 14, 2021 at 09:13:22PM +0200, Jesper Dangaard Brouer wrote:  
> > > (If others want to reproduce).  First I could not reproduce on ARM32.
> > > Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
> > > cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.  
> > 
> > hmmm ... you should be able to provoke it by enabling ARM_LPAE,
> > which selects PHYS_ADDR_T_64BIT, and
> > 
> > config ARCH_DMA_ADDR_T_64BIT
> >         def_bool 64BIT || PHYS_ADDR_T_64BIT
> >   
> > >  struct page {
> > >         long unsigned int          flags;                /*     0     4 */
> > >
> > >         /* XXX 4 bytes hole, try to pack */
> > >
> > >         union {
> > >                 struct {
> > >                         struct list_head lru;            /*     8     8 */
> > >                         struct address_space * mapping;  /*    16     4 */
> > >                         long unsigned int index;         /*    20     4 */
> > >                         long unsigned int private;       /*    24     4 */
> > >                 };                                       /*     8    20 */
> > >                 struct {
> > >                         dma_addr_t dma_addr  
> 
> Adding __packed here will remove the 4 byte hole before the union
> and the compiler seems clever enough to know that anything following
> a 'long' must also be 'long' aligned.

Played with __packed in below patch, and I can confirm it seems to work.

> So you don't get anything horrid like byte accesses.
> On 64bit dma_addr will remain 64bit aligned.
> On arm32 dma_addr will be 32bit aligned - but forcing two 32bit access
> won't make any difference.

See below patch.  Where I swap32 the dma address to satisfy
page->compound having bit zero cleared. (It is the simplest fix I could
come up with).


[PATCH] page_pool: handling 32-bit archs with 64-bit dma_addr_t

From: Jesper Dangaard Brouer <brouer@redhat.com>

Workaround for storing 64-bit DMA-addr on 32-bit machines in struct
page.  The page->dma_addr share area with page->compound_head which
use bit zero to mark compound pages. This is okay, as DMA-addr are
aligned pointers which have bit zero cleared.

In the 32-bit case, page->compound_head is 32-bit.  Thus, when
dma_addr_t is 64-bit it will be located in top 32-bit.  Solve by
swapping dma_addr 32-bit segments.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/mm_types.h |    2 +-
 include/linux/types.h    |    1 +
 include/net/page_pool.h  |   21 ++++++++++++++++++++-
 net/core/page_pool.c     |    8 +++++---
 4 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..27406e3b1e1b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -100,7 +100,7 @@ struct page {
 			 * @dma_addr: might require a 64-bit value even on
 			 * 32-bit architectures.
 			 */
-			dma_addr_t dma_addr;
+			dma_addr_t dma_addr __packed;
 		};
 		struct {	/* slab, slob and slub */
 			union {
diff --git a/include/linux/types.h b/include/linux/types.h
index ac825ad90e44..65fd5d630016 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -141,6 +141,7 @@ typedef u64 blkcnt_t;
  */
 #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
 typedef u64 dma_addr_t;
+//typedef u64 __attribute__((aligned(sizeof(void *)))) dma_addr_t;
 #else
 typedef u32 dma_addr_t;
 #endif
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..c2329088665c 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -196,9 +196,28 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 	page_pool_put_full_page(pool, page, true);
 }
 
+static inline
+dma_addr_t page_pool_dma_addr_read(dma_addr_t dma_addr)
+{
+	/* Workaround for storing 64-bit DMA-addr on 32-bit machines in struct
+	 * page.  The page->dma_addr share area with page->compound_head which
+	 * use bit zero to mark compound pages. This is okay, as DMA-addr are
+	 * aligned pointers which have bit zero cleared.
+	 *
+	 * In the 32-bit case, page->compound_head is 32-bit.  Thus, when
+	 * dma_addr_t is 64-bit it will be located in top 32-bit.  Solve by
+	 * swapping dma_addr 32-bit segments.
+	 */
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+	if (sizeof(long unsigned int) == 4) /* 32-bit system */
+		dma_addr = (dma_addr << 32) | (dma_addr >> 32);
+#endif
+	return dma_addr;
+}
+
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	return page->dma_addr;
+	return page_pool_dma_addr_read(page->dma_addr);
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..813598ea23f6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					  struct page *page,
 					  unsigned int dma_sync_size)
 {
+	dma_addr_t dma = page_pool_dma_addr_read(page->dma_addr);
+
 	dma_sync_size = min(dma_sync_size, pool->p.max_len);
-	dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
+	dma_sync_single_range_for_device(pool->p.dev, dma,
 					 pool->p.offset, dma_sync_size,
 					 pool->p.dma_dir);
 }
@@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 		put_page(page);
 		return NULL;
 	}
-	page->dma_addr = dma;
+	page->dma_addr = page_pool_dma_addr_read(dma);
 
 	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
@@ -294,7 +296,7 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 		 */
 		goto skip_dma_unmap;
 
-	dma = page->dma_addr;
+	dma = page_pool_dma_addr_read(page->dma_addr);
 
 	/* When page is unmapped, it cannot be returned our pool */
 	dma_unmap_page_attrs(pool->p.dev, dma,


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Matthew Wilcox (Oracle) April 15, 2021, 6:21 p.m. UTC | #14
On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> +static inline
> +dma_addr_t page_pool_dma_addr_read(dma_addr_t dma_addr)
> +{
> +	/* Workaround for storing 64-bit DMA-addr on 32-bit machines in struct
> +	 * page.  The page->dma_addr share area with page->compound_head which
> +	 * use bit zero to mark compound pages. This is okay, as DMA-addr are
> +	 * aligned pointers which have bit zero cleared.
> +	 *
> +	 * In the 32-bit case, page->compound_head is 32-bit.  Thus, when
> +	 * dma_addr_t is 64-bit it will be located in top 32-bit.  Solve by
> +	 * swapping dma_addr 32-bit segments.
> +	 */
> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT

#if defined(CONFIG_ARCH_DMA_ADDR_T_64BIT) && defined(__BIG_ENDIAN)
otherwise you'll create the problem on ARM that you're avoiding on PPC ...

I think you want to delete the word '_read' from this function name because
you're using it for both read and write.
David Laight April 15, 2021, 9:11 p.m. UTC | #15
From: Matthew Wilcox <willy@infradead.org>
> Sent: 15 April 2021 19:22
> 
> On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > +static inline
> > +dma_addr_t page_pool_dma_addr_read(dma_addr_t dma_addr)
> > +{
> > +	/* Workaround for storing 64-bit DMA-addr on 32-bit machines in struct
> > +	 * page.  The page->dma_addr share area with page->compound_head which
> > +	 * use bit zero to mark compound pages. This is okay, as DMA-addr are
> > +	 * aligned pointers which have bit zero cleared.
> > +	 *
> > +	 * In the 32-bit case, page->compound_head is 32-bit.  Thus, when
> > +	 * dma_addr_t is 64-bit it will be located in top 32-bit.  Solve by
> > +	 * swapping dma_addr 32-bit segments.
> > +	 */
> > +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> 
> #if defined(CONFIG_ARCH_DMA_ADDR_T_64BIT) && defined(__BIG_ENDIAN)
> otherwise you'll create the problem on ARM that you're avoiding on PPC ...
> 
> I think you want to delete the word '_read' from this function name because
> you're using it for both read and write.

I think I'd use explicit dma_addr_hi and dma_addr_lo and
separate read/write functions just to make absolutely sure
nothing picks up the swapped value.

Isn't it possible to move the field down one long?
This might require an explicit zero - but this is not a common
code path - the extra write will be noise.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox (Oracle) April 15, 2021, 10:22 p.m. UTC | #16
On Thu, Apr 15, 2021 at 09:11:56PM +0000, David Laight wrote:
> Isn't it possible to move the field down one long?
> This might require an explicit zero - but this is not a common
> code path - the extra write will be noise.

Then it overlaps page->mapping.  See emails passim.
David Laight April 16, 2021, 7:32 a.m. UTC | #17
From: Matthew Wilcox <willy@infradead.org>
> Sent: 15 April 2021 23:22
> 
> On Thu, Apr 15, 2021 at 09:11:56PM +0000, David Laight wrote:
> > Isn't it possible to move the field down one long?
> > This might require an explicit zero - but this is not a common
> > code path - the extra write will be noise.
> 
> Then it overlaps page->mapping.  See emails passim.

The rules on overlaps make be wonder if every 'long'
should be in its own union.
The comments would need to say when each field is used.
It would, at least, make these errors less common.

That doesn't solve the 64bit dma_addr though.

Actually rather that word-swapping dma_addr on 32bit BE
could you swap over the two fields it overlays with.
That might look messy in the .h, but it doesn't require
an accessor function to do the swap - easily missed.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox (Oracle) April 16, 2021, 11:05 a.m. UTC | #18
On Fri, Apr 16, 2021 at 07:32:35AM +0000, David Laight wrote:
> From: Matthew Wilcox <willy@infradead.org>
> > Sent: 15 April 2021 23:22
> > 
> > On Thu, Apr 15, 2021 at 09:11:56PM +0000, David Laight wrote:
> > > Isn't it possible to move the field down one long?
> > > This might require an explicit zero - but this is not a common
> > > code path - the extra write will be noise.
> > 
> > Then it overlaps page->mapping.  See emails passim.
> 
> The rules on overlaps make be wonder if every 'long'
> should be in its own union.

That was what we used to have.  It was worse.

> The comments would need to say when each field is used.
> It would, at least, make these errors less common.
> 
> That doesn't solve the 64bit dma_addr though.
> 
> Actually rather that word-swapping dma_addr on 32bit BE
> could you swap over the two fields it overlays with.
> That might look messy in the .h, but it doesn't require
> an accessor function to do the swap - easily missed.

No.
Matthew Wilcox (Oracle) April 16, 2021, 3:27 p.m. UTC | #19
On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> See below patch.  Where I swap32 the dma address to satisfy
> page->compound having bit zero cleared. (It is the simplest fix I could
> come up with).

I think this is slightly simpler, and as a bonus code that assumes the
old layout won't compile.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..5aacc1c10a45 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -97,10 +97,10 @@ struct page {
 		};
 		struct {	/* page_pool used by netstack */
 			/**
-			 * @dma_addr: might require a 64-bit value even on
+			 * @dma_addr: might require a 64-bit value on
 			 * 32-bit architectures.
 			 */
-			dma_addr_t dma_addr;
+			unsigned long dma_addr[2];
 		};
 		struct {	/* slab, slob and slub */
 			union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..db7c7020746a 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	return page->dma_addr;
+	dma_addr_t ret = page->dma_addr[0];
+	if (sizeof(dma_addr_t) > sizeof(unsigned long))
+		ret |= (dma_addr_t)page->dma_addr[1] << 32;
+	return ret;
+}
+
+static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+{
+	page->dma_addr[0] = addr;
+	if (sizeof(dma_addr_t) > sizeof(unsigned long))
+		page->dma_addr[1] = addr >> 32;
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..f014fd8c19a6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					  struct page *page,
 					  unsigned int dma_sync_size)
 {
+	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+
 	dma_sync_size = min(dma_sync_size, pool->p.max_len);
-	dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
+	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
 					 pool->p.offset, dma_sync_size,
 					 pool->p.dma_dir);
 }
@@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 		put_page(page);
 		return NULL;
 	}
-	page->dma_addr = dma;
+	page_pool_set_dma_addr(page, dma);
 
 	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
@@ -294,13 +296,13 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 		 */
 		goto skip_dma_unmap;
 
-	dma = page->dma_addr;
+	dma = page_pool_get_dma_addr(page);
 
-	/* When page is unmapped, it cannot be returned our pool */
+	/* When page is unmapped, it cannot be returned to our pool */
 	dma_unmap_page_attrs(pool->p.dev, dma,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC);
-	page->dma_addr = 0;
+	page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
Jesper Dangaard Brouer April 16, 2021, 5:08 p.m. UTC | #20
On Fri, 16 Apr 2021 16:27:55 +0100
Matthew Wilcox <willy@infradead.org> wrote:

> On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > See below patch.  Where I swap32 the dma address to satisfy
> > page->compound having bit zero cleared. (It is the simplest fix I could
> > come up with).  
> 
> I think this is slightly simpler, and as a bonus code that assumes the
> old layout won't compile.

This is clever, I like it!  When reading the code one just have to
remember 'unsigned long' size difference between 64-bit vs 32-bit.
And I assume compiler can optimize the sizeof check out then doable.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..5aacc1c10a45 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -97,10 +97,10 @@ struct page {
>  		};
>  		struct {	/* page_pool used by netstack */
>  			/**
> -			 * @dma_addr: might require a 64-bit value even on
> +			 * @dma_addr: might require a 64-bit value on
>  			 * 32-bit architectures.
>  			 */
> -			dma_addr_t dma_addr;
> +			unsigned long dma_addr[2];
>  		};
>  		struct {	/* slab, slob and slub */
>  			union {
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..db7c7020746a 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>  
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -	return page->dma_addr;
> +	dma_addr_t ret = page->dma_addr[0];
> +	if (sizeof(dma_addr_t) > sizeof(unsigned long))
> +		ret |= (dma_addr_t)page->dma_addr[1] << 32;
> +	return ret;
> +}
> +
> +static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> +{
> +	page->dma_addr[0] = addr;
> +	if (sizeof(dma_addr_t) > sizeof(unsigned long))
> +		page->dma_addr[1] = addr >> 32;
>  }
>  
>  static inline bool is_page_pool_compiled_in(void)
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..f014fd8c19a6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>  					  struct page *page,
>  					  unsigned int dma_sync_size)
>  {
> +	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> +
>  	dma_sync_size = min(dma_sync_size, pool->p.max_len);
> -	dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
> +	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>  					 pool->p.offset, dma_sync_size,
>  					 pool->p.dma_dir);
>  }
> @@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  		put_page(page);
>  		return NULL;
>  	}
> -	page->dma_addr = dma;
> +	page_pool_set_dma_addr(page, dma);
>  
>  	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>  		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> @@ -294,13 +296,13 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>  		 */
>  		goto skip_dma_unmap;
>  
> -	dma = page->dma_addr;
> +	dma = page_pool_get_dma_addr(page);
>  
> -	/* When page is unmapped, it cannot be returned our pool */
> +	/* When page is unmapped, it cannot be returned to our pool */
>  	dma_unmap_page_attrs(pool->p.dev, dma,
>  			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
>  			     DMA_ATTR_SKIP_CPU_SYNC);
> -	page->dma_addr = 0;
> +	page_pool_set_dma_addr(page, 0);
>  skip_dma_unmap:
>  	/* This may be the last page returned, releasing the pool, so
>  	 * it is not safe to reference pool afterwards.
>
Matthew Wilcox (Oracle) April 17, 2021, 3:19 a.m. UTC | #21
On Fri, Apr 16, 2021 at 07:08:23PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 16 Apr 2021 16:27:55 +0100
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > > See below patch.  Where I swap32 the dma address to satisfy
> > > page->compound having bit zero cleared. (It is the simplest fix I could
> > > come up with).  
> > 
> > I think this is slightly simpler, and as a bonus code that assumes the
> > old layout won't compile.
> 
> This is clever, I like it!  When reading the code one just have to
> remember 'unsigned long' size difference between 64-bit vs 32-bit.
> And I assume compiler can optimize the sizeof check out then doable.

I checked before/after with the replacement patch that doesn't
have compiler warnings.  On x86, there is zero codegen difference
(objdump -dr before/after matches exactly) for both x86-32 with 32-bit
dma_addr_t and x86-64.  For x86-32 with 64-bit dma_addr_t, the compiler
makes some different inlining decisions in page_pool_empty_ring(),
page_pool_put_page() and page_pool_put_page_bulk(), but it's not clear
to me that they're wrong.

Function                                     old     new   delta
page_pool_empty_ring                         387     307     -80
page_pool_put_page                           604     516     -88
page_pool_put_page_bulk                      690     517    -173
Arnd Bergmann April 17, 2021, 10:31 a.m. UTC | #22
On Fri, Apr 16, 2021 at 5:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..db7c7020746a 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -       return page->dma_addr;
> +       dma_addr_t ret = page->dma_addr[0];
> +       if (sizeof(dma_addr_t) > sizeof(unsigned long))
> +               ret |= (dma_addr_t)page->dma_addr[1] << 32;
> +       return ret;
> +}

Have you considered using a PFN type address here? I suspect you
can prove that shifting the DMA address by PAGE_BITS would
make it fit into an 'unsigned long' on all 32-bit architectures with
64-bit dma_addr_t. This requires that page->dma_addr to be
page aligned, as well as fit into 44 bits. I recently went through the
maximum address space per architecture to define a
MAX_POSSIBLE_PHYSMEM_BITS, and none of them have more than
40 here, presumably the same is true for dma address space.

        Arnd
David Laight April 17, 2021, 10:59 a.m. UTC | #23
From: Matthew Wilcox
> Sent: 16 April 2021 16:28
> 
> On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > See below patch.  Where I swap32 the dma address to satisfy
> > page->compound having bit zero cleared. (It is the simplest fix I could
> > come up with).
> 
> I think this is slightly simpler, and as a bonus code that assumes the
> old layout won't compile.

Always a good plan.

...
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -	return page->dma_addr;
> +	dma_addr_t ret = page->dma_addr[0];
> +	if (sizeof(dma_addr_t) > sizeof(unsigned long))
> +		ret |= (dma_addr_t)page->dma_addr[1] << 32;
> +	return ret;
> +}

Won't some compiler/option combinations generate an
error for the '<< 32' when dma_addr_t is 32bit?

You might need to use a (u64) cast.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox (Oracle) April 17, 2021, 1:56 p.m. UTC | #24
On Sat, Apr 17, 2021 at 12:31:37PM +0200, Arnd Bergmann wrote:
> On Fri, Apr 16, 2021 at 5:27 PM Matthew Wilcox <willy@infradead.org> wrote:
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index b5b195305346..db7c7020746a 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >
> >  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >  {
> > -       return page->dma_addr;
> > +       dma_addr_t ret = page->dma_addr[0];
> > +       if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > +               ret |= (dma_addr_t)page->dma_addr[1] << 32;
> > +       return ret;
> > +}
> 
> Have you considered using a PFN type address here? I suspect you
> can prove that shifting the DMA address by PAGE_BITS would
> make it fit into an 'unsigned long' on all 32-bit architectures with
> 64-bit dma_addr_t. This requires that page->dma_addr to be
> page aligned, as well as fit into 44 bits. I recently went through the
> maximum address space per architecture to define a
> MAX_POSSIBLE_PHYSMEM_BITS, and none of them have more than
> 40 here, presumably the same is true for dma address space.

I wouldn't like to make that assumption.  I've come across IOMMUs (maybe
on parisc?  powerpc?) that like to encode fun information in the top
few bits.  So we could get it down to 52 bits, but I don't think we can
get all the way down to 32 bits.  Also, we need to keep the bottom bit
clear for PageTail, so that further constrains us.

Anyway, I like the "two unsigned longs" approach I posted yesterday,
but thanks for the suggestion.
Arnd Bergmann April 17, 2021, 5:30 p.m. UTC | #25
On Sat, Apr 17, 2021 at 3:58 PM Matthew Wilcox <willy@infradead.org> wrote:
> I wouldn't like to make that assumption.  I've come across IOMMUs (maybe
> on parisc?  powerpc?) that like to encode fun information in the top
> few bits.  So we could get it down to 52 bits, but I don't think we can
> get all the way down to 32 bits.  Also, we need to keep the bottom bit
> clear for PageTail, so that further constrains us.

I'd be surprised to find such an IOMMU on a 32-bit machine, given that
the main reason for using an IOMMU on these is to avoid the 32-bit
address limit in DMA masters.

I see that parisc32 does not enable 64-bit dma_addr_t, while powerpc32
does not support any IOMMU, so it wouldn't be either of those two.

I do remember some powerpc systems that encode additional flags
(transaction ordering, caching, ...) into the high bits of the physical
address in the IOTLB, but not the virtual address used for looking
them up.

> Anyway, I like the "two unsigned longs" approach I posted yesterday,
> but thanks for the suggestion.

Ok, fair enough. As long as there are enough bits in this branch of
'struct page', I suppose it is the safe choice.

        Arnd
Christoph Hellwig April 19, 2021, 6:34 a.m. UTC | #26
On Fri, Apr 16, 2021 at 04:27:55PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > See below patch.  Where I swap32 the dma address to satisfy
> > page->compound having bit zero cleared. (It is the simplest fix I could
> > come up with).
> 
> I think this is slightly simpler, and as a bonus code that assumes the
> old layout won't compile.

So, why do we even do this crappy overlay of a dma address?  This just
all seems like a giant hack.  Random subsystems should not just steal
a few struct page fields as that just turns into the desasters like the
one we've seen here or probably something worse next time.
Ilias Apalodimas April 19, 2021, 7:15 a.m. UTC | #27
Hi Christoph,

On Mon, Apr 19, 2021 at 08:34:41AM +0200, Christoph Hellwig wrote:
> On Fri, Apr 16, 2021 at 04:27:55PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > > See below patch.  Where I swap32 the dma address to satisfy
> > > page->compound having bit zero cleared. (It is the simplest fix I could
> > > come up with).
> > 
> > I think this is slightly simpler, and as a bonus code that assumes the
> > old layout won't compile.
> 
> So, why do we even do this crappy overlay of a dma address?  This just
> all seems like a giant hack.  Random subsystems should not just steal
> a few struct page fields as that just turns into the desasters like the
> one we've seen here or probably something worse next time.

The page pool API was using page->private in the past to store these kind of
info. That caused a problem to begin with, since it would fail  on 32-bit
systems with 64bit DMA.  We had a similar discussion on the past but decided
struct page is the right place to store that [1].

Another advantage is that we can now use the information from the networking 
subsystem and enable recycling of SKBs and SKB fragments, by using the stored 
metadata of struct page [2].

[1] https://lore.kernel.org/netdev/20190207.132519.1698007650891404763.davem@davemloft.net/
[2] https://lore.kernel.org/netdev/20210409223801.104657-1-mcroce@linux.microsoft.com/

Cheers
/Ilias
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..45c563e9b50e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -68,16 +68,22 @@  struct mem_cgroup;
 #endif
 
 struct page {
-	unsigned long flags;		/* Atomic flags, some possibly
-					 * updated asynchronously */
 	/*
-	 * Five words (20/40 bytes) are available in this union.
-	 * WARNING: bit 0 of the first word is used for PageTail(). That
-	 * means the other users of this union MUST NOT use the bit to
+	 * This union is six words (24 / 48 bytes) in size.
+	 * The first word is reserved for atomic flags, often updated
+	 * asynchronously.  Use the PageFoo() macros to access it.  Some
+	 * of the flags can be reused for your own purposes, but the
+	 * word as a whole often contains other information and overwriting
+	 * it will cause functions like page_zone() and page_node() to stop
+	 * working correctly.
+	 *
+	 * Bit 0 of the second word is used for PageTail(). That
+	 * means the other users of this union MUST leave the bit zero to
 	 * avoid collision and false-positive PageTail().
 	 */
 	union {
 		struct {	/* Page cache and anonymous pages */
+			unsigned long flags;
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
 			 * lruvec->lru_lock.  Sometimes used as a generic list
@@ -96,13 +102,14 @@  struct page {
 			unsigned long private;
 		};
 		struct {	/* page_pool used by netstack */
-			/**
-			 * @dma_addr: might require a 64-bit value even on
-			 * 32-bit architectures.
-			 */
-			dma_addr_t dma_addr;
+			unsigned long _pp_flags;
+			unsigned long pp_magic;
+			unsigned long xmi;
+			unsigned long _pp_mapping_pad;
+			dma_addr_t dma_addr;	/* might be one or two words */
 		};
 		struct {	/* slab, slob and slub */
+			unsigned long _slab_flags;
 			union {
 				struct list_head slab_list;
 				struct {	/* Partial pages */
@@ -130,6 +137,7 @@  struct page {
 			};
 		};
 		struct {	/* Tail pages of compound page */
+			unsigned long _t1_flags;
 			unsigned long compound_head;	/* Bit zero is set */
 
 			/* First tail page only */
@@ -139,12 +147,14 @@  struct page {
 			unsigned int compound_nr; /* 1 << compound_order */
 		};
 		struct {	/* Second tail page of compound page */
+			unsigned long _t2_flags;
 			unsigned long _compound_pad_1;	/* compound_head */
 			atomic_t hpage_pinned_refcount;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
 		struct {	/* Page table pages */
+			unsigned long _pt_flags;
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
@@ -159,6 +169,7 @@  struct page {
 #endif
 		};
 		struct {	/* ZONE_DEVICE pages */
+			unsigned long _zd_flags;
 			/** @pgmap: Points to the hosting device page map. */
 			struct dev_pagemap *pgmap;
 			void *zone_device_data;
@@ -174,8 +185,11 @@  struct page {
 			 */
 		};
 
-		/** @rcu_head: You can use this to free a page by RCU. */
-		struct rcu_head rcu_head;
+		struct {
+			unsigned long _rcu_flags;
+			/** @rcu_head: You can use this to free a page by RCU. */
+			struct rcu_head rcu_head;
+		};
 	};
 
 	union {		/* This union is 4 bytes in size. */