
mm: Decline to manipulate the refcount on a slab page

Message ID 20250310143544.1216127-1-willy@infradead.org
State New
Series mm: Decline to manipulate the refcount on a slab page

Commit Message

Matthew Wilcox March 10, 2025, 2:35 p.m. UTC
Slab pages now have a refcount of 0, so nobody should be trying to
manipulate the refcount on them.  Doing so has little effect; the object
could be freed and reallocated to a different purpose, although the slab
itself would not be freed until the refcount was put, making it behave rather
like TYPESAFE_BY_RCU.

Unfortunately, __iov_iter_get_pages_alloc() does take a refcount.
Fix that to not change the refcount, and make put_page() silently not
change the refcount.  get_page() warns so that we can fix any other
callers that need to be changed.

Long-term, networking needs to stop taking a refcount on the pages that
it uses and rely on the caller to hold whatever references are necessary
to make the memory stable.  In the medium term, more page types are going
to have a zero refcount, so we'll want to move get_page() and put_page()
out of line.

Reported-by: Hannes Reinecke <hare@suse.de>
Fixes: 9aec2fb0fd5e ("slab: allocate frozen pages")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h | 7 ++++++-
 lib/iov_iter.c     | 8 ++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

Comments

Vlastimil Babka March 10, 2025, 2:37 p.m. UTC | #1
On 3/10/25 15:35, Matthew Wilcox (Oracle) wrote:
> Slab pages now have a refcount of 0, so nobody should be trying to
> manipulate the refcount on them.  Doing so has little effect; the object
> could be freed and reallocated to a different purpose, although the slab
> itself would not be freed until the refcount was put, making it behave rather
> like TYPESAFE_BY_RCU.
> 
> Unfortunately, __iov_iter_get_pages_alloc() does take a refcount.
> Fix that to not change the refcount, and make put_page() silently not
> change the refcount.  get_page() warns so that we can fix any other
> callers that need to be changed.
> 
> Long-term, networking needs to stop taking a refcount on the pages that
> it uses and rely on the caller to hold whatever references are necessary
> to make the memory stable.  In the medium term, more page types are going
> to have a zero refcount, so we'll want to move get_page() and put_page()
> out of line.
> 
> Reported-by: Hannes Reinecke <hare@suse.de>

Closes:
https://lore.kernel.org/all/08c29e4b-2f71-4b6d-8046-27e407214d8c@suse.com/

> Fixes: 9aec2fb0fd5e ("slab: allocate frozen pages")
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Note it's a 6.14 hotfix for kernel oopses due to page refcount overflow.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/mm.h | 7 ++++++-
>  lib/iov_iter.c     | 8 ++++++--
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 61de65c4e430..4e118cbe0556 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1539,7 +1539,10 @@ static inline void folio_get(struct folio *folio)
>  
>  static inline void get_page(struct page *page)
>  {
> -	folio_get(page_folio(page));
> +	struct folio *folio = page_folio(page);
> +	if (WARN_ON_ONCE(folio_test_slab(folio)))
> +		return;
> +	folio_get(folio);
>  }
>  
>  static inline __must_check bool try_get_page(struct page *page)
> @@ -1633,6 +1636,8 @@ static inline void put_page(struct page *page)
>  {
>  	struct folio *folio = page_folio(page);
>  
> +	if (folio_test_slab(folio))
> +		return;
>  	folio_put(folio);
>  }
>  
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 65f550cb5081..8c7fdb7d8c8f 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1190,8 +1190,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		if (!n)
>  			return -ENOMEM;
>  		p = *pages;
> -		for (int k = 0; k < n; k++)
> -			get_page(p[k] = page + k);
> +		for (int k = 0; k < n; k++) {
> +			struct folio *folio = page_folio(page);
> +			p[k] = page + k;
> +			if (!folio_test_slab(folio))
> +				folio_get(folio);
> +		}
>  		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
>  		i->count -= maxsize;
>  		i->iov_offset += maxsize;
Matthew Wilcox March 10, 2025, 2:50 p.m. UTC | #2
On Mon, Mar 10, 2025 at 03:37:51PM +0100, Vlastimil Babka wrote:
> Note it's a 6.14 hotfix for kernel oopses due to page refcount overflow.

Not actually overflow ... without CONFIG_DEBUG_VM enabled, networking increases
the refcount from 0 to 1, then decrements it from 1 to 0, causing the
slab to be freed.  So it's a UAF bug induced by a messed-up refcount.
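
Concretely, the broken sequence is something like this (a hypothetical
sketch of the pre-fix behaviour, not code taken from the reported traces):

#include <linux/mm.h>
#include <linux/slab.h>

/* Illustration only: a frozen slab page being freed out from under slab. */
static void slab_refcount_uaf(void)
{
	void *obj = kmalloc(64, GFP_KERNEL);
	struct page *page = virt_to_page(obj);	/* slab page: refcount 0 */

	get_page(page);		/* networking "takes a reference": 0 -> 1 */
	put_page(page);		/* and drops it: 1 -> 0, the page is freed */

	kfree(obj);		/* use-after-free; the slab was never told */
}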
Jakub Kicinski March 11, 2025, 10:15 a.m. UTC | #3
On Mon, 10 Mar 2025 14:35:24 +0000 Matthew Wilcox (Oracle) wrote:
> Long-term, networking needs to stop taking a refcount on the pages that
> it uses and rely on the caller to hold whatever references are necessary
> to make the memory stable. 

TBH I'm not clear on who is going to fix this.
IIRC we already told NVMe people that sending slab memory over sendpage 
is not well supported. Plus the bug is in BPF integration, judging by
the stack traces (skmsg is a BPF thing). Joy.
Hannes Reinecke March 11, 2025, 3:46 p.m. UTC | #4
On 3/11/25 11:15, Jakub Kicinski wrote:
> On Mon, 10 Mar 2025 14:35:24 +0000 Matthew Wilcox (Oracle) wrote:
>> Long-term, networking needs to stop taking a refcount on the pages that
>> it uses and rely on the caller to hold whatever references are necessary
>> to make the memory stable.
> 
> TBH I'm not clear on who is going to fix this.
> IIRC we already told NVMe people that sending slab memory over sendpage
> is not well supported. Plus the bug is in BPF integration, judging by
> the stack traces (skmsg is a BPF thing). Joy.

Hmm. Did you? I seem to have missed that.
We make sure not to do it, via the 'sendpage_ok()' call; but other than
that there's not much we can do.
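
For reference, sendpage_ok() (in include/linux/net.h) is roughly:

static inline bool sendpage_ok(struct page *page)
{
	return !PageSlab(page) && page_count(page) >= 1;
}

so any caller that consults it already refuses slab-backed pages; the
problem is the paths that never ask.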

And BPF is probably not the culprit; the issue here is that we have a kvec,
package it into a bio (where it gets converted into a bvec),
and then call an iov iterator in tls_sw to get to the pages.
But at that stage we only see the bvec iterator, and the information
that it was a kvec to start with has been lost.
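
Schematically (hypothetical code, but this is the shape of the path):

#include <linux/bio.h>
#include <linux/mm.h>

/* A kmalloc'ed (slab-backed) buffer is attached to a bio; from here on
 * every consumer only ever sees a bvec. */
static void kvec_becomes_bvec(struct bio *bio, void *buf, unsigned int len)
{
	bio_add_page(bio, virt_to_page(buf), len, offset_in_page(buf));

	/*
	 * When tls_sw later walks this as an ITER_BVEC iterator, nothing
	 * in struct bio_vec records that the page backs a kvec/slab
	 * allocation, so __iov_iter_get_pages_alloc() can only tell by
	 * checking folio_test_slab(), as the patch does.
	 */
}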

None of this would be so bad if we didn't call get_page()/put_page() (the
caller holds the reference, after all), but the iov iterators and the
skmsg code insist upon it.

Cheers,

Hannes
Matthew Wilcox March 11, 2025, 4:59 p.m. UTC | #5
On Tue, Mar 11, 2025 at 04:46:45PM +0100, Hannes Reinecke wrote:
> On 3/11/25 11:15, Jakub Kicinski wrote:
> > On Mon, 10 Mar 2025 14:35:24 +0000 Matthew Wilcox (Oracle) wrote:
> > > Long-term, networking needs to stop taking a refcount on the pages that
> > > it uses and rely on the caller to hold whatever references are necessary
> > > to make the memory stable.
> > 
> > TBH I'm not clear on who is going to fix this.
> > IIRC we already told NVMe people that sending slab memory over sendpage
> > is not well supported. Plus the bug is in BPF integration, judging by
> > the stack traces (skmsg is a BPF thing). Joy.
> 
> Hmm. Did you? I seem to have missed that.
> We make sure not to do it, via the 'sendpage_ok()' call; but other than
> that there's not much we can do.
> 
> And BPF is probably not the culprit; the issue here is that we have a kvec,
> package it into a bio (where it gets converted into a bvec),
> and then call an iov iterator in tls_sw to get to the pages.
> But at that stage we only see the bvec iterator, and the information
> that it was a kvec to start with has been lost.

So I have two questions:

Hannes:
 - Why does nvme need to turn the kvec into a bio rather than just
   send it directly?
Jakub:
 - Why does the socket code think it needs to get a refcount on a bvec
   at all, since the block layer doesn't?

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61de65c4e430..4e118cbe0556 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1539,7 +1539,10 @@ static inline void folio_get(struct folio *folio)
 
 static inline void get_page(struct page *page)
 {
-	folio_get(page_folio(page));
+	struct folio *folio = page_folio(page);
+	if (WARN_ON_ONCE(folio_test_slab(folio)))
+		return;
+	folio_get(folio);
 }
 
 static inline __must_check bool try_get_page(struct page *page)
@@ -1633,6 +1636,8 @@ static inline void put_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
 
+	if (folio_test_slab(folio))
+		return;
 	folio_put(folio);
 }
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 65f550cb5081..8c7fdb7d8c8f 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1190,8 +1190,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		if (!n)
 			return -ENOMEM;
 		p = *pages;
-		for (int k = 0; k < n; k++)
-			get_page(p[k] = page + k);
+		for (int k = 0; k < n; k++) {
+			struct folio *folio = page_folio(page);
+			p[k] = page + k;
+			if (!folio_test_slab(folio))
+				folio_get(folio);
+		}
 		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
 		i->count -= maxsize;
 		i->iov_offset += maxsize;