diff mbox series

[resend] mm/gup: fix try_grab_compound_head() race with split_huge_page()

Message ID 20210611161545.998858-1-jannh@google.com (mailing list archive)
State New, archived

Commit Message

Jann Horn June 11, 2021, 4:15 p.m. UTC
try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent
freeing of page tables (via local_irq_save()), but not against
concurrent TLB flushes, freeing of data pages, or splitting of compound
pages.

Because no reference is held to the page when try_grab_compound_head()
is called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).

The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound
page may have been split up (either explicitly through split_huge_page()
or by freeing the compound page to the buddy allocator and then
allocating its individual order-0 pages).
If that happens, get_user_pages_fast() may end up returning the right
page but lifting the refcount on a now-unrelated page, leading to
use-after-free of pages.

To fix it:
- Re-check whether the pages still belong together after lifting the
  refcount on the head page.
- Move anything else that checks compound_head(page) below the refcount
  increment.
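
The "take the reference, then re-check" idea can be sketched in userspace C11.
This is only an illustration under stated assumptions: struct sketch_page,
sketch_lookup_head() and sketch_pin_and_recheck() are hypothetical names, the
atomics stand in for the kernel's page refcount helpers, and the lookup and the
pin are split into two functions purely so that a "split" can be simulated
between them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct sketch_page {
	atomic_int refcount;                 /* 0 means "free" */
	_Atomic(struct sketch_page *) head;  /* current compound head */
};

/* Add @refs unless the count is zero (a speculative reference grab). */
static bool sketch_add_unless_zero(struct sketch_page *p, int refs)
{
	int old = atomic_load(&p->refcount);

	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&p->refcount, &old, old + refs));
	return true;
}

/* Step 1: unstable lookup; nothing pins the result yet. */
static struct sketch_page *sketch_lookup_head(struct sketch_page *page)
{
	return atomic_load(&page->head);
}

/* Step 2: pin @head, then verify that @page still belongs to it. */
static struct sketch_page *sketch_pin_and_recheck(struct sketch_page *page,
						  struct sketch_page *head,
						  int refs)
{
	if (!sketch_add_unless_zero(head, refs))
		return NULL;
	if (atomic_load(&page->head) != head) {
		/* Split happened in between: give the references back. */
		atomic_fetch_sub(&head->refcount, refs);
		return NULL;
	}
	return head;
}
```

If page->head changes between the two steps, the second step returns NULL and
leaves the old head's count where it was, which is the behaviour the re-check is
meant to provide. The sketch ignores freeing the page when the count drops to
zero.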

This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
so that PV TLB flushes are used instead of IPIs).

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <jannh@google.com>
---
resending because linux-mm was down

 mm/gup.c | 54 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 15 deletions(-)


base-commit: 614124bea77e452aa6df7a8714e8bc820b489922

Comments

Andrew Morton June 11, 2021, 10:36 p.m. UTC | #1
On Fri, 11 Jun 2021 18:15:45 +0200 Jann Horn <jannh@google.com> wrote:

> try_grab_compound_head() is used to grab a reference to a page from
> get_user_pages_fast(), which is only protected against concurrent
> freeing of page tables (via local_irq_save()), but not against
> concurrent TLB flushes, freeing of data pages, or splitting of compound
> pages.
> 
> Because no reference is held to the page when try_grab_compound_head()
> is called, the page may have been freed and reallocated by the time its
> refcount has been elevated; therefore, once we're holding a stable
> reference to the page, the caller re-checks whether the PTE still points
> to the same page (with the same access rights).
> 
> The problem is that try_grab_compound_head() has to grab a reference on
> the head page; but between the time we look up what the head page is and
> the time we actually grab a reference on the head page, the compound
> page may have been split up (either explicitly through split_huge_page()
> or by freeing the compound page to the buddy allocator and then
> allocating its individual order-0 pages).
> If that happens, get_user_pages_fast() may end up returning the right
> page but lifting the refcount on a now-unrelated page, leading to
> use-after-free of pages.
> 
> To fix it:
> - Re-check whether the pages still belong together after lifting the
>   refcount on the head page.
> - Move anything else that checks compound_head(page) below the refcount
>   increment.
> 
> This can't actually happen on bare-metal x86 (because there, disabling
> IRQs locks out remote TLB flushes), but it can happen on virtualized x86
> (e.g. under KVM) and probably also on arm64. The race window is pretty
> narrow, and constantly allocating and shattering hugepages isn't exactly
> fast; for now I've only managed to reproduce this in an x86 KVM guest with
> an artificially widened timing window (by adding a loop that repeatedly
> calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
> so that PV TLB flushes are used instead of IPIs).
> 
> ...
>
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -43,8 +43,21 @@ static void hpage_pincount_sub(struct page *page, int refs)
>  
>  	atomic_sub(refs, compound_pincount_ptr(page));
>  }
>  
> +/* Equivalent to calling put_page() @refs times. */
> +static void put_page_refs(struct page *page, int refs)
> +{
> +	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);

I don't think there's a need to nuke the whole kernel in this case. 
Can we warn then simply leak the page?  That way we have a much better
chance of getting a good bug report.
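
A rough userspace sketch of the "warn, then leak" behaviour, combined with the
batched reference drop from the patch. Everything here is an assumption made
for illustration: struct toy_page, toy_put_page() and toy_put_page_refs() are
hypothetical, and fprintf() stands in for whatever warning macro the kernel
code would end up using.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct page. */
struct toy_page {
	atomic_int refcount;
	bool freed;   /* set by the (hypothetical) release path */
};

/* Drop one reference; release the page when the last one goes away. */
static void toy_put_page(struct toy_page *p)
{
	if (atomic_fetch_sub(&p->refcount, 1) == 1)
		p->freed = true;
}

/*
 * Drop @refs references. On underflow, warn and return false without
 * touching the count, i.e. deliberately leak the page.
 */
static bool toy_put_page_refs(struct toy_page *p, int refs)
{
	int cur = atomic_load(&p->refcount);

	if (cur < refs) {
		fprintf(stderr, "toy: refcount %d < %d, leaking page\n",
			cur, refs);
		return false;
	}
	/* Only the last reference needs the full put path. */
	if (refs > 1)
		atomic_fetch_sub(&p->refcount, refs - 1);
	toy_put_page(p);
	return true;
}
```

The leaked page is never released, but the system keeps running and the
warning shows up in the log.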

> +	/*
> +	 * Calling put_page() for each ref is unnecessarily slow. Only the last
> +	 * ref needs a put_page().
> +	 */
> +	if (refs > 1)
> +		page_ref_sub(page, refs - 1);
> +	put_page(page);
> +}
Jann Horn June 12, 2021, 1:49 a.m. UTC | #2
On Sat, Jun 12, 2021 at 12:36 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri, 11 Jun 2021 18:15:45 +0200 Jann Horn <jannh@google.com> wrote:
> > +/* Equivalent to calling put_page() @refs times. */
> > +static void put_page_refs(struct page *page, int refs)
> > +{
> > +     VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
>
> I don't think there's a need to nuke the whole kernel in this case.
> Can we warn then simply leak the page?  That way we have a much better
> chance of getting a good bug report.

Ah, yeah, I guess that makes sense. I had just copied this over from
put_compound_head(), and figured it was fine to keep it as-is, but I
guess changing it would be reasonable. I'm not quite sure what the
best way to do that would be though.

I guess the check should go away in !DEBUG_VM builds?

Should I just explicitly put the check in an ifdef block? Like so:

#ifdef CONFIG_DEBUG_VM
if (VM_WARN_ON_ONCE_PAGE(...))
  return;
#endif

Or, since inline ifdeffery looks ugly, get rid of the explicit ifdef,
and change the !DEBUG_VM definition of VM_WARN_ON_ONCE_PAGE() as
follows so that the branch is compiled away?

#define VM_WARN_ON_ONCE_PAGE(cond, page)  (BUILD_BUG_ON_INVALID(cond), false)

That would look kinda neat, but it would be different from the
behavior of WARN_ON(), which still returns the original condition even
in !BUG builds, so that could be confusing...
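
To make the difference between the two fallback styles concrete, here is a
small userspace sketch. The macro names are hypothetical, and the sizeof trick
is only loosely modeled on what BUILD_BUG_ON_INVALID() does; treat it as an
assumption rather than the kernel's exact definition.

```c
#include <assert.h>
#include <stdbool.h>

static int evaluations;   /* counts how often the condition actually ran */

static bool costly_check(void)
{
	evaluations++;
	return true;
}

/*
 * Style 1: the condition is only type-checked inside sizeof and never
 * evaluated; the macro always yields false, so the branch can vanish.
 */
#define SKETCH_WARN_COMPILED_OUT(cond)  ((void)sizeof(!(cond)), false)

/*
 * Style 2: WARN_ON()-like fallback that prints nothing but still
 * evaluates the condition and returns it.
 */
#define SKETCH_WARN_RETURNS_COND(cond)  (!!(cond))

/* Returns a bitmask of which branches were taken. */
static int sketch_demo(void)
{
	int taken = 0;

	if (SKETCH_WARN_COMPILED_OUT(costly_check()))
		taken |= 1;
	if (SKETCH_WARN_RETURNS_COND(costly_check()))
		taken |= 2;
	return taken;
}
```

With style 1 the call to costly_check() never runs and the branch is dead code;
with style 2 the call runs and the caller's error path is taken.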

> > +     /*
> > +      * Calling put_page() for each ref is unnecessarily slow. Only the last
> > +      * ref needs a put_page().
> > +      */
> > +     if (refs > 1)
> > +             page_ref_sub(page, refs - 1);
> > +     put_page(page);
> > +}
>
John Hubbard June 12, 2021, 10:17 a.m. UTC | #3
On 6/11/21 3:49 PM, Jann Horn wrote:
> On Sat, Jun 12, 2021 at 12:36 AM Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Fri, 11 Jun 2021 18:15:45 +0200 Jann Horn <jannh@google.com> wrote:
>>> +/* Equivalent to calling put_page() @refs times. */
>>> +static void put_page_refs(struct page *page, int refs)
>>> +{
>>> +     VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
>>
>> I don't think there's a need to nuke the whole kernel in this case.
>> Can we warn then simply leak the page?  That way we have a much better
>> chance of getting a good bug report.
> 
> Ah, yeah, I guess that makes sense. I had just copied this over from
> put_compound_head(), and figured it was fine to keep it as-is, but I
> guess changing it would be reasonable. I'm not quite sure what the
> best way to do that would be though.
> 
> I guess the check should go away in !DEBUG_VM builds?
> 
> Should I just explicitly put the check in an ifdef block? Like so:
> 
> #ifdef CONFIG_DEBUG_VM
> if (VM_WARN_ON_ONCE_PAGE(...))
>    return;
> #endif
> 
> Or, since inline ifdeffery looks ugly, get rid of the explicit ifdef,

Agreed: VM_WARN_ON_ONCE_PAGE(), at least at the API level, seems like
the best thing to use here. However, as you point out below, it needs a
little something first.

> and change the !DEBUG_VM definition of VM_WARN_ON_ONCE_PAGE() as
> follows so that the branch is compiled away?
> 
> #define VM_WARN_ON_ONCE_PAGE(cond, page)  (BUILD_BUG_ON_INVALID(cond), false)
> 
> That would look kinda neat, but it would be different from the
> behavior of WARN_ON(), which still returns the original condition even
> in !BUG builds, so that could be confusing...
> 

The VM_WARN_ON_ONCE_PAGE() is not implemented exactly right
in the !CONFIG_DEBUG_VM case. IMHO it should follow the WARN*()
behavior, and return the original condition and keep going
in that case.

Then you could use it directly here.


thanks,
Jann Horn June 14, 2021, 4:47 a.m. UTC | #4
On Sat, Jun 12, 2021 at 12:17 PM John Hubbard <jhubbard@nvidia.com> wrote:
> On 6/11/21 3:49 PM, Jann Horn wrote:
> > On Sat, Jun 12, 2021 at 12:36 AM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> >> On Fri, 11 Jun 2021 18:15:45 +0200 Jann Horn <jannh@google.com> wrote:
> >>> +/* Equivalent to calling put_page() @refs times. */
> >>> +static void put_page_refs(struct page *page, int refs)
> >>> +{
> >>> +     VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> >>
> >> I don't think there's a need to nuke the whole kernel in this case.
> >> Can we warn then simply leak the page?  That way we have a much better
> >> chance of getting a good bug report.
> >
> > Ah, yeah, I guess that makes sense. I had just copied this over from
> > put_compound_head(), and figured it was fine to keep it as-is, but I
> > guess changing it would be reasonable. I'm not quite sure what the
> > best way to do that would be though.
> >
> > I guess the check should go away in !DEBUG_VM builds?
> >
> > Should I just explicitly put the check in an ifdef block? Like so:
> >
> > #ifdef CONFIG_DEBUG_VM
> > if (VM_WARN_ON_ONCE_PAGE(...))
> >    return;
> > #endif
> >
> > Or, since inline ifdeffery looks ugly, get rid of the explicit ifdef,
>
> Agreed: VM_WARN_ON_ONCE_PAGE(), at least at the API level, seems like
> the best thing to use here. However, as you point out below, it needs a
> little something first.
>
> > and change the !DEBUG_VM definition of VM_WARN_ON_ONCE_PAGE() as
> > follows so that the branch is compiled away?
> >
> > #define VM_WARN_ON_ONCE_PAGE(cond, page)  (BUILD_BUG_ON_INVALID(cond), false)
> >
> > That would look kinda neat, but it would be different from the
> > behavior of WARN_ON(), which still returns the original condition even
> > in !BUG builds, so that could be confusing...
> >
>
> The VM_WARN_ON_ONCE_PAGE() is not implemented exactly right
> in the !CONFIG_DEBUG_VM case. IMHO it should follow the WARN*()
> behavior, and return the original condition and keep going
> in that case.

But the point of the existing definition is that the compiler can
avoid generating code for the condition in !DEBUG_VM builds, even if
it can't prove that the condition is free of side effects, right? If
VM_WARN_ON_ONCE_PAGE() was changed as you propose, then I think that
in mem_cgroup_page_lruvec(), the compiler would have to generate code
for mem_cgroup_disabled(), which calls static_branch_likely(), which
ends up in "asm volatile" statements; so the compiler probably won't
be able to eliminate the condition.

> Then you could use it directly here.

That depends on whether the intended behavior here is to skip the check
in !DEBUG_VM builds (which was the case before) or to also perform the
check in !DEBUG_VM builds. And if DEBUG_VM is a config option because
it might have some performance impact, isn't the cost of the check
probably quite large compared to the cost of printing the warning on a
codepath that should never execute?
Kirill A. Shutemov June 14, 2021, 1:10 p.m. UTC | #5
On Fri, Jun 11, 2021 at 06:15:45PM +0200, Jann Horn wrote:
> try_grab_compound_head() is used to grab a reference to a page from
> get_user_pages_fast(), which is only protected against concurrent
> freeing of page tables (via local_irq_save()), but not against
> concurrent TLB flushes, freeing of data pages, or splitting of compound
> pages.
> 
> Because no reference is held to the page when try_grab_compound_head()
> is called, the page may have been freed and reallocated by the time its
> refcount has been elevated; therefore, once we're holding a stable
> reference to the page, the caller re-checks whether the PTE still points
> to the same page (with the same access rights).
> 
> The problem is that try_grab_compound_head() has to grab a reference on
> the head page; but between the time we look up what the head page is and
> the time we actually grab a reference on the head page, the compound
> page may have been split up (either explicitly through split_huge_page()
> or by freeing the compound page to the buddy allocator and then
> allocating its individual order-0 pages).
> If that happens, get_user_pages_fast() may end up returning the right
> page but lifting the refcount on a now-unrelated page, leading to
> use-after-free of pages.
> 
> To fix it:
> - Re-check whether the pages still belong together after lifting the
>   refcount on the head page.
> - Move anything else that checks compound_head(page) below the refcount
>   increment.
> 
> This can't actually happen on bare-metal x86 (because there, disabling
> IRQs locks out remote TLB flushes), but it can happen on virtualized x86
> (e.g. under KVM) and probably also on arm64. The race window is pretty
> narrow, and constantly allocating and shattering hugepages isn't exactly
> fast; for now I've only managed to reproduce this in an x86 KVM guest with
> an artificially widened timing window (by adding a loop that repeatedly
> calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
> so that PV TLB flushes are used instead of IPIs).
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: stable@vger.kernel.org
> Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
> Signed-off-by: Jann Horn <jannh@google.com>

Looks good to me:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
John Hubbard June 15, 2021, 12:38 a.m. UTC | #6
On 6/13/21 9:47 PM, Jann Horn wrote:
...
>> The VM_WARN_ON_ONCE_PAGE() is not implemented exactly right
>> in the !CONFIG_DEBUG_VM case. IMHO it should follow the WARN*()
>> behavior, and return the original condition and keep going
>> in that case.
> 
> But the point of the existing definition is that the compiler can
> avoid generating code for the condition in !DEBUG_VM builds, even if
> it can't prove that the condition is free of side effects, right? If
> VM_WARN_ON_ONCE_PAGE() was changed as you propose, then I think that
> in mem_cgroup_page_lruvec(), the compiler would have to generate code
> for mem_cgroup_disabled(), which calls static_branch_likely(), which
> ends up in "asm volatile" statements; so the compiler probably won't
> be able to eliminate the condition.
> 
>> Then you could use it directly here.
> 
> That depends on whether the intended behavior here is to skip the check
> in !DEBUG_VM builds (which was the case before) or to also perform the
> check in !DEBUG_VM builds. And if DEBUG_VM is a config option because
> it might have some performance impact, isn't the cost of the check
> probably quite large compared to the cost of printing the warning on a
> codepath that should never execute?
> 

That's true for these VM_WARN*() macros, but not true for the more widely
used WARN*() macros. And I was hoping to bring VM macros closer to the
WARN macros. But as you point out, pre-existing callers expect to have
zero impact in !DEBUG_VM builds, and so some caution is required.

I feel like a separate set of macros would be reasonable. Something that
has WARN*() type of behavior, and accepts a struct page (which typically
means that WARN_ON_ONCE is required, because for pages you have to limit
it to that pretty much always).
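
As a purely illustrative userspace sketch of what such a macro could look like:
it always evaluates and returns the condition, takes a page argument, and
prints at most once per call site. All names (struct mock_page,
SKETCH_WARN_ON_ONCE_PAGE(), mock_put_refs()) are hypothetical, and the explicit
per-call-site flag is an assumption made to keep the sketch in portable C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct page. */
struct mock_page {
	int refcount;
};

static int warnings_printed;   /* lets a caller observe the "once" part */

/* Print at most once per @warned flag, but always return @cond. */
static bool sketch_warn_once(bool cond, const struct mock_page *page,
			     bool *warned, const char *expr)
{
	if (cond && !*warned) {
		*warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s (page %p, refcount %d)\n",
			expr, (const void *)page, page->refcount);
	}
	return cond;
}

/* The caller supplies one static flag per call site. */
#define SKETCH_WARN_ON_ONCE_PAGE(cond, page, warned) \
	sketch_warn_once(!!(cond), (page), (warned), #cond)

/* Example user: refuse to underflow, leak the page instead. */
static bool mock_put_refs(struct mock_page *page, int refs)
{
	static bool warned;

	if (SKETCH_WARN_ON_ONCE_PAGE(page->refcount < refs, page, &warned))
		return false;
	page->refcount -= refs;
	return true;
}
```

Because the condition is returned in every configuration, the caller's error
path stays active even when the message has already been printed.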

thanks,

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 3ded6a5f26b2..1f9c0ac15073 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -43,8 +43,21 @@  static void hpage_pincount_sub(struct page *page, int refs)
 
 	atomic_sub(refs, compound_pincount_ptr(page));
 }
 
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
  */
@@ -55,8 +68,23 @@  static inline struct page *try_get_compound_head(struct page *page, int refs)
 	if (WARN_ON_ONCE(page_ref_count(head) < 0))
 		return NULL;
 	if (unlikely(!page_cache_add_speculative(head, refs)))
 		return NULL;
+
+	/*
+	 * At this point we have a stable reference to the head page; but it
+	 * could be that between the compound_head() lookup and the refcount
+	 * increment, the compound page was split, in which case we'd end up
+	 * holding a reference on a page that has nothing to do with the page
+	 * we were given anymore.
+	 * So now that the head page is stable, recheck that the pages still
+	 * belong together.
+	 */
+	if (unlikely(compound_head(page) != head)) {
+		put_page_refs(head, refs);
+		return NULL;
+	}
+
 	return head;
 }
 
 /*
@@ -94,25 +122,28 @@  __maybe_unused struct page *try_grab_compound_head(struct page *page,
 		if (unlikely((flags & FOLL_LONGTERM) &&
 			     !is_pinnable_page(page)))
 			return NULL;
 
+		/*
+		 * CAUTION: Don't use compound_head() on the page before this
+		 * point, the result won't be stable.
+		 */
+		page = try_get_compound_head(page, refs);
+		if (!page)
+			return NULL;
+
 		/*
 		 * When pinning a compound page of order > 1 (which is what
 		 * hpage_pincount_available() checks for), use an exact count to
 		 * track it, via hpage_pincount_add/_sub().
 		 *
 		 * However, be sure to *also* increment the normal page refcount
 		 * field at least once, so that the page really is pinned.
 		 */
-		if (!hpage_pincount_available(page))
-			refs *= GUP_PIN_COUNTING_BIAS;
-
-		page = try_get_compound_head(page, refs);
-		if (!page)
-			return NULL;
-
 		if (hpage_pincount_available(page))
 			hpage_pincount_add(page, refs);
+		else
+			page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
 
 		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
 				    orig_refs);
 
@@ -134,16 +165,9 @@  static void put_compound_head(struct page *page, int refs, unsigned int flags)
 		else
 			refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
-	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
-	/*
-	 * Calling put_page() for each ref is unnecessarily slow. Only the last
-	 * ref needs a put_page().
-	 */
-	if (refs > 1)
-		page_ref_sub(page, refs - 1);
-	put_page(page);
+	put_page_refs(page, refs);
 }
 
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount