diff mbox series

[v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY

Message ID 20210521074433.931380-1-almasrymina@google.com (mailing list archive)
State New, archived
Headers show
Series [v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY | expand

Commit Message

Mina Almasry May 21, 2021, 7:44 a.m. UTC
The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
an index for which we already have a page in the cache. When this
happens, we allocate a second page, double consuming the reservation,
and then fail to insert the page into the cache and return -EEXIST.

To fix this, we first if there exists a page in the cache which already
consumed the reservation, and return -EEXIST immediately if so.

Secondly, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller
after allocating a page that consumed a reservation. In this case there
may be a fault that double consumes the reservation. To handle this, we
free the allocated page, fix the reservations, and allocate a temporary
hugetlb page and return that to the caller. When the caller does the
copy outside of the lock, we again check the cache, and allocate a page
consuming the reservation, and copy over the contents.

Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and the copy_huge_page_from_user() always fails, then:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings. After the test runs
number of free/resv hugepages is correct.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

---
 include/linux/hugetlb.h |   4 ++
 mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
 mm/migrate.c            |  39 +++------------
 3 files changed, 103 insertions(+), 43 deletions(-)

--
2.31.1.818.g46aad6cb9e-goog

Comments

Andrew Morton May 22, 2021, 9:19 p.m. UTC | #1
On Fri, 21 May 2021 00:44:33 -0700 Mina Almasry <almasrymina@google.com> wrote:

> The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> an index for which we already have a page in the cache. When this
> happens, we allocate a second page, double consuming the reservation,
> and then fail to insert the page into the cache and return -EEXIST.
> 
> To fix this, we first if there exists a page in the cache which already

                       ^ check

> consumed the reservation, and return -EEXIST immediately if so.
> 
> Secondly, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller
> after allocating a page that consumed a reservation. In this case there
> may be a fault that double consumes the reservation. To handle this, we
> free the allocated page, fix the reservations, and allocate a temporary
> hugetlb page and return that to the caller. When the caller does the
> copy outside of the lock, we again check the cache, and allocate a page
> consuming the reservation, and copy over the contents.
> 
> Test:
> Hacked the code locally such that resv_huge_pages underflows produce
> a warning and the copy_huge_page_from_user() always fails, then:
> 
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> 
> Both tests succeed and produce no warnings. After the test runs
> number of free/resv hugepages is correct.
>
> ...
>
>  include/linux/hugetlb.h |   4 ++
>  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            |  39 +++------------
>  3 files changed, 103 insertions(+), 43 deletions(-)

I'm assuming we want this in -stable?

Are we able to identify a Fixes: for this?

It's a large change.  Can we come up with some smaller and easier to
review and integrate version which we can feed into 5.13 and -stable
and do the fancier version for 5.14?

If you don't think -stable needs this then this version will be OK as-is.
Mina Almasry May 22, 2021, 9:32 p.m. UTC | #2
On Sat, May 22, 2021 at 2:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 21 May 2021 00:44:33 -0700 Mina Almasry <almasrymina@google.com> wrote:
>
> > The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> > happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> > an index for which we already have a page in the cache. When this
> > happens, we allocate a second page, double consuming the reservation,
> > and then fail to insert the page into the cache and return -EEXIST.
> >
> > To fix this, we first if there exists a page in the cache which already
>
>                        ^ check
>
> > consumed the reservation, and return -EEXIST immediately if so.
> >
> > Secondly, if we fail to copy the page contents while holding the
> > hugetlb_fault_mutex, we will drop the mutex and return to the caller
> > after allocating a page that consumed a reservation. In this case there
> > may be a fault that double consumes the reservation. To handle this, we
> > free the allocated page, fix the reservations, and allocate a temporary
> > hugetlb page and return that to the caller. When the caller does the
> > copy outside of the lock, we again check the cache, and allocate a page
> > consuming the reservation, and copy over the contents.
> >
> > Test:
> > Hacked the code locally such that resv_huge_pages underflows produce
> > a warning and the copy_huge_page_from_user() always fails, then:
> >
> > ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> >       2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> > ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> >       2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> >
> > Both tests succeed and produce no warnings. After the test runs
> > number of free/resv hugepages is correct.
> >
> > ...
> >
> >  include/linux/hugetlb.h |   4 ++
> >  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
> >  mm/migrate.c            |  39 +++------------
> >  3 files changed, 103 insertions(+), 43 deletions(-)
>
> I'm assuming we want this in -stable?
>

Umm I'll yield to Mike. This is a transient underflow; not actually
THAT serious of an issue. Sorry, I'll clarify that in the commit
message for the next version.

> Are we able to identify a Fixes: for this?
>

No, this issue has been there latent for some time. It repros as far
back as 5.11 at least, which is why maybe it's not that serious to
require in -stable.

> It's a large change.  Can we come up with some smaller and easier to
> review and integrate version which we can feed into 5.13 and -stable
> and do the fancier version for 5.14?
>

Yes. If we only do the hugetlbfs_pagecache_present() check then that
gets us some 90% of the way there, the rest of the patch addresses an
unlikely race.

> If you don't think -stable needs this then this version will be OK as-is.
Mike Kravetz May 24, 2021, 6:07 p.m. UTC | #3
On 5/21/21 12:44 AM, Mina Almasry wrote:
> The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> an index for which we already have a page in the cache. When this
> happens, we allocate a second page, double consuming the reservation,
> and then fail to insert the page into the cache and return -EEXIST.
> 
> To fix this, we first if there exists a page in the cache which already
> consumed the reservation, and return -EEXIST immediately if so.

Thanks Mina.

Andrew brought up the possibility of two separate fixes:
1) A simple one to address the common cause of the underflow
2) A more extensive modification to address all the error path code

I think this is a good idea.  You mentioned that the underflow is
transient and will correct itself.  That is true, however code assumes
the reserve count is an unsigned value always >= the number of free
pages.  If code like the following is run while in this transitive state,
we will run into more serious issues:

	/*
	 * A child process with MAP_PRIVATE mappings created by their parent
	 * have no page reserves. This check ensures that reservations are
	 * not "stolen". The child may still get SIGKILLed
	 */
	if (!vma_has_reserves(vma, chg) &&
			h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

	/* If reserves cannot be used, ensure enough pages are in the pool */
	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

For this reason, I think the simple change sent to stable would be a
good idea.

Do note that the call to hugetlbfs_pagecache_present() which would be
used in the stable fix is not 100% correct.  See below.

> Secondly, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller
> after allocating a page that consumed a reservation. In this case there
> may be a fault that double consumes the reservation. To handle this, we
> free the allocated page, fix the reservations, and allocate a temporary
> hugetlb page and return that to the caller. When the caller does the
> copy outside of the lock, we again check the cache, and allocate a page
> consuming the reservation, and copy over the contents.
> 
> Test:
> Hacked the code locally such that resv_huge_pages underflows produce
> a warning and the copy_huge_page_from_user() always fails, then:
> 
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> 
> Both tests succeed and produce no warnings. After the test runs
> number of free/resv hugepages is correct.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> 
> ---
>  include/linux/hugetlb.h |   4 ++
>  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            |  39 +++------------
>  3 files changed, 103 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b92f25ccef58..427974510965 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -194,6 +194,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  bool is_hugetlb_entry_migration(pte_t pte);
>  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
> 
> +void hugetlb_copy_page(struct page *dst, struct page *src);
> +
>  #else /* !CONFIG_HUGETLB_PAGE */
> 
>  static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> @@ -379,6 +381,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
> 
>  static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
> 
> +static inline void hugetlb_copy_page(struct page *dst, struct page *src);
> +
>  #endif /* !CONFIG_HUGETLB_PAGE */
>  /*
>   * hugepages at page global directory. If arch support

How about just making the existing routine copy_huge_page() callable from
outside migrate.c and avoid all this code movement?  If we want to do
any code movement, I would suggest moving copy_huge_page and routines it
depends on to mm/huge_memory.c.  That seems like a more appropriate
location.  But, I am fine with leaving them in migrate.c

> index 629aa4c2259c..cb041c97a558 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
< snip code dealing with movement of copy routines >
> @@ -4868,19 +4907,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			    struct page **pagep)
>  {
>  	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> -	struct address_space *mapping;
> -	pgoff_t idx;
> +	struct hstate *h = hstate_vma(dst_vma);
> +	struct address_space *mapping = dst_vma->vm_file->f_mapping;
> +	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
>  	unsigned long size;
>  	int vm_shared = dst_vma->vm_flags & VM_SHARED;
> -	struct hstate *h = hstate_vma(dst_vma);
>  	pte_t _dst_pte;
>  	spinlock_t *ptl;
> -	int ret;
> +	int ret = -ENOMEM;
>  	struct page *page;
>  	int writable;
> -
> -	mapping = dst_vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> +	struct mempolicy *mpol;
> +	nodemask_t *nodemask;
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
> +	int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, &nodemask);
> 
>  	if (is_continue) {
>  		ret = -EFAULT;
> @@ -4888,7 +4928,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		if (!page)
>  			goto out;
>  	} else if (!*pagep) {
> -		ret = -ENOMEM;
> +		/* If a page already exists, then it's UFFDIO_COPY for
> +		 * a non-missing case. Return -EEXIST.
> +		 */
> +		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

We only want to call hugetlbfs_pagecache_present() if vm_shared.
Something like:

		if (vm_shared &&
		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

This could be a private mapping of a file in which case we do not care
about the contents of the cache/file.  In fact, it could provide a false
positive.

> +			ret = -EEXIST;
> +			goto out;
> +		}
> +
>  		page = alloc_huge_page(dst_vma, dst_addr, 0);
>  		if (IS_ERR(page))
>  			goto out;
> @@ -4900,12 +4947,48 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		/* fallback to copy_from_user outside mmap_lock */
>  		if (unlikely(ret)) {
>  			ret = -ENOENT;
> +			/* Free the allocated page which may have
> +			 * consumed a reservation.
> +			 */
> +			restore_reserve_on_error(h, dst_vma, dst_addr, page);

Good!

> +			if (!HPageRestoreReserve(page)) {
> +				if (unlikely(hugetlb_unreserve_pages(
> +					    mapping->host, idx, idx + 1, 1)))
> +					hugetlb_fix_reserve_counts(
> +						mapping->host);
> +			}

I do not understand the need to call hugetlb_unreserve_pages().  The
call to restore_reserve_on_error 'should' fix up the reserve map to
align with restoring the reserve count in put_page/free_huge_page.
Can you explain why that is there?

> +			put_page(page);
> +
> +			/* Allocate a temporary page to hold the copied
> +			 * contents.
> +			 */
> +			page = alloc_migrate_huge_page(h, gfp_mask, node,
> +						       nodemask);

I would suggest calling alloc_huge_page_vma to allocate the page here.
alloc_huge_page_vma will allocate a huge page from the pool if one is
available, and fall back to the buddy allocator.  This is the way the
migration code works.

alloc_huge_page_vma has the advantage that it 'can' work for gigantic
pages if some are available in the pool.  Note that alloc_migrate_huge_page
immediately checks for and fails if a gigantic page is needed.

If we use alloc_migrate_huge_page, the userfaultfd selftest you are
using will fail when tested with 'hacked code' to ALWAYS take this
error path.  That is because we will need two huge pages (one
temporary) and only one is available.  I believe this is OK.  As
mentioned, this is how the migration code works.

> +			if (IS_ERR(page)) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
>  			*pagep = page;
> -			/* don't free the page */
> +			/* Set the outparam pagep and return to the caller to
> +			 * copy the contents outside the lock. Don't free the
> +			 * page.
> +			 */
>  			goto out;
>  		}
>  	} else {
> -		page = *pagep;
> +		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

Again, check for vm_shared before calling hugetlbfs_pagecache_present.

> +			put_page(*pagep);
> +			ret = -EEXIST;
> +			goto out;

We need to set *pagep = NULL before returning.  The calling routine
__mcopy_atomic_hugetlb will BUG() if we return with an error code other
than -ENOENT and *pagep not NULL.

> +		}
> +
> +		page = alloc_huge_page(dst_vma, dst_addr, 0);
> +		if (IS_ERR(page)) {
> +			ret = -ENOMEM;

Again, clear *pagep before returning.

> +			goto out;
> +		}
> +		__copy_gigantic_page(page, *pagep, pages_per_huge_page(h));

See previous discussion about copy routine.

> +		put_page(*pagep);
>  		*pagep = NULL;
>  	}

Later in hugetlb_mcopy_atomic_pte, we will update the page cache and
page table.  There are error checks when performing these operations.
It is unlikely we will ever hit one of these errors.  However, if we
do encounter an error, the code should call restore_reserve_on_error
before the put_page.  Perhaps something like:

@@ -4996,6 +4996,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (vm_shared || is_continue)
 		unlock_page(page);
 out_release_nounlock:
+	restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }


> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6b37d00890ca..d3437f9a608d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
< snip code dealing with movement of copy routines >

With changes like those above in hugetlb_mcopy_atomic_pte, we should be
able to remove the following code in __mcopy_atomic_hugetlb.  We can do
this because any page returned/passed to __mcopy_atomic_hugetlb will NOT
have consumed a reserve.  It would be a nice cleanup as that big comment
explains how we currently guess what is the right thing to do in this
situation.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 63a73e164d55..e13a0492b7ba 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -346,54 +346,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 out_unlock:
 	mmap_read_unlock(dst_mm);
 out:
-	if (page) {
-		/*
-		 * We encountered an error and are about to free a newly
-		 * allocated huge page.
-		 *
-		 * Reservation handling is very subtle, and is different for
-		 * private and shared mappings.  See the routine
-		 * restore_reserve_on_error for details.  Unfortunately, we
-		 * can not call restore_reserve_on_error now as it would
-		 * require holding mmap_lock.
-		 *
-		 * If a reservation for the page existed in the reservation
-		 * map of a private mapping, the map was modified to indicate
-		 * the reservation was consumed when the page was allocated.
-		 * We clear the HPageRestoreReserve flag now so that the global
-		 * reserve count will not be incremented in free_huge_page.
-		 * The reservation map will still indicate the reservation
-		 * was consumed and possibly prevent later page allocation.
-		 * This is better than leaking a global reservation.  If no
-		 * reservation existed, it is still safe to clear
-		 * HPageRestoreReserve as no adjustments to reservation counts
-		 * were made during allocation.
-		 *
-		 * The reservation map for shared mappings indicates which
-		 * pages have reservations.  When a huge page is allocated
-		 * for an address with a reservation, no change is made to
-		 * the reserve map.  In this case HPageRestoreReserve will be
-		 * set to indicate that the global reservation count should be
-		 * incremented when the page is freed.  This is the desired
-		 * behavior.  However, when a huge page is allocated for an
-		 * address without a reservation a reservation entry is added
-		 * to the reservation map, and HPageRestoreReserve will not be
-		 * set. When the page is freed, the global reserve count will
-		 * NOT be incremented and it will appear as though we have
-		 * leaked reserved page.  In this case, set HPageRestoreReserve
-		 * so that the global reserve count will be incremented to
-		 * match the reservation map entry which was created.
-		 *
-		 * Note that vm_alloc_shared is based on the flags of the vma
-		 * for which the page was originally allocated.  dst_vma could
-		 * be different or NULL on error.
-		 */
-		if (vm_alloc_shared)
-			SetHPageRestoreReserve(page);
-		else
-			ClearHPageRestoreReserve(page);
+	if (page)
 		put_page(page);
-	}
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);
Mina Almasry May 25, 2021, 12:11 a.m. UTC | #4
> > +                     if (!HPageRestoreReserve(page)) {
> > +                             if (unlikely(hugetlb_unreserve_pages(
> > +                                         mapping->host, idx, idx + 1, 1)))
> > +                                     hugetlb_fix_reserve_counts(
> > +                                             mapping->host);
> > +                     }
>
> I do not understand the need to call hugetlb_unreserve_pages().  The
> call to restore_reserve_on_error 'should' fix up the reserve map to
> align with restoring the reserve count in put_page/free_huge_page.
> Can you explain why that is there?
>

AFAICT here is what happens for a given index *without* the call to
hugetlb_unreserve_pages():

1. hugetlb_no_page() allocates a page consuming the reservation,
resv_huge_pages decrements.
2. remove_inode_hugepages() does remove_huge_page() and
hugetlb_unreserve_pages(). This removes the entry from the resv_map,
but does NOT increment back the resv_huge_pages. Because we removed
the entry, it looks like we have no reservation for this index.
free_huge_page() gets called on this page, and resv_huge_pages is not
incremented, I'm not sure why. This page should have come from the
reserves.
3. hugetlb_mcopy_pte_atomic() gets called for this index. Because of
the prior call to hugetlb_unreserve_page(), there is no entry in the
resv_map for this index, which means it looks like we don't have a
reservation for this index. We allocate a page outside the reserves
(deferred_reservation=1, HPageRestoreReserve=0), add an entry into
resv_map, and don't modify resv_huge_pages.
4. The copy fails and we deallocate the page, since
HPageRestoreReserve==0 for this page, restore_reserve_on_error() does
nothing.
5. hugetlb_mcopy_pte_atomic() gets recalled with the temporary page,
and we allocate another page. Now, since we added an entry in the
resv_map in the previous allocation, it looks like we have a
reservation for this allocation. We allocate a page with
deferred_reserve=0 && HPageRestoreReserve=1, we decrement
resv_huge_pages. Boom, we decremented resv_huge_pages twice for this
index, never incremented it.

To fix this, in step 4, when I deallocate a page, I check
HPageRestoreReserve(page). If HPageRestoreReserve=0, then this
reservation was consumed and deallocated before, and so I need to
remove the entry from the resv_map.
Mike Kravetz May 25, 2021, 12:45 a.m. UTC | #5
On 5/24/21 5:11 PM, Mina Almasry wrote:
>>> +                     if (!HPageRestoreReserve(page)) {
>>> +                             if (unlikely(hugetlb_unreserve_pages(
>>> +                                         mapping->host, idx, idx + 1, 1)))
>>> +                                     hugetlb_fix_reserve_counts(
>>> +                                             mapping->host);
>>> +                     }
>>
>> I do not understand the need to call hugetlb_unreserve_pages().  The
>> call to restore_reserve_on_error 'should' fix up the reserve map to
>> align with restoring the reserve count in put_page/free_huge_page.
>> Can you explain why that is there?
>>
> 
> AFAICT here is what happens for a given index *without* the call to
> hugetlb_unreserve_pages():
> 
> 1. hugetlb_no_page() allocates a page consuming the reservation,
> resv_huge_pages decrements.

Ok

> 2. remove_inode_hugepages() does remove_huge_page() and
> hugetlb_unreserve_pages(). This removes the entry from the resv_map,
> but does NOT increment back the resv_huge_pages. Because we removed
> the entry, it looks like we have no reservation for this index.
> free_huge_page() gets called on this page, and resv_huge_pages is not
> incremented, I'm not sure why. This page should have come from the
> reserves.

This corresponds to a 'hole punch' operation.  When hugetlbfs hole punch
was added, a design decision was made to not try to restore reservations.
IIRC, this is because the hole punch is associated with the file and
reservations are associated with mappings.  Trying to 'go back' to
associated mappings and determine if a reservation should be restored
would be a difficult exercise.

> 3. hugetlb_mcopy_pte_atomic() gets called for this index. Because of
> the prior call to hugetlb_unreserve_page(), there is no entry in the
> resv_map for this index, which means it looks like we don't have a
> reservation for this index. We allocate a page outside the reserves
> (deferred_reservation=1, HPageRestoreReserve=0), add an entry into
> resv_map, and don't modify resv_huge_pages.

Ok

> 4. The copy fails and we deallocate the page, since
> HPageRestoreReserve==0 for this page, restore_reserve_on_error() does
> nothing.

Yes.  And this may be something we want to fix in the more general case.
i.e. I believe this can happen in code paths other than
hugetlb_mcopy_pte_atomic.  Rather than add special code here, I'll look
into updating restore_reserve_on_error to handle this.  Currently
restore_reserve_on_error only handles the case where HPageRestoreReserve
flags is set.

Thanks for explaining this!  I forgot about this 'special case' where
there is not an entry in the reserve map for a shared mapping.
diff mbox series

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b92f25ccef58..427974510965 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -194,6 +194,8 @@  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 bool is_hugetlb_entry_migration(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);

+void hugetlb_copy_page(struct page *dst, struct page *src);
+
 #else /* !CONFIG_HUGETLB_PAGE */

 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
@@ -379,6 +381,8 @@  static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,

 static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }

+static inline void hugetlb_copy_page(struct page *dst, struct page *src);
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 /*
  * hugepages at page global directory. If arch support
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..cb041c97a558 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -81,6 +81,45 @@  struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);

+/*
+ * Gigantic pages are so large that we do not guarantee that page++ pointer
+ * arithmetic will work across the entire page.  We need something more
+ * specialized.
+ */
+static void __copy_gigantic_page(struct page *dst, struct page *src,
+				 int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages;) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+void hugetlb_copy_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+	int nr_pages = pages_per_huge_page(h);
+
+	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+		__copy_gigantic_page(dst, src, nr_pages);
+		return;
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		copy_highpage(dst + i, src + i);
+	}
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -4868,19 +4907,20 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-
-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	gfp_t gfp_mask = htlb_alloc_mask(h);
+	int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, &nodemask);

 	if (is_continue) {
 		ret = -EFAULT;
@@ -4888,7 +4928,14 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		if (!page)
 			goto out;
 	} else if (!*pagep) {
-		ret = -ENOMEM;
+		/* If a page already exists, then it's UFFDIO_COPY for
+		 * a non-missing case. Return -EEXIST.
+		 */
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			ret = -EEXIST;
+			goto out;
+		}
+
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
 		if (IS_ERR(page))
 			goto out;
@@ -4900,12 +4947,48 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
 			ret = -ENOENT;
+			/* Free the allocated page which may have
+			 * consumed a reservation.
+			 */
+			restore_reserve_on_error(h, dst_vma, dst_addr, page);
+			if (!HPageRestoreReserve(page)) {
+				if (unlikely(hugetlb_unreserve_pages(
+					    mapping->host, idx, idx + 1, 1)))
+					hugetlb_fix_reserve_counts(
+						mapping->host);
+			}
+			put_page(page);
+
+			/* Allocate a temporary page to hold the copied
+			 * contents.
+			 */
+			page = alloc_migrate_huge_page(h, gfp_mask, node,
+						       nodemask);
+			if (IS_ERR(page)) {
+				ret = -ENOMEM;
+				goto out;
+			}
 			*pagep = page;
-			/* don't free the page */
+			/* Set the outparam pagep and return to the caller to
+			 * copy the contents outside the lock. Don't free the
+			 * page.
+			 */
 			goto out;
 		}
 	} else {
-		page = *pagep;
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			put_page(*pagep);
+			ret = -EEXIST;
+			goto out;
+		}
+
+		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		if (IS_ERR(page)) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		__copy_gigantic_page(page, *pagep, pages_per_huge_page(h));
+		put_page(*pagep);
 		*pagep = NULL;
 	}

diff --git a/mm/migrate.c b/mm/migrate.c
index 6b37d00890ca..d3437f9a608d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -528,28 +528,6 @@  int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return MIGRATEPAGE_SUCCESS;
 }

-/*
- * Gigantic pages are so large that we do not guarantee that page++ pointer
- * arithmetic will work across the entire page.  We need something more
- * specialized.
- */
-static void __copy_gigantic_page(struct page *dst, struct page *src,
-				int nr_pages)
-{
-	int i;
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-
-	for (i = 0; i < nr_pages; ) {
-		cond_resched();
-		copy_highpage(dst, src);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-
 static void copy_huge_page(struct page *dst, struct page *src)
 {
 	int i;
@@ -557,19 +535,14 @@  static void copy_huge_page(struct page *dst, struct page *src)

 	if (PageHuge(src)) {
 		/* hugetlbfs page */
-		struct hstate *h = page_hstate(src);
-		nr_pages = pages_per_huge_page(h);
-
-		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
-			__copy_gigantic_page(dst, src, nr_pages);
-			return;
-		}
-	} else {
-		/* thp page */
-		BUG_ON(!PageTransHuge(src));
-		nr_pages = thp_nr_pages(src);
+		hugetlb_copy_page(dst, src);
+		return;
 	}

+	/* thp page */
+	BUG_ON(!PageTransHuge(src));
+	nr_pages = thp_nr_pages(src);
+
 	for (i = 0; i < nr_pages; i++) {
 		cond_resched();
 		copy_highpage(dst + i, src + i);