Message ID: 20210521074433.931380-1-almasrymina@google.com
State:      New, archived
Series:     [v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
On Fri, 21 May 2021 00:44:33 -0700 Mina Almasry <almasrymina@google.com> wrote:

> The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> an index for which we already have a page in the cache. When this
> happens, we allocate a second page, double consuming the reservation,
> and then fail to insert the page into the cache and return -EEXIST.
>
> To fix this, we first if there exists a page in the cache which already

                        ^ check

> consumed the reservation, and return -EEXIST immediately if so.
>
> Secondly, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller
> after allocating a page that consumed a reservation. In this case there
> may be a fault that double consumes the reservation. To handle this, we
> free the allocated page, fix the reservations, and allocate a temporary
> hugetlb page and return that to the caller. When the caller does the
> copy outside of the lock, we again check the cache, and allocate a page
> consuming the reservation, and copy over the contents.
>
> Test:
> Hacked the code locally such that resv_huge_pages underflows produce
> a warning and the copy_huge_page_from_user() always fails, then:
>
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
>     2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10
>     2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
>
> Both tests succeed and produce no warnings. After the test runs
> number of free/resv hugepages is correct.
>
> ...
>
>  include/linux/hugetlb.h |   4 ++
>  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            |  39 +++------------
>  3 files changed, 103 insertions(+), 43 deletions(-)

I'm assuming we want this in -stable?

Are we able to identify a Fixes: for this?

It's a large change.  Can we come up with some smaller and easier to
review and integrate version which we can feed into 5.13 and -stable,
and do the fancier version for 5.14?

If you don't think -stable needs this then this version will be OK as-is.
On Sat, May 22, 2021 at 2:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 21 May 2021 00:44:33 -0700 Mina Almasry <almasrymina@google.com> wrote:
>
> > The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> > happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> > an index for which we already have a page in the cache. When this
> > happens, we allocate a second page, double consuming the reservation,
> > and then fail to insert the page into the cache and return -EEXIST.
> >
> > To fix this, we first if there exists a page in the cache which already
>
>                         ^ check
>
> > consumed the reservation, and return -EEXIST immediately if so.
> >
> > Secondly, if we fail to copy the page contents while holding the
> > hugetlb_fault_mutex, we will drop the mutex and return to the caller
> > after allocating a page that consumed a reservation. In this case there
> > may be a fault that double consumes the reservation. To handle this, we
> > free the allocated page, fix the reservations, and allocate a temporary
> > hugetlb page and return that to the caller. When the caller does the
> > copy outside of the lock, we again check the cache, and allocate a page
> > consuming the reservation, and copy over the contents.
> >
> > Test:
> > Hacked the code locally such that resv_huge_pages underflows produce
> > a warning and the copy_huge_page_from_user() always fails, then:
> >
> > ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> >     2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> > ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> >     2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> >
> > Both tests succeed and produce no warnings. After the test runs
> > number of free/resv hugepages is correct.
> >
> > ...
> >
> >  include/linux/hugetlb.h |   4 ++
> >  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
> >  mm/migrate.c            |  39 +++------------
> >  3 files changed, 103 insertions(+), 43 deletions(-)
>
> I'm assuming we want this in -stable?
>

Umm, I'll yield to Mike. This is a transient underflow; not actually
THAT serious of an issue. Sorry, I'll clarify that in the commit
message for the next version.

> Are we able to identify a Fixes: for this?
>

No, this issue has been latent for some time. It repros as far back as
5.11 at least, which is why maybe it's not that serious to require in
-stable.

> It's a large change.  Can we come up with some smaller and easier to
> review and integrate version which we can feed into 5.13 and -stable,
> and do the fancier version for 5.14?
>

Yes. If we only do the hugetlbfs_pagecache_present() check, that gets
us some 90% of the way there; the rest of the patch addresses an
unlikely race.

> If you don't think -stable needs this then this version will be OK as-is.
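For reference, the minimal version Mina describes would amount to little
more than an early-exit check in hugetlb_mcopy_atomic_pte(). A sketch
only, not a hunk posted in this form; it also includes the vm_shared
guard Mike asks for in his review below:

 	} else if (!*pagep) {
 		ret = -ENOMEM;
+		/*
+		 * A page already in the cache means the reservation for
+		 * this index was already consumed; allocating a second
+		 * page here would double-consume it.
+		 */
+		if (vm_shared &&
+		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			ret = -EEXIST;
+			goto out;
+		}
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
 		if (IS_ERR(page))
 			goto out;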
On 5/21/21 12:44 AM, Mina Almasry wrote:
> The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> an index for which we already have a page in the cache. When this
> happens, we allocate a second page, double consuming the reservation,
> and then fail to insert the page into the cache and return -EEXIST.
>
> To fix this, we first check if there exists a page in the cache which
> already consumed the reservation, and return -EEXIST immediately if so.

Thanks Mina.

Andrew brought up the possibility of two separate fixes:

1) A simple one to address the common cause of the underflow
2) A more extensive modification to address all the error path code

I think this is a good idea.

You mentioned that the underflow is transient and will correct itself.
That is true; however, the code assumes the reserve count is an
unsigned value that never exceeds the number of free pages. If code
like the following is run while in this transient state, we will run
into more serious issues:

	/*
	 * A child process with MAP_PRIVATE mappings created by their parent
	 * have no page reserves. This check ensures that reservations are
	 * not "stolen". The child may still get SIGKILLed
	 */
	if (!vma_has_reserves(vma, chg) &&
			h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

	/* If reserves cannot be used, ensure enough pages are in the pool */
	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

For this reason, I think the simple change sent to stable would be a
good idea. Do note that the call to hugetlbfs_pagecache_present() which
would be used in the stable fix is not 100% correct. See below.

> Secondly, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller
> after allocating a page that consumed a reservation. In this case there
> may be a fault that double consumes the reservation. To handle this, we
> free the allocated page, fix the reservations, and allocate a temporary
> hugetlb page and return that to the caller. When the caller does the
> copy outside of the lock, we again check the cache, and allocate a page
> consuming the reservation, and copy over the contents.
>
> Test:
> Hacked the code locally such that resv_huge_pages underflows produce
> a warning and the copy_huge_page_from_user() always fails, then:
>
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
>     2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10
>     2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
>
> Both tests succeed and produce no warnings. After the test runs
> number of free/resv hugepages is correct.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
>
> ---
>  include/linux/hugetlb.h |   4 ++
>  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            |  39 +++------------
>  3 files changed, 103 insertions(+), 43 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b92f25ccef58..427974510965 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -194,6 +194,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  bool is_hugetlb_entry_migration(pte_t pte);
>  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
>
> +void hugetlb_copy_page(struct page *dst, struct page *src);
> +
>  #else /* !CONFIG_HUGETLB_PAGE */
>
>  static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> @@ -379,6 +381,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
>
>  static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
>
> +static inline void hugetlb_copy_page(struct page *dst, struct page *src);
> +
>  #endif /* !CONFIG_HUGETLB_PAGE */
>  /*
>   * hugepages at page global directory. If arch support

How about just making the existing routine copy_huge_page() callable
from outside migrate.c and avoiding all this code movement?

If we want to do any code movement, I would suggest moving
copy_huge_page and the routines it depends on to mm/huge_memory.c. That
seems like a more appropriate location. But, I am fine with leaving
them in migrate.c.

> index 629aa4c2259c..cb041c97a558 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c

< snip code dealing with movement of copy routines >

> @@ -4868,19 +4907,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			    struct page **pagep)
>  {
>  	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> -	struct address_space *mapping;
> -	pgoff_t idx;
> +	struct hstate *h = hstate_vma(dst_vma);
> +	struct address_space *mapping = dst_vma->vm_file->f_mapping;
> +	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
>  	unsigned long size;
>  	int vm_shared = dst_vma->vm_flags & VM_SHARED;
> -	struct hstate *h = hstate_vma(dst_vma);
>  	pte_t _dst_pte;
>  	spinlock_t *ptl;
> -	int ret;
> +	int ret = -ENOMEM;
>  	struct page *page;
>  	int writable;
> -
> -	mapping = dst_vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> +	struct mempolicy *mpol;
> +	nodemask_t *nodemask;
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
> +	int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, &nodemask);
>
>  	if (is_continue) {
>  		ret = -EFAULT;
> @@ -4888,7 +4928,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		if (!page)
>  			goto out;
>  	} else if (!*pagep) {
> -		ret = -ENOMEM;
> +		/* If a page already exists, then it's UFFDIO_COPY for
> +		 * a non-missing case. Return -EEXIST.
> +		 */
> +		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

We only want to call hugetlbfs_pagecache_present() if vm_shared.
Something like:

		if (vm_shared &&
		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

This could be a private mapping of a file, in which case we do not care
about the contents of the cache/file. In fact, it could provide a false
positive.
> +			ret = -EEXIST;
> +			goto out;
> +		}
> +
>  		page = alloc_huge_page(dst_vma, dst_addr, 0);
>  		if (IS_ERR(page))
>  			goto out;
> @@ -4900,12 +4947,48 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		/* fallback to copy_from_user outside mmap_lock */
>  		if (unlikely(ret)) {
>  			ret = -ENOENT;
> +			/* Free the allocated page which may have
> +			 * consumed a reservation.
> +			 */
> +			restore_reserve_on_error(h, dst_vma, dst_addr, page);

Good!

> +			if (!HPageRestoreReserve(page)) {
> +				if (unlikely(hugetlb_unreserve_pages(
> +						mapping->host, idx, idx + 1, 1)))
> +					hugetlb_fix_reserve_counts(
> +						mapping->host);
> +			}

I do not understand the need to call hugetlb_unreserve_pages(). The
call to restore_reserve_on_error 'should' fix up the reserve map to
align with restoring the reserve count in put_page/free_huge_page.
Can you explain why that is there?

> +			put_page(page);
> +
> +			/* Allocate a temporary page to hold the copied
> +			 * contents.
> +			 */
> +			page = alloc_migrate_huge_page(h, gfp_mask, node,
> +						       nodemask);

I would suggest calling alloc_huge_page_vma to allocate the page here.
alloc_huge_page_vma will allocate a huge page from the pool if one is
available, and fall back to the buddy allocator. This is the way the
migration code works. alloc_huge_page_vma has the advantage that it
'can' work for gigantic pages if some are available in the pool. Note
that alloc_migrate_huge_page immediately checks for, and fails if, a
gigantic page is needed.

If we use alloc_migrate_huge_page, the userfaultfd selftest you are
using will fail when tested with 'hacked code' to ALWAYS take this
error path. That is because we will need two huge pages (one temporary)
and only one is available. I believe this is OK. As mentioned, this is
how the migration code works.

> +			if (IS_ERR(page)) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
>  			*pagep = page;
> -			/* don't free the page */
> +			/* Set the outparam pagep and return to the caller to
> +			 * copy the contents outside the lock. Don't free the
> +			 * page.
> +			 */
>  			goto out;
>  		}
>  	} else {
> -		page = *pagep;
> +		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

Again, check for vm_shared before calling hugetlbfs_pagecache_present.

> +			put_page(*pagep);
> +			ret = -EEXIST;
> +			goto out;

We need to set *pagep = NULL before returning. The calling routine
__mcopy_atomic_hugetlb will BUG() if we return with an error code other
than -ENOENT and *pagep not NULL.

> +		}
> +
> +		page = alloc_huge_page(dst_vma, dst_addr, 0);
> +		if (IS_ERR(page)) {
> +			ret = -ENOMEM;

Again, clear *pagep before returning.

> +			goto out;
> +		}
> +		__copy_gigantic_page(page, *pagep, pages_per_huge_page(h));

See the previous discussion about the copy routine.

> +		put_page(*pagep);
>  		*pagep = NULL;
>  	}

Later in hugetlb_mcopy_atomic_pte, we will update the page cache and
page table. There are error checks when performing these operations.
It is unlikely we will ever hit one of these errors. However, if we do
encounter an error, the code should call restore_reserve_on_error
before the put_page.
Perhaps something like:

@@ -4996,6 +4996,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (vm_shared || is_continue)
 		unlock_page(page);
 out_release_nounlock:
+	restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }

>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6b37d00890ca..d3437f9a608d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c

< snip code dealing with movement of copy routines >

With changes like those above in hugetlb_mcopy_atomic_pte, we should be
able to remove the following code in __mcopy_atomic_hugetlb. We can do
this because any page returned/passed to __mcopy_atomic_hugetlb will
NOT have consumed a reserve. It would be a nice cleanup, as that big
comment explains how we currently guess what is the right thing to do
in this situation.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 63a73e164d55..e13a0492b7ba 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -346,54 +346,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 out_unlock:
 	mmap_read_unlock(dst_mm);
 out:
-	if (page) {
-		/*
-		 * We encountered an error and are about to free a newly
-		 * allocated huge page.
-		 *
-		 * Reservation handling is very subtle, and is different for
-		 * private and shared mappings. See the routine
-		 * restore_reserve_on_error for details. Unfortunately, we
-		 * can not call restore_reserve_on_error now as it would
-		 * require holding mmap_lock.
-		 *
-		 * If a reservation for the page existed in the reservation
-		 * map of a private mapping, the map was modified to indicate
-		 * the reservation was consumed when the page was allocated.
-		 * We clear the HPageRestoreReserve flag now so that the global
-		 * reserve count will not be incremented in free_huge_page.
-		 * The reservation map will still indicate the reservation
-		 * was consumed and possibly prevent later page allocation.
-		 * This is better than leaking a global reservation. If no
-		 * reservation existed, it is still safe to clear
-		 * HPageRestoreReserve as no adjustments to reservation counts
-		 * were made during allocation.
-		 *
-		 * The reservation map for shared mappings indicates which
-		 * pages have reservations. When a huge page is allocated
-		 * for an address with a reservation, no change is made to
-		 * the reserve map. In this case HPageRestoreReserve will be
-		 * set to indicate that the global reservation count should be
-		 * incremented when the page is freed. This is the desired
-		 * behavior. However, when a huge page is allocated for an
-		 * address without a reservation a reservation entry is added
-		 * to the reservation map, and HPageRestoreReserve will not be
-		 * set. When the page is freed, the global reserve count will
-		 * NOT be incremented and it will appear as though we have
-		 * leaked reserved page. In this case, set HPageRestoreReserve
-		 * so that the global reserve count will be incremented to
-		 * match the reservation map entry which was created.
-		 *
-		 * Note that vm_alloc_shared is based on the flags of the vma
-		 * for which the page was originally allocated. dst_vma could
-		 * be different or NULL on error.
-		 */
-		if (vm_alloc_shared)
-			SetHPageRestoreReserve(page);
-		else
-			ClearHPageRestoreReserve(page);
+	if (page)
 		put_page(page);
-	}
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);
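Putting Mike's comments on the non-NULL *pagep branch together (the
vm_shared guard, clearing *pagep before any error return, and reusing
the generic copy routine), the branch could end up looking roughly like
the fragment below. This is only a sketch of the suggested direction,
not code from the thread; it assumes copy_huge_page() has been made
callable from outside migrate.c as suggested above:

	} else {
		if (vm_shared &&
		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
			put_page(*pagep);
			*pagep = NULL;	/* caller BUG()s on non-NULL *pagep */
			ret = -EEXIST;
			goto out;
		}

		page = alloc_huge_page(dst_vma, dst_addr, 0);
		if (IS_ERR(page)) {
			/* Drop the temporary page too; once *pagep is
			 * cleared the caller will not free it.
			 */
			put_page(*pagep);
			*pagep = NULL;
			ret = -ENOMEM;
			goto out;
		}
		copy_huge_page(page, *pagep);	/* instead of __copy_gigantic_page() */
		put_page(*pagep);
		*pagep = NULL;
	}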
> > +			if (!HPageRestoreReserve(page)) {
> > +				if (unlikely(hugetlb_unreserve_pages(
> > +						mapping->host, idx, idx + 1, 1)))
> > +					hugetlb_fix_reserve_counts(
> > +						mapping->host);
> > +			}
>
> I do not understand the need to call hugetlb_unreserve_pages(). The
> call to restore_reserve_on_error 'should' fix up the reserve map to
> align with restoring the reserve count in put_page/free_huge_page.
> Can you explain why that is there?
>

AFAICT here is what happens for a given index *without* the call to
hugetlb_unreserve_pages():

1. hugetlb_no_page() allocates a page consuming the reservation;
   resv_huge_pages decrements.

2. remove_inode_hugepages() does remove_huge_page() and
   hugetlb_unreserve_pages(). This removes the entry from the resv_map,
   but does NOT increment back the resv_huge_pages. Because we removed
   the entry, it looks like we have no reservation for this index.
   free_huge_page() gets called on this page, and resv_huge_pages is
   not incremented; I'm not sure why, as this page should have come
   from the reserves.

3. hugetlb_mcopy_atomic_pte() gets called for this index. Because of
   the prior call to hugetlb_unreserve_pages(), there is no entry in
   the resv_map for this index, which means it looks like we don't have
   a reservation for this index. We allocate a page outside the
   reserves (deferred_reserve=1, HPageRestoreReserve=0), add an entry
   into the resv_map, and don't modify resv_huge_pages.

4. The copy fails and we deallocate the page. Since
   HPageRestoreReserve==0 for this page, restore_reserve_on_error()
   does nothing.

5. hugetlb_mcopy_atomic_pte() gets recalled with the temporary page,
   and we allocate another page. Now, since we added an entry into the
   resv_map in the previous allocation, it looks like we have a
   reservation for this allocation. We allocate a page with
   deferred_reserve=0 && HPageRestoreReserve=1, and we decrement
   resv_huge_pages.

Boom: we decremented resv_huge_pages twice for this index and never
incremented it.

To fix this, in step 4, when I deallocate a page, I check
HPageRestoreReserve(page). If HPageRestoreReserve=0, then this
reservation was consumed and deallocated before, and so I need to
remove the entry from the resv_map.
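Mina's five steps condense into a small userspace model of the
accounting (assumed semantics, for illustration only; the kernel's
counter is unsigned, so the final -1 shows up there as a huge wrapped
value):

#include <stdio.h>
#include <stdbool.h>

/* Toy model of one index's reservation state; not kernel code. */
static long resv_huge_pages = 1;
static bool resv_map_entry = true;

static void toy_alloc_huge_page(void)
{
	if (resv_map_entry)
		resv_huge_pages--;	/* reservation consumed */
	else
		resv_map_entry = true;	/* entry added, nothing consumed */
}

int main(void)
{
	toy_alloc_huge_page();	/* 1: fault; resv 1 -> 0 */
	resv_map_entry = false;	/* 2: hole punch removes the entry;
				 *    the free restores nothing */
	toy_alloc_huge_page();	/* 3: UFFDIO_COPY; no entry, so nothing
				 *    consumed; entry re-added */
				/* 4: copy fails; free restores nothing,
				 *    but the stale entry remains */
	toy_alloc_huge_page();	/* 5: retry sees the entry; resv 0 -> -1 */

	printf("resv_huge_pages = %ld\n", resv_huge_pages);
	return 0;
}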
On 5/24/21 5:11 PM, Mina Almasry wrote:
>>> +			if (!HPageRestoreReserve(page)) {
>>> +				if (unlikely(hugetlb_unreserve_pages(
>>> +						mapping->host, idx, idx + 1, 1)))
>>> +					hugetlb_fix_reserve_counts(
>>> +						mapping->host);
>>> +			}
>>
>> I do not understand the need to call hugetlb_unreserve_pages(). The
>> call to restore_reserve_on_error 'should' fix up the reserve map to
>> align with restoring the reserve count in put_page/free_huge_page.
>> Can you explain why that is there?
>>
>
> AFAICT here is what happens for a given index *without* the call to
> hugetlb_unreserve_pages():
>
> 1. hugetlb_no_page() allocates a page consuming the reservation;
>    resv_huge_pages decrements.

Ok

> 2. remove_inode_hugepages() does remove_huge_page() and
>    hugetlb_unreserve_pages(). This removes the entry from the resv_map,
>    but does NOT increment back the resv_huge_pages. Because we removed
>    the entry, it looks like we have no reservation for this index.
>    free_huge_page() gets called on this page, and resv_huge_pages is
>    not incremented; I'm not sure why, as this page should have come
>    from the reserves.

This corresponds to a 'hole punch' operation. When hugetlbfs hole punch
was added, a design decision was made to not try to restore
reservations. IIRC, this is because the hole punch is associated with
the file and reservations are associated with mappings. Trying to 'go
back' to associated mappings and determine if a reservation should be
restored would be a difficult exercise.

> 3. hugetlb_mcopy_atomic_pte() gets called for this index. Because of
>    the prior call to hugetlb_unreserve_pages(), there is no entry in
>    the resv_map for this index, which means it looks like we don't have
>    a reservation for this index. We allocate a page outside the
>    reserves (deferred_reserve=1, HPageRestoreReserve=0), add an entry
>    into the resv_map, and don't modify resv_huge_pages.

Ok

> 4. The copy fails and we deallocate the page. Since
>    HPageRestoreReserve==0 for this page, restore_reserve_on_error()
>    does nothing.

Yes. And this may be something we want to fix in the more general case.
i.e. I believe this can happen in code paths other than
hugetlb_mcopy_atomic_pte. Rather than add special code here, I'll look
into updating restore_reserve_on_error to handle this. Currently,
restore_reserve_on_error only handles the case where the
HPageRestoreReserve flag is set.

Thanks for explaining this! I forgot about this 'special case' where
there is not an entry in the reserve map for a shared mapping.
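For illustration, the direction Mike describes might look roughly like
the sketch below. This is speculative, not a posted patch: the
HPageRestoreReserve branch mirrors the existing helper, and
vma_del_reservation() is a hypothetical counterpart to the existing
vma_add_reservation()/vma_needs_reservation()/vma_end_reservation()
helpers in mm/hugetlb.c:

static void restore_reserve_on_error(struct hstate *h,
				     struct vm_area_struct *vma,
				     unsigned long address,
				     struct page *page)
{
	if (HPageRestoreReserve(page)) {
		/* Existing behavior: re-instate the consumed reservation. */
		long rc = vma_needs_reservation(h, vma, address);

		if (unlikely(rc < 0))
			/*
			 * The map could not be fixed; clear the flag so the
			 * global count stays consistent with the map, at
			 * the cost of a lost reservation.
			 */
			ClearHPageRestoreReserve(page);
		else if (rc)
			(void)vma_add_reservation(h, vma, address);
		else
			vma_end_reservation(h, vma, address);
	} else {
		/*
		 * The allocation added a reserve map entry without
		 * consuming a reservation (e.g. after a hole punch).
		 * Remove it so a retry at this index does not appear to
		 * hold a reservation it never consumed.
		 * (vma_del_reservation() is hypothetical.)
		 */
		(void)vma_del_reservation(h, vma, address);
	}
}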
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b92f25ccef58..427974510965 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -194,6 +194,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 bool is_hugetlb_entry_migration(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
 
+void hugetlb_copy_page(struct page *dst, struct page *src);
+
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
@@ -379,6 +381,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
 
 static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
 
+static inline void hugetlb_copy_page(struct page *dst, struct page *src);
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 /*
  * hugepages at page global directory. If arch support
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..cb041c97a558 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -81,6 +81,45 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
+/*
+ * Gigantic pages are so large that we do not guarantee that page++ pointer
+ * arithmetic will work across the entire page. We need something more
+ * specialized.
+ */
+static void __copy_gigantic_page(struct page *dst, struct page *src,
+				 int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages;) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+void hugetlb_copy_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+	int nr_pages = pages_per_huge_page(h);
+
+	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+		__copy_gigantic_page(dst, src, nr_pages);
+		return;
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		copy_highpage(dst + i, src + i);
+	}
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -4868,19 +4907,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-
-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	gfp_t gfp_mask = htlb_alloc_mask(h);
+	int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, &nodemask);
 
 	if (is_continue) {
 		ret = -EFAULT;
@@ -4888,7 +4928,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		if (!page)
 			goto out;
 	} else if (!*pagep) {
-		ret = -ENOMEM;
+		/* If a page already exists, then it's UFFDIO_COPY for
+		 * a non-missing case. Return -EEXIST.
+		 */
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			ret = -EEXIST;
+			goto out;
+		}
+
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
 		if (IS_ERR(page))
 			goto out;
@@ -4900,12 +4947,48 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
 			ret = -ENOENT;
+			/* Free the allocated page which may have
+			 * consumed a reservation.
+			 */
+			restore_reserve_on_error(h, dst_vma, dst_addr, page);
+			if (!HPageRestoreReserve(page)) {
+				if (unlikely(hugetlb_unreserve_pages(
+						mapping->host, idx, idx + 1, 1)))
+					hugetlb_fix_reserve_counts(
+						mapping->host);
+			}
+			put_page(page);
+
+			/* Allocate a temporary page to hold the copied
+			 * contents.
+			 */
+			page = alloc_migrate_huge_page(h, gfp_mask, node,
+						       nodemask);
+			if (IS_ERR(page)) {
+				ret = -ENOMEM;
+				goto out;
+			}
 			*pagep = page;
-			/* don't free the page */
+			/* Set the outparam pagep and return to the caller to
+			 * copy the contents outside the lock. Don't free the
+			 * page.
+			 */
 			goto out;
 		}
 	} else {
-		page = *pagep;
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			put_page(*pagep);
+			ret = -EEXIST;
+			goto out;
+		}
+
+		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		if (IS_ERR(page)) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		__copy_gigantic_page(page, *pagep, pages_per_huge_page(h));
+		put_page(*pagep);
 		*pagep = NULL;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 6b37d00890ca..d3437f9a608d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -528,28 +528,6 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return MIGRATEPAGE_SUCCESS;
 }
 
-/*
- * Gigantic pages are so large that we do not guarantee that page++ pointer
- * arithmetic will work across the entire page. We need something more
- * specialized.
- */
-static void __copy_gigantic_page(struct page *dst, struct page *src,
-				 int nr_pages)
-{
-	int i;
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-
-	for (i = 0; i < nr_pages; ) {
-		cond_resched();
-		copy_highpage(dst, src);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-
 static void copy_huge_page(struct page *dst, struct page *src)
 {
 	int i;
@@ -557,19 +535,14 @@ static void copy_huge_page(struct page *dst, struct page *src)
 
 	if (PageHuge(src)) {
 		/* hugetlbfs page */
-		struct hstate *h = page_hstate(src);
-		nr_pages = pages_per_huge_page(h);
-
-		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
-			__copy_gigantic_page(dst, src, nr_pages);
-			return;
-		}
-	} else {
-		/* thp page */
-		BUG_ON(!PageTransHuge(src));
-		nr_pages = thp_nr_pages(src);
+		hugetlb_copy_page(dst, src);
+		return;
 	}
 
+	/* thp page */
+	BUG_ON(!PageTransHuge(src));
+	nr_pages = thp_nr_pages(src);
+
 	for (i = 0; i < nr_pages; i++) {
 		cond_resched();
 		copy_highpage(dst + i, src + i);
The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
an index for which we already have a page in the cache. When this
happens, we allocate a second page, double consuming the reservation,
and then fail to insert the page into the cache and return -EEXIST.

To fix this, we first check if there exists a page in the cache which
already consumed the reservation, and return -EEXIST immediately if so.

Secondly, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller
after allocating a page that consumed a reservation. In this case there
may be a fault that double consumes the reservation. To handle this, we
free the allocated page, fix the reservations, and allocate a temporary
hugetlb page and return that to the caller. When the caller does the
copy outside of the lock, we again check the cache, and allocate a page
consuming the reservation, and copy over the contents.

Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and copy_huge_page_from_user() always fails, then:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
    2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
    2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings. After the test runs, the
number of free/resv hugepages is correct.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org

---
 include/linux/hugetlb.h |   4 ++
 mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
 mm/migrate.c            |  39 +++------------
 3 files changed, 103 insertions(+), 43 deletions(-)

--
2.31.1.818.g46aad6cb9e-goog