
[4/8] hugetlb: handle truncate racing with page faults

Message ID 20220824175757.20590-5-mike.kravetz@oracle.com (mailing list archive)
State New
Series hugetlb: Use new vma mutex for huge pmd sharing synchronization

Commit Message

Mike Kravetz Aug. 24, 2022, 5:57 p.m. UTC
When page fault code needs to allocate and instantiate a new hugetlb
page (hugetlb_no_page), it checks early to determine if the fault is
beyond i_size.  When discovered early, it is easy to abort the fault and
return an error.  However, it becomes much more difficult to handle when
discovered later, after allocating the page, consuming reservations,
and adding to the page cache.  Backing out changes in such instances
becomes difficult and error prone.

Instead of trying to catch and back out all such races, use the hugetlb
fault mutex to handle truncate racing with page faults.  The most
significant change is modification of the routine remove_inode_hugepages
such that it will take the fault mutex for EVERY index in the truncated
range (or hole in the case of hole punch).  Since remove_inode_hugepages
is called in the truncate path after updating i_size, we can experience
races as follows:
- truncate code updates i_size and takes fault mutex before a racing
  fault.  After fault code takes mutex, it will notice fault beyond
  i_size and abort early.
- fault code obtains mutex, and truncate updates i_size after early
  checks in fault code.  fault code will add page beyond i_size.
  When truncate code takes mutex for page/index, it will remove the
  page.
- truncate updates i_size, but fault code obtains mutex first.  If
  fault code sees updated i_size it will abort early.  If fault code
  does not see updated i_size, it will add page beyond i_size and
  truncate code will remove page when it obtains fault mutex.

Note, for performance reasons remove_inode_hugepages will still use
filemap_get_folios for bulk folio lookups.  For indices not returned in
the bulk lookup, it will need to look up individual folios to check for
races with page faults.
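
To make the locking scheme concrete, below is a rough sketch of the per-index
loop described above.  It is illustration only and not the patch code: only
hugetlb_fault_mutex_hash(), hugetlb_fault_mutex_table and filemap_get_folio()
are existing kernel symbols; remove_and_unreserve_folio() is a hypothetical
stand-in for the real removal logic.

	for (index = start; index < end; index++) {
		u32 hash = hugetlb_fault_mutex_hash(mapping, index);

		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		/*
		 * With the mutex held, either a racing fault already added
		 * a page beyond i_size (remove it here), or the fault will
		 * see the updated i_size and abort early.
		 */
		folio = filemap_get_folio(mapping, index);
		if (folio) {
			remove_and_unreserve_folio(folio);	/* hypothetical */
			folio_put(folio);
		}
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	}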

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
 mm/hugetlb.c         |  41 +++++-----
 2 files changed, 152 insertions(+), 73 deletions(-)

Comments

Mike Kravetz Aug. 25, 2022, 5 p.m. UTC | #1
On 08/24/22 10:57, Mike Kravetz wrote:
> When page fault code needs to allocate and instantiate a new hugetlb
> page (huegtlb_no_page), it checks early to determine if the fault is
> beyond i_size.  When discovered early, it is easy to abort the fault and
> return an error.  However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache.  Backing out changes in such instances
> becomes difficult and error prone.
> 
> Instead of trying to catch and backout all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults.  The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch).  Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
>   fault.  After fault code takes mutex, it will notice fault beyond
>   i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
>   checks in fault code.  fault code will add page beyond i_size.
>   When truncate code takes mutex for page/index, it will remove the
>   page.
> - truncate updates i_size, but fault code obtains mutex first.  If
>   fault code sees updated i_size it will abort early.  If fault code
>   does not see updated i_size, it will add page beyond i_size and
>   truncate code will remove page when it obtains fault mutex.
> 
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups.  For indicies not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
> 
<snip>
>  /*
>   * remove_inode_hugepages handles two distinct cases: truncation and hole
>   * punch.  There are subtle differences in operation for each case.
> @@ -418,11 +507,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>   * truncation is indicated by end of range being LLONG_MAX
>   *	In this case, we first scan the range and release found pages.
>   *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
> - *	maps and global counts.  Page faults can not race with truncation
> - *	in this routine.  hugetlb_no_page() prevents page faults in the
> - *	truncated range.  It checks i_size before allocation, and again after
> - *	with the page table lock for the page held.  The same lock must be
> - *	acquired to unmap a page.
> + *	maps and global counts.  Page faults can race with truncation.
> + *	During faults, hugetlb_no_page() checks i_size before page allocation,
> + *	and again after	obtaining page table lock.  It will 'back out'
> + *	allocations in the truncated range.
>   * hole punch is indicated if end is not LLONG_MAX
>   *	In the hole punch case we scan the range and release found pages.
>   *	Only when releasing a page is the associated region/reserve map
> @@ -431,75 +519,69 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>   *	This is indicated if we find a mapped page.
>   * Note: If the passed end of range value is beyond the end of file, but
>   * not LLONG_MAX this routine still performs a hole punch operation.
> + *
> + * Since page faults can race with this routine, care must be taken as both
> + * modify huge page reservation data.  To somewhat synchronize these operations
> + * the hugetlb fault mutex is taken for EVERY index in the range to be hole
> + * punched or truncated.  In this way, we KNOW either:
> + * - fault code has added a page beyond i_size, and we will remove here
> + * - fault code will see updated i_size and not add a page beyond
> + * The parameter 'lm__end' indicates the offset of the end of hole or file
> + * before truncation.  For hole punch lm_end == lend.
>   */
>  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> -				   loff_t lend)
> +				   loff_t lend, loff_t lm_end)
>  {
>  	struct hstate *h = hstate_inode(inode);
>  	struct address_space *mapping = &inode->i_data;
>  	const pgoff_t start = lstart >> huge_page_shift(h);
>  	const pgoff_t end = lend >> huge_page_shift(h);
> +	pgoff_t m_end = lm_end >> huge_page_shift(h);
> +	pgoff_t m_start, m_index;
>  	struct folio_batch fbatch;
> +	struct folio *folio;
>  	pgoff_t next, index;
> -	int i, freed = 0;
> +	unsigned int i;
> +	long freed = 0;
> +	u32 hash;
>  	bool truncate_op = (lend == LLONG_MAX);
>  
>  	folio_batch_init(&fbatch);
> -	next = start;
> +	next = m_start = start;
>  	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
>  		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> -			struct folio *folio = fbatch.folios[i];
> -			u32 hash = 0;
> +			folio = fbatch.folios[i];
>  
>  			index = folio->index;
> -			hash = hugetlb_fault_mutex_hash(mapping, index);
> -			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> -
>  			/*
> -			 * If folio is mapped, it was faulted in after being
> -			 * unmapped in caller.  Unmap (again) now after taking
> -			 * the fault mutex.  The mutex will prevent faults
> -			 * until we finish removing the folio.
> -			 *
> -			 * This race can only happen in the hole punch case.
> -			 * Getting here in a truncate operation is a bug.
> +			 * Take fault mutex for missing folios before index,
> +			 * while checking folios that might have been added
> +			 * due to a race with fault code.
>  			 */
> -			if (unlikely(folio_mapped(folio))) {
> -				BUG_ON(truncate_op);
> -
> -				i_mmap_lock_write(mapping);
> -				hugetlb_vmdelete_list(&mapping->i_mmap,
> -					index * pages_per_huge_page(h),
> -					(index + 1) * pages_per_huge_page(h),
> -					ZAP_FLAG_DROP_MARKER);
> -				i_mmap_unlock_write(mapping);
> -			}
> +			freed += fault_lock_inode_indicies(h, inode, mapping,
> +						m_start, m_index, truncate_op);

This should be 'index' instead of 'm_index' as discovered here:
https://lore.kernel.org/linux-mm/CA+G9fYsHVdu0toduQqk6vsR8Z8mOVzZ9-_p3O5fjQ5mOpSxsDA@mail.gmail.com/
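
I.e., the call above would become (sketch of the fix only, not a tested hunk):

			freed += fault_lock_inode_indicies(h, inode, mapping,
						m_start, index, truncate_op);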
Miaohe Lin Aug. 27, 2022, 8:02 a.m. UTC | #2
On 2022/8/25 1:57, Mike Kravetz wrote:
> When page fault code needs to allocate and instantiate a new hugetlb
> page (huegtlb_no_page), it checks early to determine if the fault is
> beyond i_size.  When discovered early, it is easy to abort the fault and
> return an error.  However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache.  Backing out changes in such instances
> becomes difficult and error prone.
> 
> Instead of trying to catch and backout all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults.  The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch).  Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
>   fault.  After fault code takes mutex, it will notice fault beyond
>   i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
>   checks in fault code.  fault code will add page beyond i_size.
>   When truncate code takes mutex for page/index, it will remove the
>   page.
> - truncate updates i_size, but fault code obtains mutex first.  If
>   fault code sees updated i_size it will abort early.  If fault code
>   does not see updated i_size, it will add page beyond i_size and
>   truncate code will remove page when it obtains fault mutex.
> 
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups.  For indicies not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
>  mm/hugetlb.c         |  41 +++++-----
>  2 files changed, 152 insertions(+), 73 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d98c6edbd1a4..e83fd31671b3 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -411,6 +411,95 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>  	}
>  }
>  
> +/*
> + * Called with hugetlb fault mutex held.
> + * Returns true if page was actually removed, false otherwise.
> + */
> +static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> +					struct address_space *mapping,
> +					struct folio *folio, pgoff_t index,
> +					bool truncate_op)
> +{
> +	bool ret = false;
> +
> +	/*
> +	 * If folio is mapped, it was faulted in after being
> +	 * unmapped in caller.  Unmap (again) while holding
> +	 * the fault mutex.  The mutex will prevent faults
> +	 * until we finish removing the folio.
> +	 */
> +	if (unlikely(folio_mapped(folio))) {
> +		i_mmap_lock_write(mapping);
> +		hugetlb_vmdelete_list(&mapping->i_mmap,
> +					index * pages_per_huge_page(h),
> +					(index + 1) * pages_per_huge_page(h),
> +					ZAP_FLAG_DROP_MARKER);
> +		i_mmap_unlock_write(mapping);
> +	}
> +
> +	folio_lock(folio);
> +	/*
> +	 * After locking page, make sure mapping is the same.
> +	 * We could have raced with page fault populate and
> +	 * backout code.
> +	 */
> +	if (folio_mapping(folio) == mapping) {

Could you explain this more? IIUC, page fault won't remove the hugetlb page from the page
cache anymore. So this check is unneeded? Or should we keep the check anyway to guard
against future code changes?

> +		/*
> +		 * We must remove the folio from page cache before removing
> +		 * the region/ reserve map (hugetlb_unreserve_pages).  In
> +		 * rare out of memory conditions, removal of the region/reserve
> +		 * map could fail.  Correspondingly, the subpool and global
> +		 * reserve usage count can need to be adjusted.
> +		 */
> +		VM_BUG_ON(HPageRestoreReserve(&folio->page));
> +		hugetlb_delete_from_page_cache(&folio->page);
> +		ret = true;
> +		if (!truncate_op) {
> +			if (unlikely(hugetlb_unreserve_pages(inode, index,
> +								index + 1, 1)))
> +				hugetlb_fix_reserve_counts(inode);
> +		}
> +	}
> +
> +	folio_unlock(folio);
> +	return ret;
> +}

<snip>
> @@ -5584,9 +5585,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  		clear_huge_page(page, address, pages_per_huge_page(h));
>  		__SetPageUptodate(page);
>  		new_page = true;
> +		if (HPageRestoreReserve(page))
> +			reserve_alloc = true;
>  
>  		if (vma->vm_flags & VM_MAYSHARE) {
> -			int err = hugetlb_add_to_page_cache(page, mapping, idx);
> +			int err;
> +
> +			err = hugetlb_add_to_page_cache(page, mapping, idx);
>  			if (err) {
>  				restore_reserve_on_error(h, vma, haddr, page);
>  				put_page(page);
> @@ -5642,10 +5647,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  	}
>  
>  	ptl = huge_pte_lock(h, mm, ptep);
> -	size = i_size_read(mapping->host) >> huge_page_shift(h);
> -	if (idx >= size)
> -		goto backout;
> -
>  	ret = 0;
>  	/* If pte changed from under us, retry */
>  	if (!pte_same(huge_ptep_get(ptep), old_pte))
> @@ -5689,10 +5690,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  backout:
>  	spin_unlock(ptl);
>  backout_unlocked:
> -	unlock_page(page);
> -	/* restore reserve for newly allocated pages not in page cache */
> -	if (new_page && !new_pagecache_page)
> +	if (new_page && !new_pagecache_page) {
> +		/*
> +		 * If reserve was consumed, make sure flag is set so that it
> +		 * will be restored in free_huge_page().
> +		 */
> +		if (reserve_alloc)
> +			SetHPageRestoreReserve(page);

If code reaches here, it should be a newly allocated page that has not been added to the hugetlb page cache.
Note that a failure to add the page to the page cache would have returned already. So the page must be
anon? If so, HPageRestoreReserve hasn't been cleared yet, as it is cleared right before set_huge_pte. Thus the
above check can be removed?

Anyway, the patch looks good to me.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
Miaohe Lin
Mike Kravetz Aug. 29, 2022, 9:53 p.m. UTC | #3
On 08/27/22 16:02, Miaohe Lin wrote:
> On 2022/8/25 1:57, Mike Kravetz wrote:
> > When page fault code needs to allocate and instantiate a new hugetlb
> > page (huegtlb_no_page), it checks early to determine if the fault is
> > beyond i_size.  When discovered early, it is easy to abort the fault and
> > return an error.  However, it becomes much more difficult to handle when
> > discovered later after allocating the page and consuming reservations
> > and adding to the page cache.  Backing out changes in such instances
> > becomes difficult and error prone.
> > 
> > Instead of trying to catch and backout all such races, use the hugetlb
> > fault mutex to handle truncate racing with page faults.  The most
> > significant change is modification of the routine remove_inode_hugepages
> > such that it will take the fault mutex for EVERY index in the truncated
> > range (or hole in the case of hole punch).  Since remove_inode_hugepages
> > is called in the truncate path after updating i_size, we can experience
> > races as follows.
> > - truncate code updates i_size and takes fault mutex before a racing
> >   fault.  After fault code takes mutex, it will notice fault beyond
> >   i_size and abort early.
> > - fault code obtains mutex, and truncate updates i_size after early
> >   checks in fault code.  fault code will add page beyond i_size.
> >   When truncate code takes mutex for page/index, it will remove the
> >   page.
> > - truncate updates i_size, but fault code obtains mutex first.  If
> >   fault code sees updated i_size it will abort early.  If fault code
> >   does not see updated i_size, it will add page beyond i_size and
> >   truncate code will remove page when it obtains fault mutex.
> > 
> > Note, for performance reasons remove_inode_hugepages will still use
> > filemap_get_folios for bulk folio lookups.  For indicies not returned in
> > the bulk lookup, it will need to lookup individual folios to check for
> > races with page fault.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> >  mm/hugetlb.c         |  41 +++++-----
> >  2 files changed, 152 insertions(+), 73 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index d98c6edbd1a4..e83fd31671b3 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -411,6 +411,95 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> >  	}
> >  }
> >  
> > +/*
> > + * Called with hugetlb fault mutex held.
> > + * Returns true if page was actually removed, false otherwise.
> > + */
> > +static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> > +					struct address_space *mapping,
> > +					struct folio *folio, pgoff_t index,
> > +					bool truncate_op)
> > +{
> > +	bool ret = false;
> > +
> > +	/*
> > +	 * If folio is mapped, it was faulted in after being
> > +	 * unmapped in caller.  Unmap (again) while holding
> > +	 * the fault mutex.  The mutex will prevent faults
> > +	 * until we finish removing the folio.
> > +	 */
> > +	if (unlikely(folio_mapped(folio))) {
> > +		i_mmap_lock_write(mapping);
> > +		hugetlb_vmdelete_list(&mapping->i_mmap,
> > +					index * pages_per_huge_page(h),
> > +					(index + 1) * pages_per_huge_page(h),
> > +					ZAP_FLAG_DROP_MARKER);
> > +		i_mmap_unlock_write(mapping);
> > +	}
> > +
> > +	folio_lock(folio);
> > +	/*
> > +	 * After locking page, make sure mapping is the same.
> > +	 * We could have raced with page fault populate and
> > +	 * backout code.
> > +	 */
> > +	if (folio_mapping(folio) == mapping) {
> 
> Could you explain this more? IIUC, page fault won't remove the hugetlb page from page
> cache anymore. So this check is unneeded? Or we should always check this in case future
> code changing?
> 

You are correct, with the updated code we should never hit this condition.
The faulting code will not remove pages from the page cache.
The check can be removed.

> > +		/*
> > +		 * We must remove the folio from page cache before removing
> > +		 * the region/ reserve map (hugetlb_unreserve_pages).  In
> > +		 * rare out of memory conditions, removal of the region/reserve
> > +		 * map could fail.  Correspondingly, the subpool and global
> > +		 * reserve usage count can need to be adjusted.
> > +		 */
> > +		VM_BUG_ON(HPageRestoreReserve(&folio->page));
> > +		hugetlb_delete_from_page_cache(&folio->page);
> > +		ret = true;
> > +		if (!truncate_op) {
> > +			if (unlikely(hugetlb_unreserve_pages(inode, index,
> > +								index + 1, 1)))
> > +				hugetlb_fix_reserve_counts(inode);
> > +		}
> > +	}
> > +
> > +	folio_unlock(folio);
> > +	return ret;
> > +}
> 
> <snip>
> > @@ -5584,9 +5585,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >  		clear_huge_page(page, address, pages_per_huge_page(h));
> >  		__SetPageUptodate(page);
> >  		new_page = true;
> > +		if (HPageRestoreReserve(page))
> > +			reserve_alloc = true;
> >  
> >  		if (vma->vm_flags & VM_MAYSHARE) {
> > -			int err = hugetlb_add_to_page_cache(page, mapping, idx);
> > +			int err;
> > +
> > +			err = hugetlb_add_to_page_cache(page, mapping, idx);
> >  			if (err) {
> >  				restore_reserve_on_error(h, vma, haddr, page);
> >  				put_page(page);
> > @@ -5642,10 +5647,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >  	}
> >  
> >  	ptl = huge_pte_lock(h, mm, ptep);
> > -	size = i_size_read(mapping->host) >> huge_page_shift(h);
> > -	if (idx >= size)
> > -		goto backout;
> > -
> >  	ret = 0;
> >  	/* If pte changed from under us, retry */
> >  	if (!pte_same(huge_ptep_get(ptep), old_pte))
> > @@ -5689,10 +5690,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >  backout:
> >  	spin_unlock(ptl);
> >  backout_unlocked:
> > -	unlock_page(page);
> > -	/* restore reserve for newly allocated pages not in page cache */
> > -	if (new_page && !new_pagecache_page)
> > +	if (new_page && !new_pagecache_page) {
> > +		/*
> > +		 * If reserve was consumed, make sure flag is set so that it
> > +		 * will be restored in free_huge_page().
> > +		 */
> > +		if (reserve_alloc)
> > +			SetHPageRestoreReserve(page);
> 
> If code reaches here, it should be a newly allocated page and it's not added to the hugetlb page cache.
> Note that failing to add the page to hugetlb page cache should have returned already. So the page must be
> anon? If so, HPageRestoreReserve isn't cleared yet as it's cleared right before set_huge_pte. Thus above
> check can be removed?

You are correct again.  The above check can be removed.

Thanks!

I will remove them in a v2 series.
Sven Schnelle Sept. 6, 2022, 1:57 p.m. UTC | #4
Hi Mike,

Mike Kravetz <mike.kravetz@oracle.com> writes:

> When page fault code needs to allocate and instantiate a new hugetlb
> page (huegtlb_no_page), it checks early to determine if the fault is
> beyond i_size.  When discovered early, it is easy to abort the fault and
> return an error.  However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache.  Backing out changes in such instances
> becomes difficult and error prone.
>
> Instead of trying to catch and backout all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults.  The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch).  Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
>   fault.  After fault code takes mutex, it will notice fault beyond
>   i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
>   checks in fault code.  fault code will add page beyond i_size.
>   When truncate code takes mutex for page/index, it will remove the
>   page.
> - truncate updates i_size, but fault code obtains mutex first.  If
>   fault code sees updated i_size it will abort early.  If fault code
>   does not see updated i_size, it will add page beyond i_size and
>   truncate code will remove page when it obtains fault mutex.
>
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups.  For indicies not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
>  mm/hugetlb.c         |  41 +++++-----
>  2 files changed, 152 insertions(+), 73 deletions(-)

With linux-next starting from next-20220831 I see hangs with this
patch applied while running the glibc test suite. The patch doesn't
revert cleanly on top, so I checked out the commit before this one, and
with that revision everything works.

It looks like the malloc test suite in glibc triggers this. I cannot
identify a single test causing it; it seems to be the combination of
multiple tests. Running the test suite on a single CPU works. Given the
subject of the patch, that's likely not a surprise.

This is on s390, and the warning I get from RCU is:

[ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
[ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
[ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
[ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
[ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
[ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 1951.907095] Call Trace:
[ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
[ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
[ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
[ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
[ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
[ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
[ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
[ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
[ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
[ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
[ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
[ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
[ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
[ 1951.907145] Last Breaking-Event-Address:
[ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0

One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.

Any thoughts?

Thanks,
Sven
Mike Kravetz Sept. 6, 2022, 4:48 p.m. UTC | #5
On 09/06/22 15:57, Sven Schnelle wrote:
> Hi Mike,
> 
> Mike Kravetz <mike.kravetz@oracle.com> writes:
> 
> > When page fault code needs to allocate and instantiate a new hugetlb
> > page (huegtlb_no_page), it checks early to determine if the fault is
> > beyond i_size.  When discovered early, it is easy to abort the fault and
> > return an error.  However, it becomes much more difficult to handle when
> > discovered later after allocating the page and consuming reservations
> > and adding to the page cache.  Backing out changes in such instances
> > becomes difficult and error prone.
> >
> > Instead of trying to catch and backout all such races, use the hugetlb
> > fault mutex to handle truncate racing with page faults.  The most
> > significant change is modification of the routine remove_inode_hugepages
> > such that it will take the fault mutex for EVERY index in the truncated
> > range (or hole in the case of hole punch).  Since remove_inode_hugepages
> > is called in the truncate path after updating i_size, we can experience
> > races as follows.
> > - truncate code updates i_size and takes fault mutex before a racing
> >   fault.  After fault code takes mutex, it will notice fault beyond
> >   i_size and abort early.
> > - fault code obtains mutex, and truncate updates i_size after early
> >   checks in fault code.  fault code will add page beyond i_size.
> >   When truncate code takes mutex for page/index, it will remove the
> >   page.
> > - truncate updates i_size, but fault code obtains mutex first.  If
> >   fault code sees updated i_size it will abort early.  If fault code
> >   does not see updated i_size, it will add page beyond i_size and
> >   truncate code will remove page when it obtains fault mutex.
> >
> > Note, for performance reasons remove_inode_hugepages will still use
> > filemap_get_folios for bulk folio lookups.  For indicies not returned in
> > the bulk lookup, it will need to lookup individual folios to check for
> > races with page fault.
> >
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> >  mm/hugetlb.c         |  41 +++++-----
> >  2 files changed, 152 insertions(+), 73 deletions(-)
> 
> With linux next starting from next-20220831 i see hangs with this
> patch applied while running the glibc test suite. The patch doesn't
> revert cleanly on top, so i checked out one commit before that one and
> with that revision everything works.
> 
> It looks like the malloc test suite in glibc triggers this. I cannot
> identify a single test causing it, but instead the combination of
> multiple tests. Running the test suite on a single CPU works. Given the
> subject of the patch that's likely not a surprise.
> 
> This is on s390, and the warning i get from RCU is:
> 
> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> [ 1951.907095] Call Trace:
> [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
> [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
> [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
> [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
> [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
> [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
> [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
> [ 1951.907145] Last Breaking-Event-Address:
> [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
> 
> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> 
> Any thoughts?

Thanks for the report, I will take a look.

My first thought is that this fix may not be applied,
https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
However, I see that that is in next-20220831.

Hopefully, this will recreate on x86.
Mike Kravetz Sept. 6, 2022, 6:05 p.m. UTC | #6
On 09/06/22 09:48, Mike Kravetz wrote:
> On 09/06/22 15:57, Sven Schnelle wrote:
> > Hi Mike,
> > 
> > Mike Kravetz <mike.kravetz@oracle.com> writes:
> > 
> > > When page fault code needs to allocate and instantiate a new hugetlb
> > > page (huegtlb_no_page), it checks early to determine if the fault is
> > > beyond i_size.  When discovered early, it is easy to abort the fault and
> > > return an error.  However, it becomes much more difficult to handle when
> > > discovered later after allocating the page and consuming reservations
> > > and adding to the page cache.  Backing out changes in such instances
> > > becomes difficult and error prone.
> > >
> > > Instead of trying to catch and backout all such races, use the hugetlb
> > > fault mutex to handle truncate racing with page faults.  The most
> > > significant change is modification of the routine remove_inode_hugepages
> > > such that it will take the fault mutex for EVERY index in the truncated
> > > range (or hole in the case of hole punch).  Since remove_inode_hugepages
> > > is called in the truncate path after updating i_size, we can experience
> > > races as follows.
> > > - truncate code updates i_size and takes fault mutex before a racing
> > >   fault.  After fault code takes mutex, it will notice fault beyond
> > >   i_size and abort early.
> > > - fault code obtains mutex, and truncate updates i_size after early
> > >   checks in fault code.  fault code will add page beyond i_size.
> > >   When truncate code takes mutex for page/index, it will remove the
> > >   page.
> > > - truncate updates i_size, but fault code obtains mutex first.  If
> > >   fault code sees updated i_size it will abort early.  If fault code
> > >   does not see updated i_size, it will add page beyond i_size and
> > >   truncate code will remove page when it obtains fault mutex.
> > >
> > > Note, for performance reasons remove_inode_hugepages will still use
> > > filemap_get_folios for bulk folio lookups.  For indicies not returned in
> > > the bulk lookup, it will need to lookup individual folios to check for
> > > races with page fault.
> > >
> > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > ---
> > >  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> > >  mm/hugetlb.c         |  41 +++++-----
> > >  2 files changed, 152 insertions(+), 73 deletions(-)
> > 
> > With linux next starting from next-20220831 i see hangs with this
> > patch applied while running the glibc test suite. The patch doesn't
> > revert cleanly on top, so i checked out one commit before that one and
> > with that revision everything works.
> > 
> > It looks like the malloc test suite in glibc triggers this. I cannot
> > identify a single test causing it, but instead the combination of
> > multiple tests. Running the test suite on a single CPU works. Given the
> > subject of the patch that's likely not a surprise.
> > 
> > This is on s390, and the warning i get from RCU is:
> > 
> > [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> > [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> > [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> > [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> > [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> > [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> > [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> > [ 1951.907095] Call Trace:
> > [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> > [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> > [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> > [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> > [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
> > [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
> > [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
> > [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
> > [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
> > [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
> > [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> > [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> > [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
> > [ 1951.907145] Last Breaking-Event-Address:
> > [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
> > 
> > One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> > 
> > Any thoughts?
> 
> Thanks for the report, I will take a look.
> 
> My first thought is that this fix may not be applied,
> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> However, I see that that is in next-20220831.
> 
> Hopefully, this will recreate on x86.

One additional thought ...

With this patch, we will take the hugetlb fault mutex for EVERY index in the
range being truncated or hole punched.  In the case of a very large file, that
is no different from the code today, where we take the mutex when removing pages
from the file.  What is different is taking the mutex for indices that are
part of holes in the file.  Consider a very large file with only one page at
a very large offset.  We would then take the mutex for each index in that
very large hole.  Depending on the size of the hole, this could appear as a
hang.

For the above locking scheme to work, we need to take the mutex for indices
in holes in case there happens to be a racing page fault.  However, there
are only a limited number of fault mutexes (it is a table).  So, we only really
need to take at most num_fault_mutexes mutexes.  We could keep track of
these with a bitmap, as sketched below.
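
A rough sketch of that bookkeeping (ignoring the per-folio checks and error
handling; only hugetlb_fault_mutex_hash(), hugetlb_fault_mutex_table and
num_fault_mutexes are existing symbols, the rest is illustrative):

	unsigned long *done = bitmap_zalloc(num_fault_mutexes, GFP_KERNEL);
	pgoff_t index;

	for (index = hole_start; index < hole_end; index++) {
		u32 hash = hugetlb_fault_mutex_hash(mapping, index);

		if (test_bit(hash, done))
			continue;	/* this mutex was already cycled */
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		/* i_size is already updated, so a racing fault on any index
		 * hashing here will now see it and abort early */
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
		set_bit(hash, done);
	}
	bitmap_free(done);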

I am not sure this is the issue you are seeing, but a test named
tst-malloc-too-large-malloc-hugetlb2 may be doing this.

In any case, I think this issue needs to be addressed before this series can
move forward.
Mike Kravetz Sept. 6, 2022, 11:08 p.m. UTC | #7
On 09/06/22 11:05, Mike Kravetz wrote:
> On 09/06/22 09:48, Mike Kravetz wrote:
> > On 09/06/22 15:57, Sven Schnelle wrote:
> > > Hi Mike,
> > > 
> > > Mike Kravetz <mike.kravetz@oracle.com> writes:
> > > 
> > > > When page fault code needs to allocate and instantiate a new hugetlb
> > > > page (huegtlb_no_page), it checks early to determine if the fault is
> > > > beyond i_size.  When discovered early, it is easy to abort the fault and
> > > > return an error.  However, it becomes much more difficult to handle when
> > > > discovered later after allocating the page and consuming reservations
> > > > and adding to the page cache.  Backing out changes in such instances
> > > > becomes difficult and error prone.
> > > >
> > > > Instead of trying to catch and backout all such races, use the hugetlb
> > > > fault mutex to handle truncate racing with page faults.  The most
> > > > significant change is modification of the routine remove_inode_hugepages
> > > > such that it will take the fault mutex for EVERY index in the truncated
> > > > range (or hole in the case of hole punch).  Since remove_inode_hugepages
> > > > is called in the truncate path after updating i_size, we can experience
> > > > races as follows.
> > > > - truncate code updates i_size and takes fault mutex before a racing
> > > >   fault.  After fault code takes mutex, it will notice fault beyond
> > > >   i_size and abort early.
> > > > - fault code obtains mutex, and truncate updates i_size after early
> > > >   checks in fault code.  fault code will add page beyond i_size.
> > > >   When truncate code takes mutex for page/index, it will remove the
> > > >   page.
> > > > - truncate updates i_size, but fault code obtains mutex first.  If
> > > >   fault code sees updated i_size it will abort early.  If fault code
> > > >   does not see updated i_size, it will add page beyond i_size and
> > > >   truncate code will remove page when it obtains fault mutex.
> > > >
> > > > Note, for performance reasons remove_inode_hugepages will still use
> > > > filemap_get_folios for bulk folio lookups.  For indicies not returned in
> > > > the bulk lookup, it will need to lookup individual folios to check for
> > > > races with page fault.
> > > >
> > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > > ---
> > > >  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> > > >  mm/hugetlb.c         |  41 +++++-----
> > > >  2 files changed, 152 insertions(+), 73 deletions(-)
> > > 
> > > With linux next starting from next-20220831 i see hangs with this
> > > patch applied while running the glibc test suite. The patch doesn't
> > > revert cleanly on top, so i checked out one commit before that one and
> > > with that revision everything works.
> > > 
> > > It looks like the malloc test suite in glibc triggers this. I cannot
> > > identify a single test causing it, but instead the combination of
> > > multiple tests. Running the test suite on a single CPU works. Given the
> > > subject of the patch that's likely not a surprise.
> > > 
> > > This is on s390, and the warning i get from RCU is:
> > > 
> > > [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> > > [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> > > [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> > > [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> > > [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> > > [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> > > [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> > > [ 1951.907095] Call Trace:
> > > [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> > > [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> > > [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> > > [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> > > [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
> > > [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
> > > [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
> > > [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
> > > [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
> > > [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
> > > [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> > > [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> > > [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
> > > [ 1951.907145] Last Breaking-Event-Address:
> > > [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
> > > 
> > > One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> > > 
> > > Any thoughts?
> > 
> > Thanks for the report, I will take a look.
> > 
> > My first thought is that this fix may not be applied,
> > https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> > However, I see that that is in next-20220831.
> > 
> > Hopefully, this will recreate on x86.
> 
> One additional thought ...
> 
> With this patch, we will take the hugetlb fault mutex for EVERY index in the
> range being truncated or hole punched.  In the case of a very large file, that
> is no different than code today where we take the mutex when removing pages
> from the file.  What is different is taking the mutex for indices that are
> part of holes in the file.  Consider a very large file with only one page at
> the very large offset.  We would then take the mutex for each index in that
> very large hole.  Depending on the size of the hole, this could appear as a
> hang.
> 
> For the above locking scheme to work, we need to take the mutex for indices
> in holes in case there would happen to be a racing page fault.  However, there
> are only a limited number of fault mutexes (it is a table).  So, we only really
> need to take at a maximum num_fault_mutexes mutexes.  We could keep track of
> these with a bitmap.
> 
> I am not sure this is the issue you are seeing, but a test named
> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
> 
> In any case, I think this issue needs to be addressed before this series can
> move forward.

Well, even if we address the issue of taking the same mutex multiple times,
this new synchronization scheme requires a folio lookup for EVERY index in
the truncated or hole punched range.  This can easily 'stall' a CPU if there
is a really big hole in a file.  One can recreate this easily with fallocate
to add a single page to a file at a really big offset, and then remove the file.
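
For example, something along these lines (untested sketch; the hugetlbfs mount
point and sizes are only assumptions, and error checking is mostly omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/hugepages/stall-test", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;
	/* allocate a single (assumed 2MB) huge page ~32TB into the file */
	fallocate(fd, 0, 1ULL << 45, 2UL << 20);
	close(fd);
	/* eviction then walks every index in the preceding hole */
	unlink("/dev/hugepages/stall-test");
	return 0;
}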

I am trying to come up with another algorithm to make this work.

Andrew, I wanted to give you a heads-up that this series may need to be
pulled if I cannot come up with something quickly.
Miaohe Lin Sept. 7, 2022, 2:11 a.m. UTC | #8
On 2022/9/7 7:08, Mike Kravetz wrote:
> On 09/06/22 11:05, Mike Kravetz wrote:
>> On 09/06/22 09:48, Mike Kravetz wrote:
>>> On 09/06/22 15:57, Sven Schnelle wrote:
>>>> Hi Mike,
>>>>
>>>> Mike Kravetz <mike.kravetz@oracle.com> writes:
>>>>
>>>>> When page fault code needs to allocate and instantiate a new hugetlb
>>>>> page (huegtlb_no_page), it checks early to determine if the fault is
>>>>> beyond i_size.  When discovered early, it is easy to abort the fault and
>>>>> return an error.  However, it becomes much more difficult to handle when
>>>>> discovered later after allocating the page and consuming reservations
>>>>> and adding to the page cache.  Backing out changes in such instances
>>>>> becomes difficult and error prone.
>>>>>
>>>>> Instead of trying to catch and backout all such races, use the hugetlb
>>>>> fault mutex to handle truncate racing with page faults.  The most
>>>>> significant change is modification of the routine remove_inode_hugepages
>>>>> such that it will take the fault mutex for EVERY index in the truncated
>>>>> range (or hole in the case of hole punch).  Since remove_inode_hugepages
>>>>> is called in the truncate path after updating i_size, we can experience
>>>>> races as follows.
>>>>> - truncate code updates i_size and takes fault mutex before a racing
>>>>>   fault.  After fault code takes mutex, it will notice fault beyond
>>>>>   i_size and abort early.
>>>>> - fault code obtains mutex, and truncate updates i_size after early
>>>>>   checks in fault code.  fault code will add page beyond i_size.
>>>>>   When truncate code takes mutex for page/index, it will remove the
>>>>>   page.
>>>>> - truncate updates i_size, but fault code obtains mutex first.  If
>>>>>   fault code sees updated i_size it will abort early.  If fault code
>>>>>   does not see updated i_size, it will add page beyond i_size and
>>>>>   truncate code will remove page when it obtains fault mutex.
>>>>>
>>>>> Note, for performance reasons remove_inode_hugepages will still use
>>>>> filemap_get_folios for bulk folio lookups.  For indicies not returned in
>>>>> the bulk lookup, it will need to lookup individual folios to check for
>>>>> races with page fault.
>>>>>
>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>> ---
>>>>>  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
>>>>>  mm/hugetlb.c         |  41 +++++-----
>>>>>  2 files changed, 152 insertions(+), 73 deletions(-)
>>>>
>>>> With linux next starting from next-20220831 i see hangs with this
>>>> patch applied while running the glibc test suite. The patch doesn't
>>>> revert cleanly on top, so i checked out one commit before that one and
>>>> with that revision everything works.
>>>>
>>>> It looks like the malloc test suite in glibc triggers this. I cannot
>>>> identify a single test causing it, but instead the combination of
>>>> multiple tests. Running the test suite on a single CPU works. Given the
>>>> subject of the patch that's likely not a surprise.
>>>>
>>>> This is on s390, and the warning i get from RCU is:
>>>>
>>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
>>>> [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
>>>> [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
>>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
>>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
>>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
>>>> [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>>>> [ 1951.907095] Call Trace:
>>>> [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
>>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
>>>> [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
>>>> [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
>>>> [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
>>>> [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
>>>> [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
>>>> [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
>>>> [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
>>>> [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
>>>> [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
>>>> [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
>>>> [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
>>>> [ 1951.907145] Last Breaking-Event-Address:
>>>> [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
>>>>
>>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
>>>>
>>>> Any thoughts?
>>>
>>> Thanks for the report, I will take a look.
>>>
>>> My first thought is that this fix may not be applied,
>>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
>>> However, I see that that is in next-20220831.
>>>
>>> Hopefully, this will recreate on x86.
>>
>> One additional thought ...
>>
>> With this patch, we will take the hugetlb fault mutex for EVERY index in the
>> range being truncated or hole punched.  In the case of a very large file, that
>> is no different than code today where we take the mutex when removing pages
>> from the file.  What is different is taking the mutex for indices that are
>> part of holes in the file.  Consider a very large file with only one page at
>> the very large offset.  We would then take the mutex for each index in that
>> very large hole.  Depending on the size of the hole, this could appear as a
>> hang.
>>
>> For the above locking scheme to work, we need to take the mutex for indices
>> in holes in case there would happen to be a racing page fault.  However, there
>> are only a limited number of fault mutexes (it is a table).  So, we only really
>> need to take at a maximum num_fault_mutexes mutexes.  We could keep track of
>> these with a bitmap.
>>
>> I am not sure this is the issue you are seeing, but a test named
>> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
>>
>> In any case, I think this issue needs to be addressed before this series can
>> move forward.
> 
> Well, even if we address the issue of taking the same mutex multiple times,

Can we change to take all the hugetlb fault mutexes at the same time, to ensure every possible
future hugetlb page fault will see the truncated i_size? Then we could just drop all the hugetlb
fault mutexes before doing any heavy work. It seems the fault mutexes could be dropped once the
new i_size is guaranteed to be visible to any future hugetlb page fault user.
But I might be missing something...
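
Something like this rough sketch (illustration only, not tested):

	int i;

	/* hold every fault mutex across the i_size update so any later
	 * fault must observe the new size */
	for (i = 0; i < num_fault_mutexes; i++)
		mutex_lock(&hugetlb_fault_mutex_table[i]);
	i_size_write(inode, new_size);
	for (i = num_fault_mutexes - 1; i >= 0; i--)
		mutex_unlock(&hugetlb_fault_mutex_table[i]);
	/* ... then do the heavy per-page removal without holding them */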

> this new synchronization scheme requires a folio lookup for EVERY index in
> the truncated or hole punched range.  This can easily 'stall' a CPU if there

If the above thought holds, we could do batch folio lookups instead. Hope my thought helps. ;)

Thanks,
Miaohe Lin


> is a really big hole in a file.  One can recreate this easily with fallocate
> to add a single page to a file at a really big offset, and then remove the file.
> 
> I am trying to come up with another algorithm to make this work.
> 
> Andrew, I wanted to give you a heads up that this series may need to be
> pulled if I can not come up with something quickly.
>
Mike Kravetz Sept. 7, 2022, 2:37 a.m. UTC | #9
On 09/07/22 10:11, Miaohe Lin wrote:
> On 2022/9/7 7:08, Mike Kravetz wrote:
> > On 09/06/22 11:05, Mike Kravetz wrote:
> >> On 09/06/22 09:48, Mike Kravetz wrote:
> >>> On 09/06/22 15:57, Sven Schnelle wrote:
> >>>> Hi Mike,
> >>>>
> >>>> Mike Kravetz <mike.kravetz@oracle.com> writes:
> >>>>
> >>>>> When page fault code needs to allocate and instantiate a new hugetlb
> >>>>> page (huegtlb_no_page), it checks early to determine if the fault is
> >>>>> beyond i_size.  When discovered early, it is easy to abort the fault and
> >>>>> return an error.  However, it becomes much more difficult to handle when
> >>>>> discovered later after allocating the page and consuming reservations
> >>>>> and adding to the page cache.  Backing out changes in such instances
> >>>>> becomes difficult and error prone.
> >>>>>
> >>>>> Instead of trying to catch and backout all such races, use the hugetlb
> >>>>> fault mutex to handle truncate racing with page faults.  The most
> >>>>> significant change is modification of the routine remove_inode_hugepages
> >>>>> such that it will take the fault mutex for EVERY index in the truncated
> >>>>> range (or hole in the case of hole punch).  Since remove_inode_hugepages
> >>>>> is called in the truncate path after updating i_size, we can experience
> >>>>> races as follows.
> >>>>> - truncate code updates i_size and takes fault mutex before a racing
> >>>>>   fault.  After fault code takes mutex, it will notice fault beyond
> >>>>>   i_size and abort early.
> >>>>> - fault code obtains mutex, and truncate updates i_size after early
> >>>>>   checks in fault code.  fault code will add page beyond i_size.
> >>>>>   When truncate code takes mutex for page/index, it will remove the
> >>>>>   page.
> >>>>> - truncate updates i_size, but fault code obtains mutex first.  If
> >>>>>   fault code sees updated i_size it will abort early.  If fault code
> >>>>>   does not see updated i_size, it will add page beyond i_size and
> >>>>>   truncate code will remove page when it obtains fault mutex.
> >>>>>
> >>>>> Note, for performance reasons remove_inode_hugepages will still use
> >>>>> filemap_get_folios for bulk folio lookups.  For indicies not returned in
> >>>>> the bulk lookup, it will need to lookup individual folios to check for
> >>>>> races with page fault.
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >>>>> ---
> >>>>>  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> >>>>>  mm/hugetlb.c         |  41 +++++-----
> >>>>>  2 files changed, 152 insertions(+), 73 deletions(-)
> >>>>
> >>>> With linux next starting from next-20220831 i see hangs with this
> >>>> patch applied while running the glibc test suite. The patch doesn't
> >>>> revert cleanly on top, so i checked out one commit before that one and
> >>>> with that revision everything works.
> >>>>
> >>>> It looks like the malloc test suite in glibc triggers this. I cannot
> >>>> identify a single test causing it, but instead the combination of
> >>>> multiple tests. Running the test suite on a single CPU works. Given the
> >>>> subject of the patch that's likely not a surprise.
> >>>>
> >>>> This is on s390, and the warning i get from RCU is:
> >>>>
> >>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>> [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> >>>> [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> >>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> >>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> >>>> [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >>>> [ 1951.907095] Call Trace:
> >>>> [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> >>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> >>>> [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> >>>> [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> >>>> [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
> >>>> [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
> >>>> [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
> >>>> [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
> >>>> [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
> >>>> [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
> >>>> [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> >>>> [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> >>>> [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
> >>>> [ 1951.907145] Last Breaking-Event-Address:
> >>>> [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
> >>>>
> >>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> >>>>
> >>>> Any thoughts?
> >>>
> >>> Thanks for the report, I will take a look.
> >>>
> >>> My first thought is that this fix may not be applied,
> >>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> >>> However, I see that that is in next-20220831.
> >>>
> >>> Hopefully, this will recreate on x86.
> >>
> >> One additional thought ...
> >>
> >> With this patch, we will take the hugetlb fault mutex for EVERY index in the
> >> range being truncated or hole punched.  In the case of a very large file, that
> >> is no different than code today where we take the mutex when removing pages
> >> from the file.  What is different is taking the mutex for indices that are
> >> part of holes in the file.  Consider a very large file with only one page at
> >> the very large offset.  We would then take the mutex for each index in that
> >> very large hole.  Depending on the size of the hole, this could appear as a
> >> hang.
> >>
> >> For the above locking scheme to work, we need to take the mutex for indices
> >> in holes in case there would happen to be a racing page fault.  However, there
> >> are only a limited number of fault mutexes (it is a table).  So, we only really
> >> need to take at a maximum num_fault_mutexes mutexes.  We could keep track of
> >> these with a bitmap.
> >>
> >> I am not sure this is the issue you are seeing, but a test named
> >> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
> >>
> >> In any case, I think this issue needs to be addressed before this series can
> >> move forward.
> > 
> > Well, even if we address the issue of taking the same mutex multiple times,
> 
> Can we change to take all the hugetlb fault mutex at the same time to ensure every possible
> future hugetlb page fault will see a truncated i_size? Then we could just drop all the hugetlb
> fault mutex before doing any heavy stuff? It seems hugetlb fault mutex could be dropped when
> new i_size is guaranteed to be visible for any future hugetlb page fault users?
> But I might miss something...

Yes, that is the general direction and would work well for truncation.  However,
the same routine remove_inode_hugepages is used for hole punch, and I am pretty
sure we want to take the fault mutex there as it can race with page faults.

> 
> > this new synchronization scheme requires a folio lookup for EVERY index in
> > the truncated or hole punched range.  This can easily 'stall' a CPU if there
> 
> If above thought holds, we could do batch folio lookup instead. Hopes my thought will help. ;)
> 

Yes, I have some promising POC code with two batch lookups in case of holes.
Hope to send something soon.
Miaohe Lin Sept. 7, 2022, 3:07 a.m. UTC | #10
On 2022/9/7 10:37, Mike Kravetz wrote:
> On 09/07/22 10:11, Miaohe Lin wrote:
>> On 2022/9/7 7:08, Mike Kravetz wrote:
>>> On 09/06/22 11:05, Mike Kravetz wrote:
>>>> On 09/06/22 09:48, Mike Kravetz wrote:
>>>>> On 09/06/22 15:57, Sven Schnelle wrote:
>>>>>> Hi Mike,
>>>>>>
>>>>>> Mike Kravetz <mike.kravetz@oracle.com> writes:
>>>>>>
>>>>>>> When page fault code needs to allocate and instantiate a new hugetlb
>>>>>>> page (huegtlb_no_page), it checks early to determine if the fault is
>>>>>>> beyond i_size.  When discovered early, it is easy to abort the fault and
>>>>>>> return an error.  However, it becomes much more difficult to handle when
>>>>>>> discovered later after allocating the page and consuming reservations
>>>>>>> and adding to the page cache.  Backing out changes in such instances
>>>>>>> becomes difficult and error prone.
>>>>>>>
>>>>>>> Instead of trying to catch and backout all such races, use the hugetlb
>>>>>>> fault mutex to handle truncate racing with page faults.  The most
>>>>>>> significant change is modification of the routine remove_inode_hugepages
>>>>>>> such that it will take the fault mutex for EVERY index in the truncated
>>>>>>> range (or hole in the case of hole punch).  Since remove_inode_hugepages
>>>>>>> is called in the truncate path after updating i_size, we can experience
>>>>>>> races as follows.
>>>>>>> - truncate code updates i_size and takes fault mutex before a racing
>>>>>>>   fault.  After fault code takes mutex, it will notice fault beyond
>>>>>>>   i_size and abort early.
>>>>>>> - fault code obtains mutex, and truncate updates i_size after early
>>>>>>>   checks in fault code.  fault code will add page beyond i_size.
>>>>>>>   When truncate code takes mutex for page/index, it will remove the
>>>>>>>   page.
>>>>>>> - truncate updates i_size, but fault code obtains mutex first.  If
>>>>>>>   fault code sees updated i_size it will abort early.  If fault code
>>>>>>>   does not see updated i_size, it will add page beyond i_size and
>>>>>>>   truncate code will remove page when it obtains fault mutex.
>>>>>>>
>>>>>>> Note, for performance reasons remove_inode_hugepages will still use
>>>>>>> filemap_get_folios for bulk folio lookups.  For indicies not returned in
>>>>>>> the bulk lookup, it will need to lookup individual folios to check for
>>>>>>> races with page fault.
>>>>>>>
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>>> ---
>>>>>>>  fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
>>>>>>>  mm/hugetlb.c         |  41 +++++-----
>>>>>>>  2 files changed, 152 insertions(+), 73 deletions(-)
>>>>>>
>>>>>> With linux next starting from next-20220831 i see hangs with this
>>>>>> patch applied while running the glibc test suite. The patch doesn't
>>>>>> revert cleanly on top, so i checked out one commit before that one and
>>>>>> with that revision everything works.
>>>>>>
>>>>>> It looks like the malloc test suite in glibc triggers this. I cannot
>>>>>> identify a single test causing it, but instead the combination of
>>>>>> multiple tests. Running the test suite on a single CPU works. Given the
>>>>>> subject of the patch that's likely not a surprise.
>>>>>>
>>>>>> This is on s390, and the warning i get from RCU is:
>>>>>>
>>>>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
>>>>>> [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
>>>>>> [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
>>>>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
>>>>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
>>>>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
>>>>>> [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>>>>>> [ 1951.907095] Call Trace:
>>>>>> [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
>>>>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
>>>>>> [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
>>>>>> [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
>>>>>> [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
>>>>>> [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
>>>>>> [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
>>>>>> [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
>>>>>> [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
>>>>>> [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
>>>>>> [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
>>>>>> [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
>>>>>> [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
>>>>>> [ 1951.907145] Last Breaking-Event-Address:
>>>>>> [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
>>>>>>
>>>>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
>>>>>>
>>>>>> Any thoughts?
>>>>>
>>>>> Thanks for the report, I will take a look.
>>>>>
>>>>> My first thought is that this fix may not be applied,
>>>>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
>>>>> However, I see that that is in next-20220831.
>>>>>
>>>>> Hopefully, this will recreate on x86.
>>>>
>>>> One additional thought ...
>>>>
>>>> With this patch, we will take the hugetlb fault mutex for EVERY index in the
>>>> range being truncated or hole punched.  In the case of a very large file, that
>>>> is no different than code today where we take the mutex when removing pages
>>>> from the file.  What is different is taking the mutex for indices that are
>>>> part of holes in the file.  Consider a very large file with only one page at
>>>> the very large offset.  We would then take the mutex for each index in that
>>>> very large hole.  Depending on the size of the hole, this could appear as a
>>>> hang.
>>>>
>>>> For the above locking scheme to work, we need to take the mutex for indices
>>>> in holes in case there would happen to be a racing page fault.  However, there
>>>> are only a limited number of fault mutexes (it is a table).  So, we only really
>>>> need to take at a maximum num_fault_mutexes mutexes.  We could keep track of
>>>> these with a bitmap.
>>>>
>>>> I am not sure this is the issue you are seeing, but a test named
>>>> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
>>>>
>>>> In any case, I think this issue needs to be addressed before this series can
>>>> move forward.
>>>
>>> Well, even if we address the issue of taking the same mutex multiple times,
>>
>> Can we change to take all the hugetlb fault mutex at the same time to ensure every possible
>> future hugetlb page fault will see a truncated i_size? Then we could just drop all the hugetlb
>> fault mutex before doing any heavy stuff? It seems hugetlb fault mutex could be dropped when
>> new i_size is guaranteed to be visible for any future hugetlb page fault users?
>> But I might miss something...
> 
> Yes, that is the general direction and would work well for truncation.  However,
> the same routine remove_inode_hugepages is used for hole punch, and I am pretty
> sure we want to take the fault mutex there as it can race with page faults.

Oh, sorry. I missed that case.

> 
>>
>>> this new synchronization scheme requires a folio lookup for EVERY index in
>>> the truncated or hole punched range.  This can easily 'stall' a CPU if there
>>
>> If above thought holds, we could do batch folio lookup instead. Hopes my thought will help. ;)
>>
> 
> Yes, I have some promising POC code with two batch lookups in case of holes.
> Hope to send something soon.

That will be really nice. ;)

Thanks,
Miaohe Lin
Mike Kravetz Sept. 7, 2022, 3:30 a.m. UTC | #11
On 09/07/22 11:07, Miaohe Lin wrote:
> On 2022/9/7 10:37, Mike Kravetz wrote:
> > On 09/07/22 10:11, Miaohe Lin wrote:
> >> On 2022/9/7 7:08, Mike Kravetz wrote:
> >>> On 09/06/22 11:05, Mike Kravetz wrote:
> >>>> On 09/06/22 09:48, Mike Kravetz wrote:
> >>>>> On 09/06/22 15:57, Sven Schnelle wrote:
> >>>>>>
> >>>>>> With linux next starting from next-20220831 i see hangs with this
> >>>>>> patch applied while running the glibc test suite. The patch doesn't
> >>>>>> revert cleanly on top, so i checked out one commit before that one and
> >>>>>> with that revision everything works.
> >>>>>>
> >>>>>> It looks like the malloc test suite in glibc triggers this. I cannot
> >>>>>> identify a single test causing it, but instead the combination of
> >>>>>> multiple tests. Running the test suite on a single CPU works. Given the
> >>>>>> subject of the patch that's likely not a surprise.
> >>>>>>
> >>>>>> This is on s390, and the warning i get from RCU is:
> >>>>>>
> >>>>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>>>> [ 1951.907009] rcu:     60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> >>>>>> [ 1951.907018]  (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> >>>>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> >>>>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >>>>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> >>>>>> [ 1951.907044]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >>>>>> [ 1951.907095] Call Trace:
> >>>>>> [ 1951.907098]  [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> >>>>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> >>>>>> [ 1951.907107]  [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> >>>>>> [ 1951.907109]  [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> >>>>>> [ 1951.907111]  [<000000000044fe96>] evict+0xe6/0x1c0
> >>>>>> [ 1951.907116]  [<000000000044a608>] __dentry_kill+0x108/0x1e0
> >>>>>> [ 1951.907119]  [<000000000044ac64>] dentry_kill+0x6c/0x290
> >>>>>> [ 1951.907121]  [<000000000044afec>] dput+0x164/0x1c0
> >>>>>> [ 1951.907123]  [<000000000042a4d6>] __fput+0xee/0x290
> >>>>>> [ 1951.907127]  [<00000000001794a8>] task_work_run+0x88/0xe0
> >>>>>> [ 1951.907133]  [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> >>>>>> [ 1951.907137]  [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> >>>>>> [ 1951.907142]  [<0000000000d1d392>] system_call+0x82/0xb0
> >>>>>> [ 1951.907145] Last Breaking-Event-Address:
> >>>>>> [ 1951.907146]  [<0000038001d839c0>] 0x38001d839c0
> >>>>>>
> >>>>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> >>>>>>
> >>>>>> Any thoughts?
> >>>>>
> >>>>> Thanks for the report, I will take a look.
> >>>>>
> >>>>> My first thought is that this fix may not be applied,
> >>>>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> >>>>> However, I see that that is in next-20220831.
> >>>>>
> >>>>> Hopefully, this will recreate on x86.
> >>>>
> >>>> One additional thought ...
> >>>>
> >>>> With this patch, we will take the hugetlb fault mutex for EVERY index in the
> >>>> range being truncated or hole punched.  In the case of a very large file, that
> >>>> is no different than code today where we take the mutex when removing pages
> >>>> from the file.  What is different is taking the mutex for indices that are
> >>>> part of holes in the file.  Consider a very large file with only one page at
> >>>> the very large offset.  We would then take the mutex for each index in that
> >>>> very large hole.  Depending on the size of the hole, this could appear as a
> >>>> hang.
> >>>>
> >>>> For the above locking scheme to work, we need to take the mutex for indices
> >>>> in holes in case there would happen to be a racing page fault.  However, there
> >>>> are only a limited number of fault mutexes (it is a table).  So, we only really
> >>>> need to take at a maximum num_fault_mutexes mutexes.  We could keep track of
> >>>> these with a bitmap.
> >>>>
> >>>> I am not sure this is the issue you are seeing, but a test named
> >>>> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
> >>>>
> >>>> In any case, I think this issue needs to be addressed before this series can
> >>>> move forward.
> >>>
> >>> Well, even if we address the issue of taking the same mutex multiple times,
> >>
> >> Can we change to take all the hugetlb fault mutex at the same time to ensure every possible
> >> future hugetlb page fault will see a truncated i_size? Then we could just drop all the hugetlb
> >> fault mutex before doing any heavy stuff? It seems hugetlb fault mutex could be dropped when
> >> new i_size is guaranteed to be visible for any future hugetlb page fault users?
> >> But I might miss something...
> > 
> > Yes, that is the general direction and would work well for truncation.  However,
> > the same routine remove_inode_hugepages is used for hole punch, and I am pretty
> > sure we want to take the fault mutex there as it can race with page faults.
> 
> Oh, sorry. I missed that case.
> 
> > 
> >>
> >>> this new synchronization scheme requires a folio lookup for EVERY index in
> >>> the truncated or hole punched range.  This can easily 'stall' a CPU if there
> >>
> >> If above thought holds, we could do batch folio lookup instead. Hopes my thought will help. ;)
> >>
> > 
> > Yes, I have some promising POC code with two batch lookups in case of holes.
> > Hope to send something soon.
> 
> That will be really nice. ;)
> 

Hi Sven,

Would you be willing to try the patch below in your environment?
It addresses the stall I can create with a file that has a VERY large hole.
In addition, it passes libhugetlbfs tests and has run for a while in my
truncate/page fault race stress test.  However, it is very early code.
It would be nice to see if it addresses the issue in your environment.


From 10c58195db9ed8aeff84c63ea2baf6c007651e42 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 6 Sep 2022 19:59:47 -0700
Subject: [PATCH] hugetlb: redo remove_inode_hugepages algorithm for racing
 page faults

Previously, remove_inode_hugepages would take the fault mutex for EVERY
index in a file and look for a folio.  This included holes in the file.
For very large sparse files, this could result in minutes (or more) of
kernel time.  Taking the fault mutex for every index is a requirement if
we want to use it for synchronization with page faults.

This patch adjusts the algorithm slightly to take large holes into
account.  It tracks which fault mutexes have been taken so that it will
not needlessly take the same mutex more than once.  Also, we skip
looking for folios in holes.  Instead, we make a second scan of the file
with a bulk lookup.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 70 ++++++++++++++++++++++-------------------
 include/linux/hugetlb.h | 16 ++++++++++
 mm/hugetlb.c            | 48 ++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f1d6da1bafb..ce1eb6202179 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -578,37 +578,27 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
 
 /*
  * Take hugetlb fault mutex for a set of inode indicies.
- * Check for and remove any found folios.  Return the number of
- * any removed folios.
- *
  */
-static long fault_lock_inode_indicies(struct hstate *h,
+static void fault_lock_inode_indicies(struct hstate *h,
 					struct inode *inode,
 					struct address_space *mapping,
 					pgoff_t start, pgoff_t end,
-					bool truncate_op)
+					bool truncate_op,
+					struct hugetlb_fault_mutex_track *hfmt)
 {
-	struct folio *folio;
-	long freed = 0;
+	long holes = 0;
 	pgoff_t index;
 	u32 hash;
 
-	for (index = start; index < end; index++) {
+	for (index = start; index < end &&
+			    !hugetlb_fault_mutex_track_all_set(hfmt);
+			    index++) {
 		hash = hugetlb_fault_mutex_hash(mapping, index);
-		mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
-		folio = filemap_get_folio(mapping, index);
-		if (folio) {
-			if (remove_inode_single_folio(h, inode, mapping, folio,
-							index, truncate_op))
-				freed++;
-			folio_put(folio);
-		}
-
-		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		hugetlb_fault_mutex_track_lock(hfmt, hash, false);
+		hugetlb_fault_mutex_track_unlock(hfmt, hash, false);
+		holes++;
+		cond_resched();
 	}
-
-	return freed;
 }
 
 /*
@@ -656,7 +646,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 	long freed = 0;
 	u32 hash;
 	bool truncate_op = (lend == LLONG_MAX);
+	struct hugetlb_fault_mutex_track *hfmt;
+	bool rescan = true;
+	unsigned long holes;
 
+	hfmt = hugetlb_fault_mutex_track_alloc();
+
+rescan:
+	holes = 0;
 	folio_batch_init(&fbatch);
 	next = m_start = start;
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
@@ -665,36 +662,45 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 			index = folio->index;
 			/*
-			 * Take fault mutex for missing folios before index,
-			 * while checking folios that might have been added
-			 * due to a race with fault code.
+			 * Take fault mutex for missing folios before index
 			 */
-			freed += fault_lock_inode_indicies(h, inode, mapping,
-						m_start, index, truncate_op);
+			holes += (index - m_start);
+			if (rescan)	/* no need on second pass */
+				fault_lock_inode_indicies(h, inode, mapping,
+					m_start, index, truncate_op, hfmt);
 
 			/*
 			 * Remove folio that was part of folio_batch.
+			 * Force taking fault mutex.
 			 */
 			hash = hugetlb_fault_mutex_hash(mapping, index);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			hugetlb_fault_mutex_track_lock(hfmt, hash, true);
 			if (remove_inode_single_folio(h, inode, mapping, folio,
 							index, truncate_op))
 				freed++;
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			hugetlb_fault_mutex_track_unlock(hfmt, hash, true);
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
 	}
 
 	/*
-	 * Take fault mutex for missing folios at end of range while checking
-	 * for folios that might have been added due to a race with fault code.
+	 * Take fault mutex for missing folios at end of range
 	 */
-	freed += fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
-						truncate_op);
+	holes += (m_end - m_start);
+	if (rescan)
+		fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
+						truncate_op, hfmt);
+
+	if (holes && rescan) {
+		rescan = false;
+		goto rescan;	/* can happen at most once */
+	}
 
 	if (truncate_op)
 		(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
+
+	hugetlb_fault_mutex_track_free(hfmt);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 852f911d676e..dc532d2e42d0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -180,8 +180,24 @@ void putback_active_hugepage(struct page *page);
 void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
+
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
+struct hugetlb_fault_mutex_track {
+	bool all_set;
+	unsigned long *track_bits;
+};
+struct hugetlb_fault_mutex_track *hugetlb_fault_mutex_track_alloc(void);
+void hugetlb_fault_mutex_track_lock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force);
+void hugetlb_fault_mutex_track_unlock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force);
+static inline bool hugetlb_fault_mutex_track_all_set(
+				struct hugetlb_fault_mutex_track *hfmt)
+{
+	return hfmt->all_set;
+}
+void hugetlb_fault_mutex_track_free(struct hugetlb_fault_mutex_track *hfmt);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d0617d64d718..d9dfc4736928 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5788,6 +5788,54 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
 }
 #endif
 
+struct hugetlb_fault_mutex_track *hugetlb_fault_mutex_track_alloc(void)
+{
+	struct hugetlb_fault_mutex_track *hfmt;
+
+	hfmt = kmalloc(ALIGN(sizeof(struct hugetlb_fault_mutex_track),
+					sizeof(unsigned long)) +
+			sizeof(unsigned long) *
+					BITS_TO_LONGS(num_fault_mutexes),
+			GFP_KERNEL);
+	if (hfmt) {
+		hfmt->track_bits = (void *)hfmt +
+			ALIGN(sizeof(struct hugetlb_fault_mutex_track),
+					sizeof(unsigned long));
+
+		hfmt->all_set = false;
+		bitmap_zero(hfmt->track_bits, num_fault_mutexes);
+	}
+
+	return hfmt;
+}
+
+void hugetlb_fault_mutex_track_lock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force)
+{
+	if (!hfmt || !test_bit((int)hash, hfmt->track_bits) || force) {
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+		/* set bit when unlocking */
+	}
+}
+
+void hugetlb_fault_mutex_track_unlock(struct hugetlb_fault_mutex_track *hfmt,
+				u32 hash, bool force)
+{
+	if (!hfmt || !test_bit((int)hash, hfmt->track_bits) || force) {
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		if (hfmt && !hfmt->all_set) {
+			set_bit((int)hash, hfmt->track_bits);
+			if (bitmap_full(hfmt->track_bits, num_fault_mutexes))
+				hfmt->all_set = true;
+		}
+	}
+}
+
+void hugetlb_fault_mutex_track_free(struct hugetlb_fault_mutex_track *hfmt)
+{
+	kfree(hfmt);
+}
+
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags)
 {
Sven Schnelle Sept. 7, 2022, 8:22 a.m. UTC | #12
Mike Kravetz <mike.kravetz@oracle.com> writes:

> Would you be willing to try the patch below in your environment?
> It addresses the stall I can create with a file that has a VERY large hole.
> In addition, it passes libhugetlbfs tests and has run for a while in my
> truncate/page fault race stress test.  However, it is very early code.
> It would be nice to see if it addresses the issue in your environment.

Yes, that fixes the issue for me. I added some debugging yesterday
evening after sending the initial report, and the end value in the loop
was indeed quite large - I didn't record the exact number, but it was
something like 0xffffffffff800001. Feel free to add my Tested-by.

Thanks
Sven
Mike Kravetz Sept. 7, 2022, 2:50 p.m. UTC | #13
On 09/07/22 10:22, Sven Schnelle wrote:
> Mike Kravetz <mike.kravetz@oracle.com> writes:
> 
> > Would you be willing to try the patch below in your environment?
> > It addresses the stall I can create with a file that has a VERY large hole.
> > In addition, it passes libhugetlbfs tests and has run for a while in my
> > truncate/page fault race stress test.  However, it is very early code.
> > It would be nice to see if it addresses the issue in your environment.
> 
> Yes, that fixes the issue for me. I added some debugging yesterday
> evening after sending the initial report, and the end value in the loop
> was indeed quite large - i didn't record the exact number, but it was
> something like 0xffffffffff800001. Feel free to add my Tested-by.
> 

Thank you!

When thinking about this some more, the new vma_lock introduced by this series
may address truncation/fault races without the need to involve the fault
mutex.

How?

Before truncating or hole punching, we need to unmap all users of that range.
To unmap, we need to acquire the vma_lock for each vma mapping the file.  This
same lock is acquired in the page fault path.  As such, it provides the same
type of synchronization around i_size as provided by the fault mutex in this
patch.  So, I think we can make the code much simpler (and faster) by removing
the code taking the fault mutex for holes in files.  Of course, this cannot
happen until the vma_lock is actually put into use, which is done in the last
patch of this series.
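
To make that ordering concrete, the following is a minimal userspace model of
the scheme; it is a sketch only, not kernel code and not part of this series.
A single pthread mutex stands in for the per-vma vma_lock, an atomic counter
stands in for i_size, and all names are illustrative assumptions.  The point is
only the ordering: a fault either completes before the unmap takes the lock
(and its page is removed by the later folio scan), or it runs after the unmap
and observes the reduced size.

/*
 * Userspace model of the vma_lock ordering described above.
 * Illustrative only: one mutex stands in for the per-vma vma_lock,
 * an atomic counter stands in for i_size.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t vma_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic long i_size_pages = 8;	/* "file size" in huge pages */

/* Fault path: check i_size and "instantiate" the page under vma_lock. */
static int fault_in(long index)
{
	int ret = -1;			/* -1 models aborting the fault */

	pthread_mutex_lock(&vma_lock);
	if (index < atomic_load(&i_size_pages))
		ret = 0;		/* page would be added here */
	pthread_mutex_unlock(&vma_lock);
	return ret;
}

/* Truncate path: shrink i_size, then take vma_lock to "unmap" the range. */
static void truncate_to(long new_size_pages)
{
	atomic_store(&i_size_pages, new_size_pages);
	pthread_mutex_lock(&vma_lock);
	/*
	 * Unmap of the truncated range happens here.  A fault that
	 * completed before this point is undone by the unmap plus the
	 * later folio removal; a fault after it sees the reduced size.
	 */
	pthread_mutex_unlock(&vma_lock);
	/* folio removal can now proceed without visiting hole indices */
}

int main(void)
{
	truncate_to(2);
	printf("fault at 1: %d, fault at 5: %d\n", fault_in(1), fault_in(5));
	return 0;
}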

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d98c6edbd1a4..e83fd31671b3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -411,6 +411,95 @@  hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 	}
 }
 
+/*
+ * Called with hugetlb fault mutex held.
+ * Returns true if page was actually removed, false otherwise.
+ */
+static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
+					struct address_space *mapping,
+					struct folio *folio, pgoff_t index,
+					bool truncate_op)
+{
+	bool ret = false;
+
+	/*
+	 * If folio is mapped, it was faulted in after being
+	 * unmapped in caller.  Unmap (again) while holding
+	 * the fault mutex.  The mutex will prevent faults
+	 * until we finish removing the folio.
+	 */
+	if (unlikely(folio_mapped(folio))) {
+		i_mmap_lock_write(mapping);
+		hugetlb_vmdelete_list(&mapping->i_mmap,
+					index * pages_per_huge_page(h),
+					(index + 1) * pages_per_huge_page(h),
+					ZAP_FLAG_DROP_MARKER);
+		i_mmap_unlock_write(mapping);
+	}
+
+	folio_lock(folio);
+	/*
+	 * After locking page, make sure mapping is the same.
+	 * We could have raced with page fault populate and
+	 * backout code.
+	 */
+	if (folio_mapping(folio) == mapping) {
+		/*
+		 * We must remove the folio from page cache before removing
+		 * the region/ reserve map (hugetlb_unreserve_pages).  In
+		 * rare out of memory conditions, removal of the region/reserve
+		 * map could fail.  Correspondingly, the subpool and global
+		 * reserve usage count can need to be adjusted.
+		 */
+		VM_BUG_ON(HPageRestoreReserve(&folio->page));
+		hugetlb_delete_from_page_cache(&folio->page);
+		ret = true;
+		if (!truncate_op) {
+			if (unlikely(hugetlb_unreserve_pages(inode, index,
+								index + 1, 1)))
+				hugetlb_fix_reserve_counts(inode);
+		}
+	}
+
+	folio_unlock(folio);
+	return ret;
+}
+
+/*
+ * Take hugetlb fault mutex for a set of inode indicies.
+ * Check for and remove any found folios.  Return the number of
+ * any removed folios.
+ *
+ */
+static long fault_lock_inode_indicies(struct hstate *h,
+					struct inode *inode,
+					struct address_space *mapping,
+					pgoff_t start, pgoff_t end,
+					bool truncate_op)
+{
+	struct folio *folio;
+	long freed = 0;
+	pgoff_t index;
+	u32 hash;
+
+	for (index = start; index < end; index++) {
+		hash = hugetlb_fault_mutex_hash(mapping, index);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+		folio = filemap_get_folio(mapping, index);
+		if (folio) {
+			if (remove_inode_single_folio(h, inode, mapping, folio,
+							index, truncate_op))
+				freed++;
+			folio_put(folio);
+		}
+
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	}
+
+	return freed;
+}
+
 /*
  * remove_inode_hugepages handles two distinct cases: truncation and hole
  * punch.  There are subtle differences in operation for each case.
@@ -418,11 +507,10 @@  hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  * truncation is indicated by end of range being LLONG_MAX
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() prevents page faults in the
- *	truncated range.  It checks i_size before allocation, and again after
- *	with the page table lock for the page held.  The same lock must be
- *	acquired to unmap a page.
+ *	maps and global counts.  Page faults can race with truncation.
+ *	During faults, hugetlb_no_page() checks i_size before page allocation,
+ *	and again after	obtaining page table lock.  It will 'back out'
+ *	allocations in the truncated range.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -431,75 +519,69 @@  hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  *	This is indicated if we find a mapped page.
  * Note: If the passed end of range value is beyond the end of file, but
  * not LLONG_MAX this routine still performs a hole punch operation.
+ *
+ * Since page faults can race with this routine, care must be taken as both
+ * modify huge page reservation data.  To somewhat synchronize these operations
+ * the hugetlb fault mutex is taken for EVERY index in the range to be hole
+ * punched or truncated.  In this way, we KNOW either:
+ * - fault code has added a page beyond i_size, and we will remove here
+ * - fault code will see updated i_size and not add a page beyond
+ * The parameter 'lm_end' indicates the offset of the end of hole or file
+ * before truncation.  For hole punch lm_end == lend.
  */
 static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
-				   loff_t lend)
+				   loff_t lend, loff_t lm_end)
 {
 	struct hstate *h = hstate_inode(inode);
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
 	const pgoff_t end = lend >> huge_page_shift(h);
+	pgoff_t m_end = lm_end >> huge_page_shift(h);
+	pgoff_t m_start, m_index;
 	struct folio_batch fbatch;
+	struct folio *folio;
 	pgoff_t next, index;
-	int i, freed = 0;
+	unsigned int i;
+	long freed = 0;
+	u32 hash;
 	bool truncate_op = (lend == LLONG_MAX);
 
 	folio_batch_init(&fbatch);
-	next = start;
+	next = m_start = start;
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
 		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
-			struct folio *folio = fbatch.folios[i];
-			u32 hash = 0;
+			folio = fbatch.folios[i];
 
 			index = folio->index;
-			hash = hugetlb_fault_mutex_hash(mapping, index);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
 			/*
-			 * If folio is mapped, it was faulted in after being
-			 * unmapped in caller.  Unmap (again) now after taking
-			 * the fault mutex.  The mutex will prevent faults
-			 * until we finish removing the folio.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
+			 * Take fault mutex for missing folios before index,
+			 * while checking folios that might have been added
+			 * due to a race with fault code.
 			 */
-			if (unlikely(folio_mapped(folio))) {
-				BUG_ON(truncate_op);
-
-				i_mmap_lock_write(mapping);
-				hugetlb_vmdelete_list(&mapping->i_mmap,
-					index * pages_per_huge_page(h),
-					(index + 1) * pages_per_huge_page(h),
-					ZAP_FLAG_DROP_MARKER);
-				i_mmap_unlock_write(mapping);
-			}
+			freed += fault_lock_inode_indicies(h, inode, mapping,
+						m_start, m_index, truncate_op);
 
-			folio_lock(folio);
 			/*
-			 * We must free the huge page and remove from page
-			 * cache BEFORE removing the * region/reserve map
-			 * (hugetlb_unreserve_pages).  In rare out of memory
-			 * conditions, removal of the region/reserve map could
-			 * fail. Correspondingly, the subpool and global
-			 * reserve usage count can need to be adjusted.
+			 * Remove folio that was part of folio_batch.
 			 */
-			VM_BUG_ON(HPageRestoreReserve(&folio->page));
-			hugetlb_delete_from_page_cache(&folio->page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
-							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
-			}
-
-			folio_unlock(folio);
+			hash = hugetlb_fault_mutex_hash(mapping, index);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			if (remove_inode_single_folio(h, inode, mapping, folio,
+							index, truncate_op))
+				freed++;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
 	}
 
+	/*
+	 * Take fault mutex for missing folios at end of range while checking
+	 * for folios that might have been added due to a race with fault code.
+	 */
+	freed += fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
+						truncate_op);
+
 	if (truncate_op)
 		(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
 }
@@ -507,8 +589,9 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	struct resv_map *resv_map;
+	loff_t prev_size = i_size_read(inode);
 
-	remove_inode_hugepages(inode, 0, LLONG_MAX);
+	remove_inode_hugepages(inode, 0, LLONG_MAX, prev_size);
 
 	/*
 	 * Get the resv_map from the address space embedded in the inode.
@@ -528,6 +611,7 @@  static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	loff_t prev_size = i_size_read(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
@@ -538,7 +622,7 @@  static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
 	i_mmap_unlock_write(mapping);
-	remove_inode_hugepages(inode, offset, LLONG_MAX);
+	remove_inode_hugepages(inode, offset, LLONG_MAX, prev_size);
 }
 
 static void hugetlbfs_zero_partial_page(struct hstate *h,
@@ -610,7 +694,7 @@  static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	/* Remove full pages from the file. */
 	if (hole_end > hole_start)
-		remove_inode_hugepages(inode, hole_start, hole_end);
+		remove_inode_hugepages(inode, hole_start, hole_end, hole_end);
 
 	inode_unlock(inode);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 11c02513588c..a6eb46c64baf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5527,6 +5527,7 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool reserve_alloc = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5584,9 +5585,13 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (HPageRestoreReserve(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err = hugetlb_add_to_page_cache(page, mapping, idx);
+			int err;
+
+			err = hugetlb_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				restore_reserve_on_error(h, vma, haddr, page);
 				put_page(page);
@@ -5642,10 +5647,6 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto backout;
-
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5689,10 +5690,18 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
-	unlock_page(page);
-	/* restore reserve for newly allocated pages not in page cache */
-	if (new_page && !new_pagecache_page)
+	if (new_page && !new_pagecache_page) {
+		/*
+		 * If reserve was consumed, make sure flag is set so that it
+		 * will be restored in free_huge_page().
+		 */
+		if (reserve_alloc)
+			SetHPageRestoreReserve(page);
+
 		restore_reserve_on_error(h, vma, haddr, page);
+	}
+
+	unlock_page(page);
 	put_page(page);
 	goto out;
 }
@@ -6006,26 +6015,12 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);
 
-	/*
-	 * Recheck the i_size after holding PT lock to make sure not
-	 * to leave any page mapped (as page_mapped()) beyond the end
-	 * of the i_size (remove_inode_hugepages() is strict about
-	 * enforcing that). If we bail out here, we'll also leave a
-	 * page in the radix tree in the vm_shared case beyond the end
-	 * of the i_size, but remove_inode_hugepages() will take care
-	 * of it as soon as we drop the hugetlb_fault_mutex_table.
-	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	ret = -EFAULT;
-	if (idx >= size)
-		goto out_release_unlock;
-
-	ret = -EEXIST;
 	/*
 	 * We allow to overwrite a pte marker: consider when both MISSING|WP
 	 * registered, we firstly wr-protect a none pte which has no page cache
 	 * page backing it, then access the page.
 	 */
+	ret = -EEXIST;
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;