
[v9,03/10] mm/rmap: Split try_to_munlock from try_to_unmap

Message ID 20210524132725.12697-4-apopple@nvidia.com (mailing list archive)
State New, archived
Series Add support for SVM atomics in Nouveau

Commit Message

Alistair Popple May 24, 2021, 1:27 p.m. UTC
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However, it is exclusively used by
try_to_munlock(), which specifies no other flags. Therefore, rather than
overload try_to_unmap_one() with unrelated behaviour, split this out into
its own function and remove the flag.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

---

v9:
* Improved comments

v8:
* Renamed try_to_munlock to page_mlock to better reflect what the
  function actually does.
* Removed the TODO from the documentation that this patch addresses.

v7:
* Added Christoph's Reviewed-by

v4:
* Removed redundant check for VM_LOCKED
---
 Documentation/vm/unevictable-lru.rst | 33 ++++++---------
 include/linux/rmap.h                 |  3 +-
 mm/mlock.c                           | 10 ++---
 mm/rmap.c                            | 61 ++++++++++++++++++++--------
 4 files changed, 63 insertions(+), 44 deletions(-)
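
In short, munlock no longer funnels through try_to_unmap_one() with a
special flag but gets its own rmap_walk() callback. Below is an
illustrative sketch of that callback contract, condensed from the patch
itself (the sketch_* names are placeholders and the page_vma_mapped_walk()
step is elided): returning true continues the scan with the next VMA,
returning false terminates the walk early.

/*
 * Sketch only, not part of the patch: the real page_mlock_one() in the
 * mm/rmap.c hunk below also runs page_vma_mapped_walk() to confirm the
 * page is actually mapped into this VMA before mlocking it.
 */
static bool sketch_mlock_one(struct page *page, struct vm_area_struct *vma,
			     unsigned long address, void *unused)
{
	/* An un-locked vma has nothing to mlock; scan the next one */
	if (!(vma->vm_flags & VM_LOCKED))
		return true;

	/* PTE-mapped THP are never mlocked */
	if (!PageTransCompound(page))
		mlock_vma_page(page);

	/* One locked vma is enough; terminate the walk */
	return false;
}

static void sketch_page_mlock(struct page *page)
{
	struct rmap_walk_control rwc = {
		.rmap_one = sketch_mlock_one,	/* was try_to_unmap_one + TTU_MUNLOCK */
		.done = page_not_mapped,
		.anon_lock = page_lock_anon_vma_read,
	};

	rmap_walk(page, &rwc);
}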

Comments

Liam R. Howlett May 25, 2021, 6:39 p.m. UTC | #1
* Alistair Popple <apopple@nvidia.com> [210524 09:29]:
> The behaviour of try_to_unmap_one() is difficult to follow because it
> performs different operations based on a fairly large set of flags used
> in different combinations.
> 
> TTU_MUNLOCK is one such flag. However, it is exclusively used by
> try_to_munlock(), which specifies no other flags. Therefore, rather than
> overload try_to_unmap_one() with unrelated behaviour, split this out into
> its own function and remove the flag.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> ---
> 
> v9:
> * Improved comments
> 
> v8:
> * Renamed try_to_munlock to page_mlock to better reflect what the
>   function actually does.
> * Removed the TODO from the documentation that this patch addresses.
> 
> v7:
> * Added Christoph's Reviewed-by
> 
> v4:
> * Removed redundant check for VM_LOCKED
> ---
>  Documentation/vm/unevictable-lru.rst | 33 ++++++---------
>  include/linux/rmap.h                 |  3 +-
>  mm/mlock.c                           | 10 ++---
>  mm/rmap.c                            | 61 ++++++++++++++++++++--------
>  4 files changed, 63 insertions(+), 44 deletions(-)
> 
> diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
> index 0e1490524f53..eae3af17f2d9 100644
> --- a/Documentation/vm/unevictable-lru.rst
> +++ b/Documentation/vm/unevictable-lru.rst
> @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics for the number of
>  mlocked pages.  Note, however, that at this point we haven't checked whether
>  the page is mapped by other VM_LOCKED VMAs.
>  
> -We can't call try_to_munlock(), the function that walks the reverse map to
> +We can't call page_mlock(), the function that walks the reverse map to
>  check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
> -try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
> +page_mlock() is a variant of try_to_unmap() and thus requires that the page
>  not be on an LRU list [more on these below].  However, the call to
> -isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
> +isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
>  we go ahead and clear PG_mlocked up front, as this might be the only chance we
> -have.  If we can successfully isolate the page, we go ahead and
> -try_to_munlock(), which will restore the PG_mlocked flag and update the zone
> +have.  If we can successfully isolate the page, we go ahead and call
> +page_mlock(), which will restore the PG_mlocked flag and update the zone
>  page statistics if it finds another VMA holding the page mlocked.  If we fail
>  to isolate the page, we'll have left a potentially mlocked page on the LRU.
>  This is fine, because we'll catch it later if and if vmscan tries to reclaim
> @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
>  holepunching, and truncation of file pages and their anonymous COWed pages.
>  
>  
> -try_to_munlock() Reverse Map Scan
> +page_mlock() Reverse Map Scan
>  ---------------------------------
>  
> -.. warning::
> -   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
> -   page_referenced() reverse map walker.
> -
>  When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
>  Handling <munlock_munlockall_handling>` above] tries to munlock a
>  page, it needs to determine whether or not the page is mapped by any
>  VM_LOCKED VMA without actually attempting to unmap all PTEs from the
>  page.  For this purpose, the unevictable/mlock infrastructure
> -introduced a variant of try_to_unmap() called try_to_munlock().
> +introduced a variant of try_to_unmap() called page_mlock().
>  
> -try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
> -mapped file and KSM pages with a flag argument specifying unlock versus unmap
> -processing.  Again, these functions walk the respective reverse maps looking
> -for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
> -the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
> -undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
> +page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
> +such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
> +pre-clearing of the page's PG_mlocked done by munlock_vma_page.
>  
> -Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
> +Note that page_mlock()'s reverse map walk must visit every VMA in a page's
>  reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
>  However, the scan can terminate when it encounters a VM_LOCKED VMA.
> -Although try_to_munlock() might be called a great many times when munlocking a
> +Although page_mlock() might be called a great many times when munlocking a
>  large region or tearing down a large address space that has been mlocked via
>  mlockall(), overall this is a fairly rare event.
>  
> @@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable list.
>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>  after shrink_active_list() had moved them to the inactive list, or pages mapped
>  into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
> -recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
> +recheck via page_mlock().  shrink_inactive_list() won't notice the latter,
>  but will pass on to shrink_page_list().
>  
>  shrink_page_list() again culls obviously unevictable pages that it could
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index def5c62c93b3..38a746787c2f 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -87,7 +87,6 @@ struct anon_vma_chain {
>  
>  enum ttu_flags {
>  	TTU_MIGRATION		= 0x1,	/* migration mode */
> -	TTU_MUNLOCK		= 0x2,	/* munlock mode */
>  
>  	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
>  	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
> @@ -239,7 +238,7 @@ int page_mkclean(struct page *);
>   * called in munlock()/munmap() path to check for other vmas holding
>   * the page mlocked.
>   */
> -void try_to_munlock(struct page *);
> +void page_mlock(struct page *page);
>  
>  void remove_migration_ptes(struct page *old, struct page *new, bool locked);
>  
> diff --git a/mm/mlock.c b/mm/mlock.c
> index df590fda5688..a518d4c48e65 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -108,7 +108,7 @@ void mlock_vma_page(struct page *page)
>  /*
>   * Finish munlock after successful page isolation
>   *
> - * Page must be locked. This is a wrapper for try_to_munlock()
> + * Page must be locked. This is a wrapper for page_mlock()
>   * and putback_lru_page() with munlock accounting.
>   */
>  static void __munlock_isolated_page(struct page *page)
> @@ -118,7 +118,7 @@ static void __munlock_isolated_page(struct page *page)
>  	 * and we don't need to check all the other vmas.
>  	 */
>  	if (page_mapcount(page) > 1)
> -		try_to_munlock(page);
> +		page_mlock(page);
>  
>  	/* Did try_to_unlock() succeed or punt? */
>  	if (!PageMlocked(page))
> @@ -158,7 +158,7 @@ static void __munlock_isolation_failed(struct page *page)
>   * munlock()ed or munmap()ed, we want to check whether other vmas hold the
>   * page locked so that we can leave it on the unevictable lru list and not
>   * bother vmscan with it.  However, to walk the page's rmap list in
> - * try_to_munlock() we must isolate the page from the LRU.  If some other
> + * page_mlock() we must isolate the page from the LRU.  If some other
>   * task has removed the page from the LRU, we won't be able to do that.
>   * So we clear the PageMlocked as we might not get another chance.  If we
>   * can't isolate the page, we leave it for putback_lru_page() and vmscan
> @@ -168,7 +168,7 @@ unsigned int munlock_vma_page(struct page *page)
>  {
>  	int nr_pages;
>  
> -	/* For try_to_munlock() and to serialize with page migration */
> +	/* For page_mlock() and to serialize with page migration */
>  	BUG_ON(!PageLocked(page));
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
> @@ -205,7 +205,7 @@ static int __mlock_posix_error_return(long retval)
>   *
>   * The fast path is available only for evictable pages with single mapping.
>   * Then we can bypass the per-cpu pvec and get better performance.
> - * when mapcount > 1 we need try_to_munlock() which can fail.
> + * when mapcount > 1 we need page_mlock() which can fail.
>   * when !page_evictable(), we need the full redo logic of putback_lru_page to
>   * avoid leaving evictable page in unevictable list.
>   *
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bc08c4d4b58a..e88966903e1e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1405,10 +1405,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	struct mmu_notifier_range range;
>  	enum ttu_flags flags = (enum ttu_flags)(long)arg;
>  
> -	/* munlock has nothing to gain from examining un-locked vmas */
> -	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> -		return true;
> -
>  	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
>  	    is_zone_device_page(page) && !is_device_private_page(page))
>  		return true;
> @@ -1469,8 +1465,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> -			if (flags & TTU_MUNLOCK)
> -				continue;
>  		}
>  
>  		/* Unexpected PMD-mapped THP? */
> @@ -1784,20 +1778,53 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
>  	return !page_mapcount(page) ? true : false;
>  }
>  
> +/*
> + * Walks the vma's mapping a page and mlocks the page if any locked vma's are
> + * found. Once one is found the page is locked and the scan can be terminated.
> + */

Can you please add that this requires the mmap_sem() lock to the
comments?

> +static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
> +				 unsigned long address, void *unused)
> +{
> +	struct page_vma_mapped_walk pvmw = {
> +		.page = page,
> +		.vma = vma,
> +		.address = address,
> +	};
> +
> +	/* An un-locked vma doesn't have any pages to lock, continue the scan */
> +	if (!(vma->vm_flags & VM_LOCKED))
> +		return true;
> +
> +	while (page_vma_mapped_walk(&pvmw)) {
> +		/* PTE-mapped THP are never mlocked */
> +		if (!PageTransCompound(page))
> +			mlock_vma_page(page);
> +		page_vma_mapped_walk_done(&pvmw);
> +
> +		/*
> +		 * no need to continue scanning other vma's if the page has
> +		 * been locked.
> +		 */
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
>  /**
> - * try_to_munlock - try to munlock a page
> - * @page: the page to be munlocked
> + * page_mlock - try to mlock a page
> + * @page: the page to be mlocked
>   *
> - * Called from munlock code.  Checks all of the VMAs mapping the page
> - * to make sure nobody else has this page mlocked. The page will be
> - * returned with PG_mlocked cleared if no other vmas have it mlocked.
> + * Called from munlock code. Checks all of the VMAs mapping the page and mlocks
> + * the page if any are found. The page will be returned with PG_mlocked cleared
> + * if it is not mapped by any locked vmas.
> + *
> + * mmap_lock should be held for read or write.
>   */
> -
> -void try_to_munlock(struct page *page)
> +void page_mlock(struct page *page)
>  {
>  	struct rmap_walk_control rwc = {
> -		.rmap_one = try_to_unmap_one,
> -		.arg = (void *)TTU_MUNLOCK,
> +		.rmap_one = page_mlock_one,
>  		.done = page_not_mapped,
>  		.anon_lock = page_lock_anon_vma_read,
>  
> @@ -1849,7 +1876,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
>   * Find all the mappings of a page using the mapping pointer and the vma chains
>   * contained in the anon_vma struct it points to.
>   *
> - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
> + * When called from page_mlock(), the mmap_lock of the mm containing the vma
>   * where the page was found will be held for write.  So, we won't recheck
>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * LOCKED.
> @@ -1901,7 +1928,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
>   * Find all the mappings of a page using the mapping pointer and the vma chains
>   * contained in the address_space struct it points to.
>   *
> - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
> + * When called from page_mlock(), the mmap_lock of the mm containing the vma
>   * where the page was found will be held for write.  So, we won't recheck
>   * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
>   * LOCKED.
> -- 
> 2.20.1
> 
> 

I believe munlock_vma_pages_range() still references the old function
name?

Thanks,
Liam
Shakeel Butt May 25, 2021, 11:45 p.m. UTC | #2
On Tue, May 25, 2021 at 11:40 AM Liam Howlett <liam.howlett@oracle.com> wrote:
>
[...]
> >
> > +/*
> > + * Walks the vma's mapping a page and mlocks the page if any locked vma's are
> > + * found. Once one is found the page is locked and the scan can be terminated.
> > + */
>
> Can you please add that this requires the mmap_sem() lock to the
> comments?
>

Why does this require mmap_sem() lock? Also mmap_sem() lock of which mm_struct?

> > +static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
> > +                              unsigned long address, void *unused)
> > +{
> > +     struct page_vma_mapped_walk pvmw = {
> > +             .page = page,
> > +             .vma = vma,
> > +             .address = address,
> > +     };
> > +
> > +     /* An un-locked vma doesn't have any pages to lock, continue the scan */
> > +     if (!(vma->vm_flags & VM_LOCKED))
> > +             return true;
> > +
> > +     while (page_vma_mapped_walk(&pvmw)) {
> > +             /* PTE-mapped THP are never mlocked */
> > +             if (!PageTransCompound(page))
> > +                     mlock_vma_page(page);
> > +             page_vma_mapped_walk_done(&pvmw);
> > +
> > +             /*
> > +              * no need to continue scanning other vma's if the page has
> > +              * been locked.
> > +              */
> > +             return false;
> > +     }
> > +
> > +     return true;
> > +}
Liam R. Howlett June 4, 2021, 8:49 p.m. UTC | #3
* Shakeel Butt <shakeelb@google.com> [210525 19:45]:
> On Tue, May 25, 2021 at 11:40 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> [...]
> > >
> > > +/*
> > > + * Walks the vma's mapping a page and mlocks the page if any locked vma's are
> > > + * found. Once one is found the page is locked and the scan can be terminated.
> > > + */
> >
> > Can you please add that this requires the mmap_sem() lock to the
> > comments?
> >
> 
> Why does this require mmap_sem() lock? Also mmap_sem() lock of which mm_struct?


Doesn't the mlock_vma_page() require the mmap_sem() for reading?  The
mm_struct in vma->vm_mm;


From what I can see, at least the following paths have mmap_lock held
for writing:

munlock_vma_pages_range() from __do_munmap()
munlock_vma_pages_range() from remap_file_pages()

> 
> > > +static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
> > > +                              unsigned long address, void *unused)
> > > +{
> > > +     struct page_vma_mapped_walk pvmw = {
> > > +             .page = page,
> > > +             .vma = vma,
> > > +             .address = address,
> > > +     };
> > > +
> > > +     /* An un-locked vma doesn't have any pages to lock, continue the scan */
> > > +     if (!(vma->vm_flags & VM_LOCKED))
> > > +             return true;
> > > +
> > > +     while (page_vma_mapped_walk(&pvmw)) {
> > > +             /* PTE-mapped THP are never mlocked */
> > > +             if (!PageTransCompound(page))
> > > +                     mlock_vma_page(page);
> > > +             page_vma_mapped_walk_done(&pvmw);
> > > +
> > > +             /*
> > > +              * no need to continue scanning other vma's if the page has
> > > +              * been locked.
> > > +              */
> > > +             return false;
> > > +     }
> > > +
> > > +     return true;
> > > +}

munlock_vma_pages_range() comments still reference try_to_{munlock|unmap}
Shakeel Butt June 5, 2021, 12:41 a.m. UTC | #4
On Fri, Jun 4, 2021 at 1:49 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Shakeel Butt <shakeelb@google.com> [210525 19:45]:
> > On Tue, May 25, 2021 at 11:40 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> > >
> > [...]
> > > >
> > > > +/*
> > > > + * Walks the vma's mapping a page and mlocks the page if any locked vma's are
> > > > + * found. Once one is found the page is locked and the scan can be terminated.
> > > > + */
> > >
> > > Can you please add that this requires the mmap_sem() lock to the
> > > comments?
> > >
> >
> > Why does this require mmap_sem() lock? Also mmap_sem() lock of which mm_struct?
>
>
> Doesn't the mlock_vma_page() require the mmap_sem() for reading?  The
> mm_struct in vma->vm_mm;
>

We are traversing all the vmas where this page is mapped of possibly
different mm_structs. I don't think we want to take mmap_sem() of all
those mm_structs. The commit b87537d9e2fe ("mm: rmap use pte lock not
mmap_sem to set PageMlocked") removed exactly that.

>
> From what I can see, at least the following paths have mmap_lock held
> for writing:
>
> munlock_vma_pages_range() from __do_munmap()
> munlokc_vma_pages_range() from remap_file_pages()
>

The following path does not hold mmap_sem:

exit_mmap() -> munlock_vma_pages_all() -> munlock_vma_pages_range().

I would really suggest all to carefully read the commit message of
b87537d9e2fe ("mm: rmap use pte lock not mmap_sem to set
PageMlocked").

Particularly the following paragraph:
...
    Vlastimil Babka points out another race which this patch protects against.
     try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
    moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
    leaving PageMlocked and unevictable when it should be evictable.  mmap_sem
    is ineffective because exit_mmap() does not hold it; page lock ineffective
    because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
    is effective because __munlock_pagevec_fill() takes it to get the page,
    after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.
...

Alistair, please bring back the VM_LOCKED check with pte lock held and
the comment "Holding pte lock, we do *not* need mmap_lock here".

One positive outcome of this cleanup patch is the removal of
unnecessary invalidation (unmapping for kvm case) of secondary mmus.
Liam R. Howlett June 5, 2021, 3:39 a.m. UTC | #5
* Shakeel Butt <shakeelb@google.com> [210604 20:41]:
> On Fri, Jun 4, 2021 at 1:49 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Shakeel Butt <shakeelb@google.com> [210525 19:45]:
> > > On Tue, May 25, 2021 at 11:40 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > >
> > > [...]
> > > > >
> > > > > +/*
> > > > > + * Walks the vma's mapping a page and mlocks the page if any locked vma's are
> > > > > + * found. Once one is found the page is locked and the scan can be terminated.
> > > > > + */
> > > >
> > > > Can you please add that this requires the mmap_sem() lock to the
> > > > comments?
> > > >
> > >
> > > Why does this require mmap_sem() lock? Also mmap_sem() lock of which mm_struct?
> >
> >
> > Doesn't the mlock_vma_page() require the mmap_sem() for reading?  The
> > mm_struct in vma->vm_mm;
> >
> 
> We are traversing all the vmas where this page is mapped of possibly
> different mm_structs. I don't think we want to take mmap_sem() of all
> those mm_structs. The commit b87537d9e2fe ("mm: rmap use pte lock not
> mmap_sem to set PageMlocked") removed exactly that.
> 
> >
> > From what I can see, at least the following paths have mmap_lock held
> > for writing:
> >
> > munlock_vma_pages_range() from __do_munmap()
> > munlock_vma_pages_range() from remap_file_pages()
> >
> 
> The following path does not hold mmap_sem:
> 
> exit_mmap() -> munlock_vma_pages_all() -> munlock_vma_pages_range().

Isn't this the benign race referenced by Hugh in the commit you point to
below?

> 
> I would really suggest all to carefully read the commit message of
> b87537d9e2fe ("mm: rmap use pte lock not mmap_sem to set
> PageMlocked").
> 
> Particularly the following paragraph:
> ...
>     Vlastimil Babka points out another race which this patch protects against.
>      try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
>     moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
>     leaving PageMlocked and unevictable when it should be evictable.  mmap_sem
>     is ineffective because exit_mmap() does not hold it; page lock ineffective
>     because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
>     is effective because __munlock_pagevec_fill() takes it to get the page,
>     after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.
> ...

So this is saying the race with exit_mmap() isn't benign after all?

> 
> Alistair, please bring back the VM_LOCKED check with pte lock held and
> the comment "Holding pte lock, we do *not* need mmap_lock here".
> 
> One positive outcome of this cleanup patch is the removal of
> unnecessary invalidation (unmapping for kvm case) of secondary mmus.
Shakeel Butt June 5, 2021, 4:19 a.m. UTC | #6
On Fri, Jun 4, 2021 at 8:39 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> > Particularly the following paragraph:
> > ...
> >     Vlastimil Babka points out another race which this patch protects against.
> >      try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
> >     moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
> >     leaving PageMlocked and unevictable when it should be evictable.  mmap_sem
> >     is ineffective because exit_mmap() does not hold it; page lock ineffective
> >     because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
> >     is effective because __munlock_pagevec_fill() takes it to get the page,
> >     after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.
> > ...
>
> So this is saying the race with exit_mmap() isn't benign after all?
>

Yes, not benign at all.
Alistair Popple June 7, 2021, 4:51 a.m. UTC | #7
On Saturday, 5 June 2021 10:41:03 AM AEST Shakeel Butt wrote:
> 
> 
> On Fri, Jun 4, 2021 at 1:49 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Shakeel Butt <shakeelb@google.com> [210525 19:45]:
> > > On Tue, May 25, 2021 at 11:40 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > >
> > > [...]
> > > > >
> > > > > +/*
> > > > > + * Walks the vma's mapping a page and mlocks the page if any locked vma's are
> > > > > + * found. Once one is found the page is locked and the scan can be terminated.
> > > > > + */
> > > >
> > > > Can you please add that this requires the mmap_sem() lock to the
> > > > comments?
> > > >
> > >
> > > Why does this require mmap_sem() lock? Also mmap_sem() lock of which mm_struct?
> >
> >
> > Doesn't the mlock_vma_page() require the mmap_sem() for reading?  The
> > mm_struct in vma->vm_mm;
> >
> 
> We are traversing all the vmas where this page is mapped of possibly
> different mm_structs. I don't think we want to take mmap_sem() of all
> those mm_structs. The commit b87537d9e2fe ("mm: rmap use pte lock not
> mmap_sem to set PageMlocked") removed exactly that.
> 
> >
> > From what I can see, at least the following paths have mmap_lock held
> > for writing:
> >
> > munlock_vma_pages_range() from __do_munmap()
> > munlock_vma_pages_range() from remap_file_pages()
> >
> 
> The following path does not hold mmap_sem:
> 
> exit_mmap() -> munlock_vma_pages_all() -> munlock_vma_pages_range().
> 
> I would really suggest all to carefully read the commit message of
> b87537d9e2fe ("mm: rmap use pte lock not mmap_sem to set
> PageMlocked").
> 
> Particularly the following paragraph:
> ...
>     Vlastimil Babka points out another race which this patch protects against.
>      try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
>     moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
>     leaving PageMlocked and unevictable when it should be evictable.  mmap_sem
>     is ineffective because exit_mmap() does not hold it; page lock ineffective
>     because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
>     is effective because __munlock_pagevec_fill() takes it to get the page,
>     after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.
> ...
>
> Alistair, please bring back the VM_LOCKED check with pte lock held and
> the comment "Holding pte lock, we do *not* need mmap_lock here".

Actually thanks for highlighting that paragraph. I have gone back through the 
code again in munlock_vma_pages_range() and think I have a better 
understanding of it now. So now I agree - the check of VM_LOCKED under the PTL 
is important to ensure mlock_vma_page() does not run after VM_LOCKED has been 
cleared and __munlock_pagevec_fill() has run.

Will post v10 to fix this and the try_to_munlock reference pointed out by Liam 
which I missed for v9. Thanks Shakeel for taking the time to point this out.

> One positive outcome of this cleanup patch is the removal of
> unnecessary invalidation (unmapping for kvm case) of secondary mmus.
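
For readers following the thread, here is a sketch of the shape that fix
could take, per Shakeel's suggestion above. This is an illustration, not
the posted v10 patch; the point is that VM_LOCKED is rechecked inside the
page_vma_mapped_walk() loop, i.e. with the pte lock held, so the check
cannot race with munlock_vma_pages_range() clearing VM_LOCKED.

static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
				 unsigned long address, void *unused)
{
	struct page_vma_mapped_walk pvmw = {
		.page = page,
		.vma = vma,
		.address = address,
	};

	/* An un-locked vma doesn't have any pages to lock, continue the scan */
	if (!(vma->vm_flags & VM_LOCKED))
		return true;

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * Recheck now that page_vma_mapped_walk() has taken the pte
		 * lock: holding the pte lock, we do *not* need mmap_lock here
		 * (see b87537d9e2fe referenced above).
		 */
		if (vma->vm_flags & VM_LOCKED) {
			/* PTE-mapped THP are never mlocked */
			if (!PageTransCompound(page))
				mlock_vma_page(page);
			page_vma_mapped_walk_done(&pvmw);

			/* Page is mlocked; no need to scan other vma's */
			return false;
		}
	}

	return true;
}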

Patch

diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 0e1490524f53..eae3af17f2d9 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@  mlocked, munlock_vma_page() updates that zone statistics for the number of
 mlocked pages.  Note, however, that at this point we haven't checked whether
 the page is mapped by other VM_LOCKED VMAs.
 
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
 check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
 not be on an LRU list [more on these below].  However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
+isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
 we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have.  If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have.  If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
 page statistics if it finds another VMA holding the page mlocked.  If we fail
 to isolate the page, we'll have left a potentially mlocked page on the LRU.
 This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@  munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
 holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
 ---------------------------------
 
-.. warning::
-   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
-   page_referenced() reverse map walker.
-
 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
 Handling <munlock_munlockall_handling>` above] tries to munlock a
 page, it needs to determine whether or not the page is mapped by any
 VM_LOCKED VMA without actually attempting to unmap all PTEs from the
 page.  For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
 
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing.  Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
 However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event.
 
@@ -602,7 +595,7 @@  inactive lists to the appropriate node's unevictable list.
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
 into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
-recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
+recheck via page_mlock().  shrink_inactive_list() won't notice the latter,
 but will pass on to shrink_page_list().
 
 shrink_page_list() again culls obviously unevictable pages that it could
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index def5c62c93b3..38a746787c2f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -87,7 +87,6 @@  struct anon_vma_chain {
 
 enum ttu_flags {
 	TTU_MIGRATION		= 0x1,	/* migration mode */
-	TTU_MUNLOCK		= 0x2,	/* munlock mode */
 
 	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
 	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
@@ -239,7 +238,7 @@  int page_mkclean(struct page *);
  * called in munlock()/munmap() path to check for other vmas holding
  * the page mlocked.
  */
-void try_to_munlock(struct page *);
+void page_mlock(struct page *page);
 
 void remove_migration_ptes(struct page *old, struct page *new, bool locked);
 
diff --git a/mm/mlock.c b/mm/mlock.c
index df590fda5688..a518d4c48e65 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -108,7 +108,7 @@  void mlock_vma_page(struct page *page)
 /*
  * Finish munlock after successful page isolation
  *
- * Page must be locked. This is a wrapper for try_to_munlock()
+ * Page must be locked. This is a wrapper for page_mlock()
  * and putback_lru_page() with munlock accounting.
  */
 static void __munlock_isolated_page(struct page *page)
@@ -118,7 +118,7 @@  static void __munlock_isolated_page(struct page *page)
 	 * and we don't need to check all the other vmas.
 	 */
 	if (page_mapcount(page) > 1)
-		try_to_munlock(page);
+		page_mlock(page);
 
 	/* Did try_to_unlock() succeed or punt? */
 	if (!PageMlocked(page))
@@ -158,7 +158,7 @@  static void __munlock_isolation_failed(struct page *page)
  * munlock()ed or munmap()ed, we want to check whether other vmas hold the
  * page locked so that we can leave it on the unevictable lru list and not
  * bother vmscan with it.  However, to walk the page's rmap list in
- * try_to_munlock() we must isolate the page from the LRU.  If some other
+ * page_mlock() we must isolate the page from the LRU.  If some other
  * task has removed the page from the LRU, we won't be able to do that.
  * So we clear the PageMlocked as we might not get another chance.  If we
  * can't isolate the page, we leave it for putback_lru_page() and vmscan
@@ -168,7 +168,7 @@  unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
 
-	/* For try_to_munlock() and to serialize with page migration */
+	/* For page_mlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -205,7 +205,7 @@  static int __mlock_posix_error_return(long retval)
  *
  * The fast path is available only for evictable pages with single mapping.
  * Then we can bypass the per-cpu pvec and get better performance.
- * when mapcount > 1 we need try_to_munlock() which can fail.
+ * when mapcount > 1 we need page_mlock() which can fail.
  * when !page_evictable(), we need the full redo logic of putback_lru_page to
  * avoid leaving evictable page in unevictable list.
  *
diff --git a/mm/rmap.c b/mm/rmap.c
index bc08c4d4b58a..e88966903e1e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,10 +1405,6 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 
-	/* munlock has nothing to gain from examining un-locked vmas */
-	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
-		return true;
-
 	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
@@ -1469,8 +1465,6 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
-			if (flags & TTU_MUNLOCK)
-				continue;
 		}
 
 		/* Unexpected PMD-mapped THP? */
@@ -1784,20 +1778,53 @@  bool try_to_unmap(struct page *page, enum ttu_flags flags)
 	return !page_mapcount(page) ? true : false;
 }
 
+/*
+ * Walks the vma's mapping a page and mlocks the page if any locked vma's are
+ * found. Once one is found the page is locked and the scan can be terminated.
+ */
+static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address, void *unused)
+{
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+
+	/* An un-locked vma doesn't have any pages to lock, continue the scan */
+	if (!(vma->vm_flags & VM_LOCKED))
+		return true;
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/* PTE-mapped THP are never mlocked */
+		if (!PageTransCompound(page))
+			mlock_vma_page(page);
+		page_vma_mapped_walk_done(&pvmw);
+
+		/*
+		 * no need to continue scanning other vma's if the page has
+		 * been locked.
+		 */
+		return false;
+	}
+
+	return true;
+}
+
 /**
- * try_to_munlock - try to munlock a page
- * @page: the page to be munlocked
+ * page_mlock - try to mlock a page
+ * @page: the page to be mlocked
  *
- * Called from munlock code.  Checks all of the VMAs mapping the page
- * to make sure nobody else has this page mlocked. The page will be
- * returned with PG_mlocked cleared if no other vmas have it mlocked.
+ * Called from munlock code. Checks all of the VMAs mapping the page and mlocks
+ * the page if any are found. The page will be returned with PG_mlocked cleared
+ * if it is not mapped by any locked vmas.
+ *
+ * mmap_lock should be held for read or write.
  */
-
-void try_to_munlock(struct page *page)
+void page_mlock(struct page *page)
 {
 	struct rmap_walk_control rwc = {
-		.rmap_one = try_to_unmap_one,
-		.arg = (void *)TTU_MUNLOCK,
+		.rmap_one = page_mlock_one,
 		.done = page_not_mapped,
 		.anon_lock = page_lock_anon_vma_read,
 
@@ -1849,7 +1876,7 @@  static struct anon_vma *rmap_walk_anon_lock(struct page *page,
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the anon_vma struct it points to.
  *
- * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
+ * When called from page_mlock(), the mmap_lock of the mm containing the vma
  * where the page was found will be held for write.  So, we won't recheck
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
@@ -1901,7 +1928,7 @@  static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
  *
- * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
+ * When called from page_mlock(), the mmap_lock of the mm containing the vma
  * where the page was found will be held for write.  So, we won't recheck
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
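
For context on the caller side, this is how the renamed walk is driven
from the munlock path. It is a condensed sketch of the mm/mlock.c hunks
above (the sketch_ name is a placeholder, and the munlock accounting done
by the real __munlock_isolated_page() is omitted):

/*
 * Condensed sketch of the existing caller: by this point the page has
 * already had PG_mlocked cleared and been isolated from the LRU.
 */
static void sketch_munlock_isolated_page(struct page *page)
{
	/*
	 * Optimization: if the page was mapped just once, that's our mapping
	 * and we don't need to check all the other vmas.
	 */
	if (page_mapcount(page) > 1)
		page_mlock(page);	/* re-sets PG_mlocked if a VM_LOCKED vma maps it */

	/* Whether or not page_mlock() re-mlocked it, put it back on an LRU */
	putback_lru_page(page);
}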