
[11/13] mm/munlock: page migration needs mlock pagevec drained

Message ID 90c8962-d188-8687-dc70-628293316343@google.com (mailing list archive)
State New
Series mm/munlock: rework of mlock+munlock page handling

Commit Message

Hugh Dickins Feb. 6, 2022, 9:49 p.m. UTC
Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.
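
For illustration, a minimal userspace sketch of the refcount freeze that
the extra pagevec reference defeats (this is not kernel code; the names
and numbers are made up, it only models the check that migration performs
on the old page):

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Sketch only: migration can proceed only if the old page's refcount
 * can be atomically "frozen" from the expected value down to zero.
 * One extra reference, such as the one held by a per-cpu mlock
 * pagevec, makes the actual count exceed the expected count, so the
 * exchange fails and this migration attempt is abandoned.
 */
static bool freeze_refcount(atomic_int *refcount, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(refcount, &old, 0);
}

int main(void)
{
	atomic_int refcount = 3;	/* 2 expected refs + 1 pagevec ref */

	return freeze_refcount(&refcount, 2) ? 0 : 1;	/* exits 1: freeze fails */
}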

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.
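
A conceptual sketch of that per-cpu behaviour (hypothetical code, not the
kernel's pagevec implementation; the names merely mirror the ones used in
this series): draining flushes only the named cpu's buffer, so a page
queued just before the task was moved to another cpu is left behind as
residue, still holding its extra reference.

#include <stddef.h>

#define NR_CPUS		4
#define PVEC_SIZE	15

/* Sketch only: one deferral buffer per cpu, like the mlock pagevec. */
struct pvec {
	size_t nr;
	void *pages[PVEC_SIZE];
};

static struct pvec mlock_pvec[NR_CPUS];

static void mlock_page_queue(int cpu, void *page)
{
	struct pvec *pvec = &mlock_pvec[cpu];

	if (pvec->nr < PVEC_SIZE)
		pvec->pages[pvec->nr++] = page;
}

/* Flushes only @cpu's buffer: residue queued on other cpus is untouched. */
static void mlock_page_drain(int cpu)
{
	mlock_pvec[cpu].nr = 0;		/* stand-in for the real processing */
}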

So have try_to_migrate_one() drain the local pagevec after
page_remove_rmap() from a VM_LOCKED vma; and do the same in
try_to_unmap_one(), whose TTU_IGNORE_MLOCK users would want the same
treatment; and do the same in remove_migration_pte() - not important
when successfully inserting a new page, but necessary when hoping to
retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.  If the mlock pagevec raises doubts, we can easily add a
sysctl to tune its length to 1, which reverts to synchronous operation.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/migrate.c | 2 ++
 mm/rmap.c    | 4 ++++
 2 files changed, 6 insertions(+)

Comments

Vlastimil Babka Feb. 11, 2022, 6:49 p.m. UTC | #1
On 2/6/22 22:49, Hugh Dickins wrote:
> Page migration of a VM_LOCKED page tends to fail, because when the old
> page is unmapped, it is put on the mlock pagevec with raised refcount,
> which then fails the freeze.
> 
> At first I thought this would be fixed by a local mlock_page_drain() at
> the upper rmap_walk() level - which would have nicely batched all the
> munlocks of that page; but tests show that the task can too easily move
> to another cpu, leaving pagevec residue behind which fails the migration.
> 
> So have try_to_migrate_one() drain the local pagevec after
> page_remove_rmap() from a VM_LOCKED vma; and do the same in
> try_to_unmap_one(), whose TTU_IGNORE_MLOCK users would want the same
> treatment; and do the same in remove_migration_pte() - not important
> when successfully inserting a new page, but necessary when hoping to
> retry after failure.
> 
> Any new pagevec runs the risk of adding a new way of stranding, and we
> might discover other corners where mlock_page_drain() or lru_add_drain()
> would now help.  If the mlock pagevec raises doubts, we can easily add a
> sysctl to tune its length to 1, which reverts to synchronous operation.

Not a fan of adding new sysctls like those, as that just pushes the
failures of kernel devs onto poor admins :)
The old pagevec usage deleted by patch 1 was limited to the naturally larger
munlock_vma_pages_range() operation. The new per-cpu based one is more
general, which obviously has its advantages, but then it might bring new
corner cases.
So if this turns out to be a big problem, I would rather go back to the
limited-scenario pagevec than add a sysctl?

> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/migrate.c | 2 ++
>  mm/rmap.c    | 4 ++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f4bcf1541b62..e7d0b68d5dcb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  				page_add_file_rmap(new, vma, false);
>  			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
>  		}
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_page_drain(smp_processor_id());
>  
>  		/* No need to invalidate - it was non-present before */
>  		update_mmu_cache(vma, pvmw.address, pvmw.pte);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5442a5c97a85..714bfdc72c7b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		 * See Documentation/vm/mmu_notifier.rst
>  		 */
>  		page_remove_rmap(subpage, vma, PageHuge(page));
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_page_drain(smp_processor_id());
>  		put_page(page);
>  	}
>  
> @@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
>  		 * See Documentation/vm/mmu_notifier.rst
>  		 */
>  		page_remove_rmap(subpage, vma, PageHuge(page));
> +		if (vma->vm_flags & VM_LOCKED)
> +			mlock_page_drain(smp_processor_id());
>  		put_page(page);
>  	}
>
Hugh Dickins Feb. 14, 2022, 5:34 a.m. UTC | #2
On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:49, Hugh Dickins wrote:
> > 
> > Any new pagevec runs the risk of adding a new way of stranding, and we
> > might discover other corners where mlock_page_drain() or lru_add_drain()
> > would now help.  If the mlock pagevec raises doubts, we can easily add a
> > sysctl to tune its length to 1, which reverts to synchronous operation.
> 
> Not a fan of adding new sysctls like those, as that just pushes the
> failures of kernel devs onto poor admins :)
> The old pagevec usage deleted by patch 1 was limited to the naturally larger
> munlock_vma_pages_range() operation. The new per-cpu based one is more
> general, which obviously has its advantages, but then it might bring new
> corner cases.
> So if this turns out to be a big problem, I would rather go back to the
> limited-scenario pagevec than add a sysctl?

Okay, I'll delete that comment proposing a sysctl, which was more as
a possible safety measure for our internal experimentation than for
general use.  I just thought it was an easy way to force synchronous
operation.

I don't expect a big problem.  The flush in this commit deals, I think,
with the only way it was treading on its own toes, using a pagevec at
a level which relied on the unlikelihood of a pagevec reference.

But I can imagine that such-and-such a test will expect Mlocked or
Unevictable to be exact, and this pagevec now delays it becoming exact.
I've not seen any example so far, but it does seem possible.  Maybe
we shall fix the test, maybe we shall add a drain.

I hadn't thought of limiting the use of pagevec to the mass operation
(if a significant problem emerges).  Yes, that might be a better idea,
thanks.  Anyway, we don't need to worry in advance.

A change I also had in this patch, originally, was for
/proc/sys/vm/stat_refresh to lru_add_drain_all() first.  I'm surprised
that tests see good numbers without it doing so; but since I've not
actually seen the need for it yet, dropped that - we can always add
it later if need emerges.
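
For reference, the dropped change would presumably have amounted to
something like the sketch below (hypothetical, not code from this series):
draining the per-cpu pagevecs at the start of the /proc/sys/vm/stat_refresh
handler, before the counters are re-read.

/* Hypothetical sketch only - the body of the real handler is elided. */
int vmstat_refresh(struct ctl_table *table, int write,
		   void *buffer, size_t *lenp, loff_t *ppos)
{
	/*
	 * Flush the per-cpu LRU/mlock pagevecs first, so the refreshed
	 * counters (Mlocked, Unevictable, ...) are not understated by
	 * pages still awaiting their deferred processing.
	 */
	lru_add_drain_all();

	/* ... existing refresh and sanity checks of the vm counters ... */
	return 0;
}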

Hugh

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@  static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 				page_add_file_rmap(new, vma, false);
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 		}
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 
 		/* No need to invalidate - it was non-present before */
 		update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}
 
@@ -1930,6 +1932,8 @@  static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}