Message ID | 90c8962-d188-8687-dc70-628293316343@google.com (mailing list archive)
---|---
State | New
Series | mm/munlock: rework of mlock+munlock page handling
On 2/6/22 22:49, Hugh Dickins wrote:
> Page migration of a VM_LOCKED page tends to fail, because when the old
> page is unmapped, it is put on the mlock pagevec with raised refcount,
> which then fails the freeze.
>
> At first I thought this would be fixed by a local mlock_page_drain() at
> the upper rmap_walk() level - which would have nicely batched all the
> munlocks of that page; but tests show that the task can too easily move
> to another cpu, leaving pagevec residue behind which fails the migration.
>
> So try_to_migrate_one() drains the local pagevec after page_remove_rmap()
> from a VM_LOCKED vma; do the same in try_to_unmap_one(), whose
> TTU_IGNORE_MLOCK users would want the same treatment; and do the same
> in remove_migration_pte() - not important when successfully inserting
> a new page, but necessary when hoping to retry after failure.
>
> Any new pagevec runs the risk of adding a new way of stranding, and we
> might discover other corners where mlock_page_drain() or lru_add_drain()
> would now help. If the mlock pagevec raises doubts, we can easily add a
> sysctl to tune its length to 1, which reverts to synchronous operation.

Not a fan of adding new sysctls like those, as that just pushes the failure
of kernel devs to poor admins :)

The old pagevec usage deleted by patch 1 was limited to the naturally larger
munlock_vma_pages_range() operation. The new per-cpu based one is more
general, which obviously has its advantages, but then it might bring new
corner cases. So if this turns out to be a big problem, I would rather go
back to the limited-scenario pagevec than a sysctl?

> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/migrate.c | 2 ++
>  mm/rmap.c    | 4 ++++
>  2 files changed, 6 insertions(+)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f4bcf1541b62..e7d0b68d5dcb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>              page_add_file_rmap(new, vma, false);
>          set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
>      }
> +    if (vma->vm_flags & VM_LOCKED)
> +        mlock_page_drain(smp_processor_id());
>
>      /* No need to invalidate - it was non-present before */
>      update_mmu_cache(vma, pvmw.address, pvmw.pte);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5442a5c97a85..714bfdc72c7b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>           * See Documentation/vm/mmu_notifier.rst
>           */
>          page_remove_rmap(subpage, vma, PageHuge(page));
> +        if (vma->vm_flags & VM_LOCKED)
> +            mlock_page_drain(smp_processor_id());
>          put_page(page);
>      }
>
> @@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
>           * See Documentation/vm/mmu_notifier.rst
>           */
>          page_remove_rmap(subpage, vma, PageHuge(page));
> +        if (vma->vm_flags & VM_LOCKED)
> +            mlock_page_drain(smp_processor_id());
>          put_page(page);
>      }
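As background to the pagevec discussion above, here is a minimal, self-contained userspace sketch of the deferral pattern being debated. It is not the kernel's implementation: BATCH_SIZE, batch_add() and batch_drain() are made-up names, the real structure is a per-cpu pagevec of struct page pointers, and the real drain is mlock_page_drain(). It only shows that deferred items sit in the batch until something fills or explicitly drains it.

```c
/*
 * Illustrative userspace sketch of a deferred batch with an explicit
 * drain.  Not kernel code; names are invented for this example.
 */
#include <stdio.h>

#define BATCH_SIZE 15           /* a pagevec holds a handful of pages */

struct batch {
	int nr;
	int items[BATCH_SIZE];  /* stand-in for struct page pointers */
};

static void process_item(int item)
{
	/* stand-in for the real work done at drain time */
	printf("processing %d\n", item);
}

static void batch_drain(struct batch *b)
{
	for (int i = 0; i < b->nr; i++)
		process_item(b->items[i]);
	b->nr = 0;
}

static void batch_add(struct batch *b, int item)
{
	b->items[b->nr++] = item;
	if (b->nr == BATCH_SIZE)        /* full: drain now */
		batch_drain(b);
}

int main(void)
{
	struct batch b = { 0 };

	for (int i = 0; i < 20; i++)
		batch_add(&b, i);
	/*
	 * Without this explicit drain, items 15..19 stay stranded in the
	 * batch - the analogue of pagevec residue left behind on a cpu
	 * the task has since left, which is what fails the migration.
	 */
	batch_drain(&b);
	return 0;
}
```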
On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:49, Hugh Dickins wrote:
> >
> > Any new pagevec runs the risk of adding a new way of stranding, and we
> > might discover other corners where mlock_page_drain() or lru_add_drain()
> > would now help. If the mlock pagevec raises doubts, we can easily add a
> > sysctl to tune its length to 1, which reverts to synchronous operation.
>
> Not a fan of adding new sysctls like those, as that just pushes the failure
> of kernel devs to poor admins :)
>
> The old pagevec usage deleted by patch 1 was limited to the naturally larger
> munlock_vma_pages_range() operation. The new per-cpu based one is more
> general, which obviously has its advantages, but then it might bring new
> corner cases.
>
> So if this turns out to be a big problem, I would rather go back to the
> limited-scenario pagevec than a sysctl?

Okay, I'll delete that comment proposing a sysctl, which was more a possible
safety measure for our internal experimentation than something for general
use. I just thought it was an easy way to force synchronous operation.

I don't expect a big problem. The flush in this commit deals, I think, with
the only way it was treading on its own toes, using a pagevec at a level
which relied on the unlikelihood of a pagevec reference. But I can imagine
that such-and-such a test will expect Mlocked or Unevictable to be exact,
and this pagevec now delays it becoming exact. I've not seen any example so
far, but it does seem possible. Maybe we shall fix the test, maybe we shall
add a drain.

I hadn't thought of limiting the use of the pagevec to the mass operation
(if a significant problem emerges). Yes, that might be a better idea,
thanks. Anyway, we don't need to worry in advance.

A change I also had in this patch, originally, was for
/proc/sys/vm/stat_refresh to lru_add_drain_all() first. I'm surprised that
tests see good numbers without it doing so; but since I've not actually seen
the need for it yet, I dropped that - we can always add it later if need
emerges.

Hugh
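On the point about tests expecting Mlocked and Unevictable to be exact: such a test typically writes to /proc/sys/vm/stat_refresh (which folds per-cpu vmstat deltas into the global counters, and needs root) and then reads /proc/meminfo. The sketch below is only an illustrative userspace probe along those lines, not part of the series; note that stat_refresh on its own says nothing about draining the pagevecs, which is exactly the lru_add_drain_all() change Hugh says he dropped.

```c
/*
 * Userspace probe: refresh vmstat, then read the mlock counters.
 * With work parked in a per-cpu pagevec, the values read here may
 * briefly lag the "exact" figures.
 */
#include <stdio.h>
#include <string.h>

static long read_meminfo_kb(const char *field)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, field, strlen(field))) {
			sscanf(line + strlen(field), " %ld", &kb);
			break;
		}
	}
	fclose(f);
	return kb;
}

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/stat_refresh", "w");

	if (f) {                /* needs privilege; skip if not root */
		fputs("1\n", f);
		fclose(f);
	}
	printf("Mlocked:     %ld kB\n", read_meminfo_kb("Mlocked:"));
	printf("Unevictable: %ld kB\n", read_meminfo_kb("Unevictable:"));
	return 0;
}
```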
diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
             page_add_file_rmap(new, vma, false);
         set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
     }
+    if (vma->vm_flags & VM_LOCKED)
+        mlock_page_drain(smp_processor_id());

     /* No need to invalidate - it was non-present before */
     update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
          * See Documentation/vm/mmu_notifier.rst
          */
         page_remove_rmap(subpage, vma, PageHuge(page));
+        if (vma->vm_flags & VM_LOCKED)
+            mlock_page_drain(smp_processor_id());
         put_page(page);
     }

@@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
          * See Documentation/vm/mmu_notifier.rst
          */
         page_remove_rmap(subpage, vma, PageHuge(page));
+        if (vma->vm_flags & VM_LOCKED)
+            mlock_page_drain(smp_processor_id());
         put_page(page);
     }
Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So try_to_migrate_one() drains the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help. If the mlock pagevec raises doubts, we can easily add a
sysctl to tune its length to 1, which reverts to synchronous operation.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/migrate.c | 2 ++
 mm/rmap.c    | 4 ++++
 2 files changed, 6 insertions(+)
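The "fails the freeze" in the first paragraph refers to migration's refcount freeze: the old page's reference count can only be frozen when it equals the number of references the migration code expects (done atomically in the kernel, via page_ref_freeze()), so a reference still held by the mlock pagevec makes that comparison fail and migration bails out to retry. Below is a deliberately simplified, non-atomic userspace sketch of that check; the names are invented, not the kernel's.

```c
/* Simplified, non-atomic illustration of the refcount-freeze check. */
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
	int refcount;
};

/* Freeze to 0 only if the count is exactly what the caller expects. */
static bool ref_freeze(struct fake_page *page, int expected)
{
	if (page->refcount != expected)
		return false;
	page->refcount = 0;
	return true;
}

int main(void)
{
	/* one reference from the mapping, one from the migration caller */
	struct fake_page page = { .refcount = 2 };
	int expected = 2;

	printf("freeze without pagevec ref: %s\n",
	       ref_freeze(&page, expected) ? "ok" : "-EAGAIN");

	/* same situation, but the mlock pagevec still holds a reference */
	page.refcount = 3;
	printf("freeze with pagevec ref:    %s\n",
	       ref_freeze(&page, expected) ? "ok" : "-EAGAIN");
	return 0;
}
```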