Message ID | 8bc3ee8c-7f1-d812-7f22-4f9f6d436bc@google.com (mailing list archive)
---|---
State | New
Series | mm/munlock: rework of mlock+munlock page handling
On Sun, 6 Feb 2022 13:42:09 -0800 (PST) Hugh Dickins wrote:
> +static void mlock_vma_pages_range(struct vm_area_struct *vma,
> +        unsigned long start, unsigned long end, vm_flags_t newflags)
>  {
> -        /* Reimplementation to follow in later commit */
> +        static const struct mm_walk_ops mlock_walk_ops = {
> +                .pmd_entry = mlock_pte_range,
> +        };
> +
> +        /*
> +         * There is a slight chance that concurrent page migration,
> +         * or page reclaim finding a page of this now-VM_LOCKED vma,
> +         * will call mlock_vma_page() and raise page's mlock_count:
> +         * double counting, leaving the page unevictable indefinitely.
> +         * Communicate this danger to mlock_vma_page() with VM_IO,
> +         * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
> +         * mmap_lock is held in write mode here, so this weird
> +         * combination should not be visible to others.
> +         */
> +        if (newflags & VM_LOCKED)
> +                newflags |= VM_IO;
> +        WRITE_ONCE(vma->vm_flags, newflags);

Nit

The WRITE_ONCE is not needed, given the certainty of invisibility to
others - it will quiesce syzbot reporting the case of visibility.

Hillf

> +
> +        lru_add_drain();
> +        walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
> +        lru_add_drain();
> +
> +        if (newflags & VM_IO) {
> +                newflags &= ~VM_IO;
> +                WRITE_ONCE(vma->vm_flags, newflags);
> +        }
>  }
> 
On Mon, 7 Feb 2022, Hillf Danton wrote:
> On Sun, 6 Feb 2022 13:42:09 -0800 (PST) Hugh Dickins wrote:
> > +static void mlock_vma_pages_range(struct vm_area_struct *vma,
> > +        unsigned long start, unsigned long end, vm_flags_t newflags)
> >  {
> > -        /* Reimplementation to follow in later commit */
> > +        static const struct mm_walk_ops mlock_walk_ops = {
> > +                .pmd_entry = mlock_pte_range,
> > +        };
> > +
> > +        /*
> > +         * There is a slight chance that concurrent page migration,
> > +         * or page reclaim finding a page of this now-VM_LOCKED vma,
> > +         * will call mlock_vma_page() and raise page's mlock_count:
> > +         * double counting, leaving the page unevictable indefinitely.
> > +         * Communicate this danger to mlock_vma_page() with VM_IO,
> > +         * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
> > +         * mmap_lock is held in write mode here, so this weird
> > +         * combination should not be visible to others.
> > +         */
> > +        if (newflags & VM_LOCKED)
> > +                newflags |= VM_IO;
> > +        WRITE_ONCE(vma->vm_flags, newflags);
> 
> Nit
> 
> The WRITE_ONCE is not needed, given the certainty of invisibility to
> others - it will quiesce syzbot reporting the case of visibility.

Ah, maybe I can rewrite that comment better: when I said "visible to
others", I meant visible to "the outside world", those participating
in the usual mmap_lock'ed access, syscalls and /proc/pid/maps and
smaps etc.

The point here is that some kernel low-level internals (page migration
and page reclaim) peek at vma->vm_flags without mmap_lock (but with
anon_vma lock or i_mmap_rwsem).

Originally I had VM_LOCKED set in vma->vm_flags before calling
mlock_vma_pages_range(), no need for a newflags parameter.  Then
realized that left a tiny window in which VM_LOCKED was visible to
migration and reclaim without the safening VM_IO, so changed it to
pass in newflags, then "newflags |= VM_IO", then
"vma->vm_flags = newflags" there.

Then realized that perhaps an uncooperative compiler might be inspired
to mutate that into "vma->vm_flags = newflags" followed by
"vma->vm_flags |= VM_IO".  I hope it would not, but can I be sure that
it would not?  That's why I ended up with WRITE_ONCE() there.

Maybe all rather overkill: but trying to ensure that we undercount
mmap_locked rather than risk overcounting it.

Hugh
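To make the store-splitting concern above concrete, here is a minimal
user-space sketch, not kernel code: the flag values and the simplified
WRITE_ONCE() macro below are illustrative stand-ins for the kernel's
definitions.  It only shows the pattern being discussed: compute the
full flags word in a local first, then publish it with a single store,
so a lock-less reader of vm_flags can never observe VM_LOCKED without
VM_IO.

#include <stdio.h>

/* Illustrative flag values, not taken from kernel headers. */
#define VM_LOCKED 0x00002000UL
#define VM_IO     0x00004000UL

/* Simplified model of the kernel's WRITE_ONCE(): one volatile store,
 * which the compiler may not split into "= newflags" then "|= VM_IO". */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct vma { unsigned long vm_flags; };

static void set_locked(struct vma *vma, unsigned long newflags)
{
        if (newflags & VM_LOCKED)
                newflags |= VM_IO;              /* combine in a local first */
        WRITE_ONCE(vma->vm_flags, newflags);    /* then publish in one store */
}

int main(void)
{
        struct vma v = { .vm_flags = 0 };

        set_locked(&v, VM_LOCKED);
        printf("vm_flags = %#lx\n", v.vm_flags); /* VM_LOCKED and VM_IO together */
        return 0;
}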
On 2/6/22 22:42, Hugh Dickins wrote:
> Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
> required to lower the mlock_counts when munlocking without munmapping;
> and its complement, implementation of mlock_vma_pages_range(), required
> to raise the mlock_counts on pages already there when a range is mlocked.
> 
> Combine them into just the one function mlock_vma_pages_range(), using
> walk_page_range() to run mlock_pte_range().  This approach fixes the
> "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
> https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/
> 
> Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
> a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
> pte first, it will not try to munlock the page: leaving release_pages()
> to correct it when the last reference to the page is gone - that's okay,
> a page is not evictable anyway while it is held by an extra reference.
> 
> Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
> a racing remove_migration_pte() or try_to_unmap_one() (depending on
> i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
> then mlock_pte_range() mlock it a second time.  This is harder to
> reproduce, but a more serious race because it could leave the page
> unevictable indefinitely though the area is munlocked afterwards.
> Guard against it by setting the (inappropriate) VM_IO flag,
> and modifying mlock_vma_page() to decline such vmas.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
>          pgoff_t pgoff;
>          int nr_pages;
>          int ret = 0;
> -        int lock = !!(newflags & VM_LOCKED);
> -        vm_flags_t old_flags = vma->vm_flags;
> +        vm_flags_t oldflags = vma->vm_flags;
> 
>          if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||

Nit: can use oldflags instead of vma->vm_flags above?
On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:42, Hugh Dickins wrote:
> > @@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >          pgoff_t pgoff;
> >          int nr_pages;
> >          int ret = 0;
> > -        int lock = !!(newflags & VM_LOCKED);
> > -        vm_flags_t old_flags = vma->vm_flags;
> > +        vm_flags_t oldflags = vma->vm_flags;
> > 
> >          if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
> 
> Nit: can use oldflags instead of vma->vm_flags above?

Yes thanks, that is nicer, I'm making that change now.

Hugh
diff --git a/mm/internal.h b/mm/internal.h
index a43d79335c16..b3f0dd3ffba2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,7 +412,8 @@ void mlock_page(struct page *page);
 static inline void mlock_vma_page(struct page *page,
                        struct vm_area_struct *vma, bool compound)
 {
-        if (unlikely(vma->vm_flags & VM_LOCKED) &&
+        /* VM_IO check prevents migration from double-counting during mlock */
+        if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) &&
             (compound || !PageTransCompound(page)))
                 mlock_page(page);
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index 0d3ae04b1f4e..f8e5dcff21ae 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -14,6 +14,7 @@
 #include <linux/swapops.h>
 #include <linux/pagemap.h>
 #include <linux/pagevec.h>
+#include <linux/pagewalk.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
@@ -127,23 +128,90 @@ void munlock_page(struct page *page)
         unlock_page_memcg(page);
 }
 
+static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
+                unsigned long end, struct mm_walk *walk)
+
+{
+        struct vm_area_struct *vma = walk->vma;
+        spinlock_t *ptl;
+        pte_t *start_pte, *pte;
+        struct page *page;
+
+        ptl = pmd_trans_huge_lock(pmd, vma);
+        if (ptl) {
+                if (!pmd_present(*pmd))
+                        goto out;
+                if (is_huge_zero_pmd(*pmd))
+                        goto out;
+                page = pmd_page(*pmd);
+                if (vma->vm_flags & VM_LOCKED)
+                        mlock_page(page);
+                else
+                        munlock_page(page);
+                goto out;
+        }
+
+        start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+        for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+                if (!pte_present(*pte))
+                        continue;
+                page = vm_normal_page(vma, addr, *pte);
+                if (!page)
+                        continue;
+                if (PageTransCompound(page))
+                        continue;
+                if (vma->vm_flags & VM_LOCKED)
+                        mlock_page(page);
+                else
+                        munlock_page(page);
+        }
+        pte_unmap(start_pte);
+out:
+        spin_unlock(ptl);
+        cond_resched();
+        return 0;
+}
+
 /*
- * munlock_vma_pages_range() - munlock all pages in the vma range.'
- * @vma - vma containing range to be munlock()ed.
+ * mlock_vma_pages_range() - mlock any pages already in the range,
+ * or munlock all pages in the range.
+ * @vma - vma containing range to be mlock()ed or munlock()ed
  * @start - start address in @vma of the range
- * @end - end of range in @vma.
- *
- * For mremap(), munmap() and exit().
+ * @end - end of range in @vma
+ * @newflags - the new set of flags for @vma.
  *
- * Called with @vma VM_LOCKED.
- *
- * Returns with VM_LOCKED cleared.  Callers must be prepared to
- * deal with this.
+ * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
+ * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
  */
-static void munlock_vma_pages_range(struct vm_area_struct *vma,
-                                    unsigned long start, unsigned long end)
+static void mlock_vma_pages_range(struct vm_area_struct *vma,
+        unsigned long start, unsigned long end, vm_flags_t newflags)
 {
-        /* Reimplementation to follow in later commit */
+        static const struct mm_walk_ops mlock_walk_ops = {
+                .pmd_entry = mlock_pte_range,
+        };
+
+        /*
+         * There is a slight chance that concurrent page migration,
+         * or page reclaim finding a page of this now-VM_LOCKED vma,
+         * will call mlock_vma_page() and raise page's mlock_count:
+         * double counting, leaving the page unevictable indefinitely.
+         * Communicate this danger to mlock_vma_page() with VM_IO,
+         * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
+         * mmap_lock is held in write mode here, so this weird
+         * combination should not be visible to others.
+         */
+        if (newflags & VM_LOCKED)
+                newflags |= VM_IO;
+        WRITE_ONCE(vma->vm_flags, newflags);
+
+        lru_add_drain();
+        walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
+        lru_add_drain();
+
+        if (newflags & VM_IO) {
+                newflags &= ~VM_IO;
+                WRITE_ONCE(vma->vm_flags, newflags);
+        }
 }
 
 /*
@@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
         pgoff_t pgoff;
         int nr_pages;
         int ret = 0;
-        int lock = !!(newflags & VM_LOCKED);
-        vm_flags_t old_flags = vma->vm_flags;
+        vm_flags_t oldflags = vma->vm_flags;
 
         if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
             is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
@@ -197,9 +264,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
          * Keep track of amount of locked VM.
          */
         nr_pages = (end - start) >> PAGE_SHIFT;
-        if (!lock)
+        if (!(newflags & VM_LOCKED))
                 nr_pages = -nr_pages;
-        else if (old_flags & VM_LOCKED)
+        else if (oldflags & VM_LOCKED)
                 nr_pages = 0;
         mm->locked_vm += nr_pages;
 
@@ -209,11 +276,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
          * set VM_LOCKED, populate_vma_page_range will bring it back.
          */
-        if (lock)
+        if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
+                /* No work to do, and mlocking twice would be wrong */
                 vma->vm_flags = newflags;
-        else
-                munlock_vma_pages_range(vma, start, end);
-
+        } else {
+                mlock_vma_pages_range(vma, start, end, newflags);
+        }
 out:
         *prev = vma;
         return ret;
Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range().  This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() mlock it a second time.  This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/internal.h |   3 +-
 mm/mlock.c    | 108 ++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 90 insertions(+), 21 deletions(-)
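For reference, a minimal user-space sketch of the scenario behind the
"Very slow unlockall()" report linked above; the mapping size and flags
here are assumptions for illustration, not taken from the original
report.  A large PROT_NONE mapping gets VM_LOCKED set by mlockall() but
can never be populated, so munlockall() has to cover the whole empty
range - the case this patch's pagewalk-based mlock_vma_pages_range()
speeds up.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* Hypothetical size, for a 64-bit system; mlockall() needs
         * CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK to succeed. */
        size_t len = 1UL << 34;         /* 16 GiB of PROT_NONE address space */
        void *p = mmap(NULL, len, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Sets VM_LOCKED on the vma; PROT_NONE means no pages are populated. */
        if (mlockall(MCL_CURRENT))
                perror("mlockall");

        /* The munlock path reworked by this series: previously reported
         * as very slow over large unpopulated PROT_NONE areas. */
        if (munlockall())
                perror("munlockall");

        munmap(p, len);
        return 0;
}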