Message ID | alpine.LSU.2.11.2101161409470.2022@eggly.anvils (mailing list archive) |
---|---|
State | New, archived |
Series | mm: thp: fix MADV_REMOVE deadlock on shmem THP |
On Sat, 16 Jan 2021 14:16:24 -0800 (PST), Hugh Dickins wrote:
>
> Sergey reported deadlock between kswapd correctly doing its usual
> lock_page(page) followed by down_read(page->mapping->i_mmap_rwsem),
> and madvise(MADV_REMOVE) on an madvise(MADV_HUGEPAGE) area doing
> down_write(page->mapping->i_mmap_rwsem) followed by lock_page(page).
>
> This happened when shmem_fallocate(punch hole)'s unmap_mapping_range()
> reaches zap_pmd_range()'s call to __split_huge_pmd(). The same deadlock
> could occur when partially truncating a mapped huge tmpfs file, or using
> fallocate(FALLOC_FL_PUNCH_HOLE) on it.
>
> __split_huge_pmd()'s page lock was added in 5.8, to make sure that any
> concurrent use of reuse_swap_page() (holding page lock) could not catch
> the anon THP's mapcounts and swapcounts while they were being split.
>
> Fortunately, reuse_swap_page() is never applied to a shmem or file THP
> (not even by khugepaged, which checks PageSwapCache before calling),
> and anonymous THPs are never created in shmem or file areas: so that
> __split_huge_pmd()'s page lock can only be necessary for anonymous THPs,
> on which there is no risk of deadlock with i_mmap_rwsem.

	CPU0				CPU1
	----				----
	kswapd				madvise
	lock_page			down_write(i_mmap_rwsem)
	down_read(i_mmap_rwsem)		lock_page

Given that nothing is wrong on the reclaimer's side, it is the reverse
locking order on CPU1 that opens the door to this deadlock, perhaps a
long-lasting one, if I don't misread you.

> Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
> Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>
> The status of reuse_swap_page(), and its use on THPs, is currently under
> discussion, and may need to be changed: but this patch is a simple fix
> to the reported deadlock, which can go in now, and be easily backported
> to whichever stable and longterm releases took in 5.8's c444eb564fb1.
>
>  mm/huge_memory.c | 37 +++++++++++++++++++++++--------------
>  1 file changed, 23 insertions(+), 14 deletions(-)
>
> --- 5.11-rc3/mm/huge_memory.c	2020-12-27 20:39:37.667932292 -0800
> +++ linux/mm/huge_memory.c	2021-01-16 08:02:08.265551393 -0800
> @@ -2202,7 +2202,7 @@ void __split_huge_pmd(struct vm_area_str
>  {
>  	spinlock_t *ptl;
>  	struct mmu_notifier_range range;
> -	bool was_locked = false;
> +	bool do_unlock_page = false;
>  	pmd_t _pmd;
>
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> @@ -2218,7 +2218,6 @@ void __split_huge_pmd(struct vm_area_str
>  	VM_BUG_ON(freeze && !page);
>  	if (page) {
>  		VM_WARN_ON_ONCE(!PageLocked(page));
> -		was_locked = true;
>  		if (page != pmd_page(*pmd))
>  			goto out;
>  	}
> @@ -2227,19 +2226,29 @@ repeat:
>  	if (pmd_trans_huge(*pmd)) {
>  		if (!page) {
>  			page = pmd_page(*pmd);
> -			if (unlikely(!trylock_page(page))) {
> -				get_page(page);
> -				_pmd = *pmd;
> -				spin_unlock(ptl);
> -				lock_page(page);
> -				spin_lock(ptl);
> -				if (unlikely(!pmd_same(*pmd, _pmd))) {
> -					unlock_page(page);
> +			/*
> +			 * An anonymous page must be locked, to ensure that a
> +			 * concurrent reuse_swap_page() sees stable mapcount;
> +			 * but reuse_swap_page() is not used on shmem or file,
> +			 * and page lock must not be taken when zap_pmd_range()
> +			 * calls __split_huge_pmd() while i_mmap_lock is held.
> +			 */
> +			if (PageAnon(page)) {
> +				if (unlikely(!trylock_page(page))) {
> +					get_page(page);
> +					_pmd = *pmd;
> +					spin_unlock(ptl);
> +					lock_page(page);
> +					spin_lock(ptl);
> +					if (unlikely(!pmd_same(*pmd, _pmd))) {
> +						unlock_page(page);
> +						put_page(page);
> +						page = NULL;
> +						goto repeat;
> +					}
>  					put_page(page);
> -					page = NULL;
> -					goto repeat;
>  				}
> -				put_page(page);
> +				do_unlock_page = true;
>  			}
>  		}
>  		if (PageMlocked(page))
> @@ -2249,7 +2258,7 @@ repeat:
>  	__split_huge_pmd_locked(vma, pmd, range.start, freeze);
>  out:
>  	spin_unlock(ptl);
> -	if (!was_locked && page)
> +	if (do_unlock_page)
>  		unlock_page(page);
>  	/*
>  	 * No need to double call mmu_notifier->invalidate_range() callback.
>
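To make the reverse locking order concrete, here is a minimal userspace analogue of the CPU0/CPU1 diagram above (an illustration only, not kernel code): two pthread mutexes stand in for the page lock and i_mmap_rwsem, and the two threads take them in opposite orders, as kswapd and the hole-punching madvise do.

/*
 * Minimal userspace analogue of the ABBA deadlock discussed above:
 * one mutex stands in for the page lock, the other for i_mmap_rwsem.
 * The "reclaimer" thread takes them in the kswapd order, the "puncher"
 * thread in the madvise(MADV_REMOVE) order; with the sleeps widening
 * the window, the two threads deadlock.  Build with: cc -pthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t i_mmap_rwsem = PTHREAD_MUTEX_INITIALIZER;

static void *reclaimer(void *arg)              /* CPU0: kswapd */
{
	(void)arg;
	pthread_mutex_lock(&page_lock);        /* lock_page(page) */
	sleep(1);                              /* let the other thread run */
	pthread_mutex_lock(&i_mmap_rwsem);     /* down_read(i_mmap_rwsem) */
	puts("reclaimer: took both locks");
	pthread_mutex_unlock(&i_mmap_rwsem);
	pthread_mutex_unlock(&page_lock);
	return NULL;
}

static void *puncher(void *arg)                /* CPU1: madvise(MADV_REMOVE) */
{
	(void)arg;
	pthread_mutex_lock(&i_mmap_rwsem);     /* down_write(i_mmap_rwsem) */
	sleep(1);
	pthread_mutex_lock(&page_lock);        /* lock_page() in __split_huge_pmd() */
	puts("puncher: took both locks");
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&i_mmap_rwsem);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, reclaimer, NULL);
	pthread_create(&t2, NULL, puncher, NULL);
	pthread_join(t1, NULL);                /* never returns: ABBA deadlock */
	pthread_join(t2, NULL);
	return 0;
}

Neither message is ever printed: each thread blocks on the lock the other already holds, which is the reverse-order hang described above, translated to userspace.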
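For the hole-punching side of that diagram, here is a sketch of the operation sequence the changelog describes: a tmpfs-backed mapping marked with madvise(MADV_HUGEPAGE), faulted in, then a hole punched with madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE). It is not a self-contained reproducer of the deadlock (that also needs shmem THP enabled, e.g. shmem_enabled set to advise, plus concurrent reclaim on the same pages), and the file name and size here are arbitrary; it only shows the syscalls involved.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (512UL << 20)	/* 512 MB of tmpfs; the size is arbitrary */

int main(void)
{
	/* memfd_create() gives a tmpfs-backed (shmem) file */
	int fd = memfd_create("thp-hole-punch", 0);
	if (fd < 0 || ftruncate(fd, SZ) < 0) {
		perror("memfd_create/ftruncate");
		return 1;
	}

	char *map = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Ask for huge pages on this area, then fault it in */
	madvise(map, SZ, MADV_HUGEPAGE);
	memset(map, 0x5a, SZ);

	/*
	 * Punch a hole in the mapped range: this is the path where
	 * unmap_mapping_range() holds i_mmap_rwsem for write and
	 * zap_pmd_range() reaches __split_huge_pmd().
	 */
	if (madvise(map, SZ, MADV_REMOVE) < 0)
		perror("madvise(MADV_REMOVE)");

	/* The same hole punch via the file (the range is already empty here) */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, SZ) < 0)
		perror("fallocate(FALLOC_FL_PUNCH_HOLE)");

	munmap(map, SZ);
	close(fd);
	return 0;
}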