Message ID | 20230306092259.3507807-2-fengwei.yin@intel.com (mailing list archive)
---|---
State | New
Series | batched remove rmap in try_to_unmap_one()
On 03/06/23 17:22, Yin Fengwei wrote:
> It's to prepare the batched rmap update for large folio.
> No need to looped handle hugetlb. Just handle hugetlb and
> bail out early.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 121 insertions(+), 79 deletions(-)

Looks good,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

A few nits below.

>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index ba901c416785..508d141dacc5 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>  	munlock_vma_folio(folio, vma, compound);
>  }
>
> +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> +		struct vm_area_struct *vma, struct mmu_notifier_range range,
> +		struct page_vma_mapped_walk pvmw, unsigned long address,
> +		enum ttu_flags flags)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pte_t pteval;
> +	bool ret = true, anon = folio_test_anon(folio);
> +
> +	/*
> +	 * The try_to_unmap() is only passed a hugetlb page
> +	 * in the case where the hugetlb page is poisoned.
> +	 */
> +	VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> +	/*
> +	 * huge_pmd_unshare may unmap an entire PMD page.
> +	 * There is no way of knowing exactly which PMDs may
> +	 * be cached for this mm, so we must flush them all.
> +	 * start/end were already adjusted above to cover this

nit, start/end are adjusted in caller (try_to_unmap_one) not above.

> +	 * range.
> +	 */
> +	flush_cache_range(vma, range.start, range.end);
> +
> +	/*
> +	 * To call huge_pmd_unshare, i_mmap_rwsem must be
> +	 * held in write mode. Caller needs to explicitly
> +	 * do this outside rmap routines.
> +	 *
> +	 * We also must hold hugetlb vma_lock in write mode.
> +	 * Lock order dictates acquiring vma_lock BEFORE
> +	 * i_mmap_rwsem. We can only try lock here and fail
> +	 * if unsuccessful.
> +	 */
> +	if (!anon) {
> +		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +		if (!hugetlb_vma_trylock_write(vma)) {
> +			ret = false;
> +			goto out;
> +		}
> +		if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +			hugetlb_vma_unlock_write(vma);
> +			flush_tlb_range(vma,
> +				range.start, range.end);
> +			mmu_notifier_invalidate_range(mm,
> +				range.start, range.end);
> +			/*
> +			 * The ref count of the PMD page was
> +			 * dropped which is part of the way map
> +			 * counting is done for shared PMDs.
> +			 * Return 'true' here. When there is
> +			 * no other sharing, huge_pmd_unshare
> +			 * returns false and we will unmap the
> +			 * actual page and drop map count
> +			 * to zero.
> +			 */
> +			goto out;
> +		}
> +		hugetlb_vma_unlock_write(vma);
> +	}
> +	pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> +
> +	/*
> +	 * Now the pte is cleared. If this pte was uffd-wp armed,
> +	 * we may want to replace a none pte with a marker pte if
> +	 * it's file-backed, so we don't lose the tracking info.
> +	 */
> +	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> +
> +	/* Set the dirty flag on the folio now the pte is gone. */
> +	if (pte_dirty(pteval))

nit, technically, I suppose this should be huge_pte_dirty but it really is
the same as pte_dirty which is why it works in current code.

> +		folio_mark_dirty(folio);
> +
> +	/* Update high watermark before we lower rss */
> +	update_hiwater_rss(mm);
> +
> +	/* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
> +	pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> +	set_huge_pte_at(mm, address, pvmw.pte, pteval);
> +	hugetlb_count_sub(folio_nr_pages(folio), mm);
> +
> +	/*
> +	 * No need to call mmu_notifier_invalidate_range() it has be
> +	 * done above for all cases requiring it to happen under page
> +	 * table lock before mmu_notifier_invalidate_range_end()
> +	 *
> +	 * See Documentation/mm/mmu_notifier.rst
> +	 */
> +	page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));

nit, we KNOW folio_test_hugetlb(folio) is true here so can just pass
'true'. In addition, the same call in try_to_unmap_one is now known to
always be false. No need to check folio_test_hugetlb(folio) there as well.
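For reference, the two nit fixes would look roughly like this inside try_to_unmap_one_hugetlb(); this is only a sketch of the suggested changes, not code from the posted series (huge_pte_dirty() is the generic hugetlb wrapper that falls back to pte_dirty()):

	/* Set the dirty flag on the folio now the pte is gone. */
	if (huge_pte_dirty(pteval))	/* hugetlb accessor instead of pte_dirty() */
		folio_mark_dirty(folio);

	/* ... */

	/* the folio is known to be hugetlb here, so the compound argument is simply true */
	page_remove_rmap(&folio->page, vma, true);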
On Wed, 2023-03-08 at 13:38 -0800, Mike Kravetz wrote:
> On 03/06/23 17:22, Yin Fengwei wrote:
> > It's to prepare the batched rmap update for large folio.
> > No need to looped handle hugetlb. Just handle hugetlb and
> > bail out early.
> >
> > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> > ---
> >  mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
> >  1 file changed, 121 insertions(+), 79 deletions(-)
>
> Looks good,
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks a lot for reviewing.

>
> A few nits below.
>
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index ba901c416785..508d141dacc5 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> >  	munlock_vma_folio(folio, vma, compound);
> >  }
> >
> > +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> > +		struct vm_area_struct *vma, struct mmu_notifier_range range,
> > +		struct page_vma_mapped_walk pvmw, unsigned long address,
> > +		enum ttu_flags flags)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	pte_t pteval;
> > +	bool ret = true, anon = folio_test_anon(folio);
> > +
> > +	/*
> > +	 * The try_to_unmap() is only passed a hugetlb page
> > +	 * in the case where the hugetlb page is poisoned.
> > +	 */
> > +	VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> > +	/*
> > +	 * huge_pmd_unshare may unmap an entire PMD page.
> > +	 * There is no way of knowing exactly which PMDs may
> > +	 * be cached for this mm, so we must flush them all.
> > +	 * start/end were already adjusted above to cover this
>
> nit, start/end are adjusted in caller (try_to_unmap_one) not above.

Yes. Will update the comment.

> > +	 * range.
> > +	 */
> > +	flush_cache_range(vma, range.start, range.end);
> > +
> > +	/*
> > +	 * To call huge_pmd_unshare, i_mmap_rwsem must be
> > +	 * held in write mode. Caller needs to explicitly
> > +	 * do this outside rmap routines.
> > +	 *
> > +	 * We also must hold hugetlb vma_lock in write mode.
> > +	 * Lock order dictates acquiring vma_lock BEFORE
> > +	 * i_mmap_rwsem. We can only try lock here and fail
> > +	 * if unsuccessful.
> > +	 */
> > +	if (!anon) {
> > +		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> > +		if (!hugetlb_vma_trylock_write(vma)) {
> > +			ret = false;
> > +			goto out;
> > +		}
> > +		if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> > +			hugetlb_vma_unlock_write(vma);
> > +			flush_tlb_range(vma,
> > +				range.start, range.end);
> > +			mmu_notifier_invalidate_range(mm,
> > +				range.start, range.end);
> > +			/*
> > +			 * The ref count of the PMD page was
> > +			 * dropped which is part of the way map
> > +			 * counting is done for shared PMDs.
> > +			 * Return 'true' here. When there is
> > +			 * no other sharing, huge_pmd_unshare
> > +			 * returns false and we will unmap the
> > +			 * actual page and drop map count
> > +			 * to zero.
> > +			 */
> > +			goto out;
> > +		}
> > +		hugetlb_vma_unlock_write(vma);
> > +	}
> > +	pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > +	/*
> > +	 * Now the pte is cleared. If this pte was uffd-wp armed,
> > +	 * we may want to replace a none pte with a marker pte if
> > +	 * it's file-backed, so we don't lose the tracking info.
> > +	 */
> > +	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> > +
> > +	/* Set the dirty flag on the folio now the pte is gone. */
> > +	if (pte_dirty(pteval))
>
> nit, technically, I suppose this should be huge_pte_dirty but it really is
> the same as pte_dirty which is why it works in current code.

Yes. Will update to use huge_pte_dirty().

> > +		folio_mark_dirty(folio);
> > +
> > +	/* Update high watermark before we lower rss */
> > +	update_hiwater_rss(mm);
> > +
> > +	/* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
> > +	pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> > +	set_huge_pte_at(mm, address, pvmw.pte, pteval);
> > +	hugetlb_count_sub(folio_nr_pages(folio), mm);
> > +
> > +	/*
> > +	 * No need to call mmu_notifier_invalidate_range() it has be
> > +	 * done above for all cases requiring it to happen under page
> > +	 * table lock before mmu_notifier_invalidate_range_end()
> > +	 *
> > +	 * See Documentation/mm/mmu_notifier.rst
> > +	 */
> > +	page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
>
> nit, we KNOW folio_test_hugetlb(folio) is true here so can just pass
> 'true'. In addition, the same call in try_to_unmap_one is now known to
> always be false. No need to check folio_test_hugetlb(folio) there as well.

The "folio_test_hugetlb(folio) -> true" change was in patch 3. I tried to
apply "one patch for code movement and one patch for code change" to make
review easier. But since this patch already changes the code, I will move
the "folio_test_hugetlb(folio) -> true" change from patch 3 to this one.

Thanks.

Regards
Yin, Fengwei
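The matching simplification on the try_to_unmap_one() side that Mike points out (once hugetlb folios bail out early, folio_test_hugetlb() can no longer be true on the remaining path) might look like this; again a sketch, not the posted code:

	/* hugetlb was handled and bailed out above, so this is never a hugetlb mapping */
	page_remove_rmap(subpage, vma, false);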
diff --git a/mm/rmap.c b/mm/rmap.c
index ba901c416785..508d141dacc5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
 	munlock_vma_folio(folio, vma, compound);
 }
 
+static bool try_to_unmap_one_hugetlb(struct folio *folio,
+		struct vm_area_struct *vma, struct mmu_notifier_range range,
+		struct page_vma_mapped_walk pvmw, unsigned long address,
+		enum ttu_flags flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t pteval;
+	bool ret = true, anon = folio_test_anon(folio);
+
+	/*
+	 * The try_to_unmap() is only passed a hugetlb page
+	 * in the case where the hugetlb page is poisoned.
+	 */
+	VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
+	/*
+	 * huge_pmd_unshare may unmap an entire PMD page.
+	 * There is no way of knowing exactly which PMDs may
+	 * be cached for this mm, so we must flush them all.
+	 * start/end were already adjusted above to cover this
+	 * range.
+	 */
+	flush_cache_range(vma, range.start, range.end);
+
+	/*
+	 * To call huge_pmd_unshare, i_mmap_rwsem must be
+	 * held in write mode. Caller needs to explicitly
+	 * do this outside rmap routines.
+	 *
+	 * We also must hold hugetlb vma_lock in write mode.
+	 * Lock order dictates acquiring vma_lock BEFORE
+	 * i_mmap_rwsem. We can only try lock here and fail
+	 * if unsuccessful.
+	 */
+	if (!anon) {
+		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+		if (!hugetlb_vma_trylock_write(vma)) {
+			ret = false;
+			goto out;
+		}
+		if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+			hugetlb_vma_unlock_write(vma);
+			flush_tlb_range(vma,
+				range.start, range.end);
+			mmu_notifier_invalidate_range(mm,
+				range.start, range.end);
+			/*
+			 * The ref count of the PMD page was
+			 * dropped which is part of the way map
+			 * counting is done for shared PMDs.
+			 * Return 'true' here. When there is
+			 * no other sharing, huge_pmd_unshare
+			 * returns false and we will unmap the
+			 * actual page and drop map count
+			 * to zero.
+			 */
+			goto out;
+		}
+		hugetlb_vma_unlock_write(vma);
+	}
+	pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+
+	/*
+	 * Now the pte is cleared. If this pte was uffd-wp armed,
+	 * we may want to replace a none pte with a marker pte if
+	 * it's file-backed, so we don't lose the tracking info.
+	 */
+	pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+	/* Set the dirty flag on the folio now the pte is gone. */
+	if (pte_dirty(pteval))
+		folio_mark_dirty(folio);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	/* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
+	pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
+	set_huge_pte_at(mm, address, pvmw.pte, pteval);
+	hugetlb_count_sub(folio_nr_pages(folio), mm);
+
+	/*
+	 * No need to call mmu_notifier_invalidate_range() it has be
+	 * done above for all cases requiring it to happen under page
+	 * table lock before mmu_notifier_invalidate_range_end()
+	 *
+	 * See Documentation/mm/mmu_notifier.rst
+	 */
+	page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+	/* No VM_LOCKED set in vma->vm_flags for hugetlb. So not
+	 * necessary to call mlock_drain_local().
+	 */
+	folio_put(folio);
+
+out:
+	return ret;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1504,86 +1601,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			break;
 		}
 
+		address = pvmw.address;
+		if (folio_test_hugetlb(folio)) {
+			ret = try_to_unmap_one_hugetlb(folio, vma, range,
+							pvmw, address, flags);
+
+			/* no need to loop for hugetlb */
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
 		subpage = folio_page(folio,
 					pte_pfn(*pvmw.pte) - folio_pfn(folio));
-		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
 
-		if (folio_test_hugetlb(folio)) {
-			bool anon = folio_test_anon(folio);
-
-			/*
-			 * The try_to_unmap() is only passed a hugetlb page
-			 * in the case where the hugetlb page is poisoned.
-			 */
-			VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		/* Nuke the page table entry. */
+		if (should_defer_flush(mm, flags)) {
 			/*
-			 * huge_pmd_unshare may unmap an entire PMD page.
-			 * There is no way of knowing exactly which PMDs may
-			 * be cached for this mm, so we must flush them all.
-			 * start/end were already adjusted above to cover this
-			 * range.
+			 * We clear the PTE but do not flush so potentially
+			 * a remote CPU could still be writing to the folio.
+			 * If the entry was previously clean then the
+			 * architecture must guarantee that a clear->dirty
+			 * transition on a cached TLB entry is written through
+			 * and traps if the PTE is unmapped.
 			 */
-			flush_cache_range(vma, range.start, range.end);
+			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode. Caller needs to explicitly
-			 * do this outside rmap routines.
-			 *
-			 * We also must hold hugetlb vma_lock in write mode.
-			 * Lock order dictates acquiring vma_lock BEFORE
-			 * i_mmap_rwsem. We can only try lock here and fail
-			 * if unsuccessful.
-			 */
-			if (!anon) {
-				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
-				if (!hugetlb_vma_trylock_write(vma)) {
-					page_vma_mapped_walk_done(&pvmw);
-					ret = false;
-					break;
-				}
-				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-					hugetlb_vma_unlock_write(vma);
-					flush_tlb_range(vma,
-						range.start, range.end);
-					mmu_notifier_invalidate_range(mm,
-						range.start, range.end);
-					/*
-					 * The ref count of the PMD page was
-					 * dropped which is part of the way map
-					 * counting is done for shared PMDs.
-					 * Return 'true' here. When there is
-					 * no other sharing, huge_pmd_unshare
-					 * returns false and we will unmap the
-					 * actual page and drop map count
-					 * to zero.
-					 */
-					page_vma_mapped_walk_done(&pvmw);
-					break;
-				}
-				hugetlb_vma_unlock_write(vma);
-			}
-			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
 		} else {
-			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
-			/* Nuke the page table entry. */
-			if (should_defer_flush(mm, flags)) {
-				/*
-				 * We clear the PTE but do not flush so potentially
-				 * a remote CPU could still be writing to the folio.
-				 * If the entry was previously clean then the
-				 * architecture must guarantee that a clear->dirty
-				 * transition on a cached TLB entry is written through
-				 * and traps if the PTE is unmapped.
-				 */
-				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
-			} else {
-				pteval = ptep_clear_flush(vma, address, pvmw.pte);
-			}
+			pteval = ptep_clear_flush(vma, address, pvmw.pte);
 		}
 
 		/*
@@ -1602,14 +1650,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
-				set_huge_pte_at(mm, address, pvmw.pte, pteval);
-			} else {
-				dec_mm_counter(mm, mm_counter(&folio->page));
-				set_pte_at(mm, address, pvmw.pte, pteval);
-			}
-
+			dec_mm_counter(mm, mm_counter(&folio->page));
+			set_pte_at(mm, address, pvmw.pte, pteval);
 		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
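In outline, the refactored walk loop in try_to_unmap_one() now handles hugetlb exactly once and leaves the loop, instead of special-casing it on every iteration. A simplified sketch of the control flow shown in the diff above (details and error paths omitted):

	while (page_vma_mapped_walk(&pvmw)) {
		/* ... */
		address = pvmw.address;

		if (folio_test_hugetlb(folio)) {
			/* a single huge PTE maps the whole folio: handle it and stop walking */
			ret = try_to_unmap_one_hugetlb(folio, vma, range,
							pvmw, address, flags);
			page_vma_mapped_walk_done(&pvmw);
			break;
		}

		/* regular (possibly large) folios continue with the per-PTE handling */
		/* ... */
	}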
This prepares for the batched rmap update for large folios. There is no need
to handle hugetlb inside the loop: handle hugetlb once and bail out early.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 121 insertions(+), 79 deletions(-)