| Message ID | 20230809083256.699513-1-david@redhat.com (mailing list archive) |
|---|---|
| State | New |
| Series | [mm-unstable,v1] mm: add a total mapcount for large folios |
On 9 Aug 2023, at 4:32, David Hildenbrand wrote:

> Let's track the total mapcount for all large folios in the first subpage.
>
> The total mapcount is what we actually want to know in folio_mapcount()
> and it is also sufficient for implementing folio_mapped(). This also
> gets rid of any "raceiness" concerns as expressed in
> folio_total_mapcount().
>
> With sub-PMD THP becoming more important and things looking promising
> that we will soon get support for such anon THP, we want to avoid looping
> over all pages of a folio just to calculate the total mapcount. Further,
> we may soon want to use the total mapcount in other context more
> frequently, so prepare for reading it efficiently and atomically.
>
> Make room for the total mapcount in page[1] of the folio by moving
> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb
> -- and with the total mapcount in place probably also not desirable even
> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we
> can overlay the hugetlb fields.
>
> Note that we currently don't expect any order-1 compound pages / THP in
> rmap code, and that such support is not planned. If ever desired, we could
> move the compound mapcount to another page, because it only applies to
> PMD-mappable folios that are definitely larger. Let's avoid consuming
> more space elsewhere for now -- we might need more space for a different
> purpose in some subpages soon.
>
> Maintain the total mapcount also for hugetlb pages. Use the total mapcount
> to implement folio_mapcount(), total_mapcount(), folio_mapped() and
> page_mapped().
>
> We can now get rid of folio_total_mapcount() and
> folio_large_is_mapped(), by just inlining reading of the total mapcount.
>
> _nr_pages_mapped is now only used in rmap code, so not accidentially
> externally where it might be used on arbitrary order-1 pages. The remaining
> usage is:
>
> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED
>  -> If we would account the total folio as mapped when mapping a
>     page (based on the total mapcount), we could remove that usage.
>
> (2) Detect when to add a folio to the deferred split queue
>  -> If we would apply a different heuristic, or scan using the rmap on
>     the memory reclaim path for partially mapped anon folios to
>     split them, we could remove that usage as well.
>
> So maybe, we can simplify things in the future and remove
> _nr_pages_mapped. For now, leave these things as they are, they need more
> thought. Hugh really did a nice job implementing that precise tracking
> after all.
>
> Note: Not adding order-1 sanity checks to the file_rmap functions for
>       now. For anon pages, they are certainly not required right now.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Yin Fengwei <fengwei.yin@intel.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  Documentation/mm/transhuge.rst | 12 +++++-----
>  include/linux/mm.h             | 31 ++++++------------------
>  include/linux/mm_types.h       |  6 +++--
>  include/linux/rmap.h           |  7 ++++--
>  mm/debug.c                     |  4 ++--
>  mm/huge_memory.c               |  2 ++
>  mm/hugetlb.c                   |  4 ++--
>  mm/internal.h                  | 17 ++++---------
>  mm/page_alloc.c                |  4 ++--
>  mm/rmap.c                      | 44 ++++++++++++++--------------------
>  10 files changed, 52 insertions(+), 79 deletions(-)

LGTM. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
On 09/08/2023 09:32, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
>
> [...]
>
> Cc: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing
is correct:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  10 files changed, 52 insertions(+), 79 deletions(-)
>
> [...]
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6a95dfed4957..30ac004fa0ef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> [...]
>  static inline int folio_mapcount(struct folio *folio)
>  {
>  	if (likely(!folio_test_large(folio)))
>  		return atomic_read(&folio->_mapcount) + 1;
> -	return folio_total_mapcount(folio);
> +	return atomic_read(&folio->_total_mapcount) + 1;
>  }
>
>  static inline int total_mapcount(struct page *page)

nit: couldn't total_mapcount() just be implemented as a wrapper around
folio_mapcount()? What's the benefit of PageCompound() check instead of just
getting the folio and checking if it's large? i.e:

static inline int total_mapcount(struct page *page)
{
	return folio_mapcount(page_folio(page));
}

>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) + 1;
> -	return folio_total_mapcount(page_folio(page));
> -}
> -
> -static inline bool folio_large_is_mapped(struct folio *folio)
> -{
> -	/*
> -	 * Reading _entire_mapcount below could be omitted if hugetlb
> -	 * participated in incrementing nr_pages_mapped when compound mapped.
> -	 */
> -	return atomic_read(&folio->_nr_pages_mapped) > 0 ||
> -	       atomic_read(&folio->_entire_mapcount) >= 0;
> +	return atomic_read(&page_folio(page)->_total_mapcount) + 1;
>  }
>
> [...]
>
> @@ -1246,7 +1229,7 @@ static inline bool page_mapped(struct page *page)
>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) >= 0;
> -	return folio_large_is_mapped(page_folio(page));
> +	return atomic_read(&page_folio(page)->_total_mapcount) >= 0;

nit: same here; couldn't this be a wrapper around folio_mapped()?

>  }
>
> [...]
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f498e8025cc..6a614c559ccf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>  	struct page *p;
>
>  	atomic_set(&folio->_entire_mapcount, 0);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, 0);

Just checking this is definitely what you intended? _total_mapcount is -1 when
it means "no pages mapped", so 0 means 1 page mapped?

>  	atomic_set(&folio->_pincount, 0);
>
> [...]
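[Editor's note: a minimal sketch of the wrapper approach Ryan suggests above, assuming the folio_mapcount() and folio_mapped() definitions introduced by this patch. It is not part of the posted series; it only spells out the two one-liners the review hints at.]

```c
/* Sketch only -- not from the posted patch. */
static inline int total_mapcount(struct page *page)
{
	/* page_folio() also handles order-0 pages: folio == page. */
	return folio_mapcount(page_folio(page));
}

static inline bool page_mapped(struct page *page)
{
	return folio_mapped(page_folio(page));
}
```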
On 09.08.23 21:07, Ryan Roberts wrote:
> On 09/08/2023 09:32, David Hildenbrand wrote:
>> Let's track the total mapcount for all large folios in the first subpage.
>>
>> [...]
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing
> is correct:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

Thanks for the review!

[...]

>>  static inline int total_mapcount(struct page *page)
>
> nit: couldn't total_mapcount() just be implemented as a wrapper around
> folio_mapcount()? What's the benefit of PageCompound() check instead of just
> getting the folio and checking if it's large? i.e:

Good point, let me take a look tomorrow if the compiler can optimize in both
cases equally well.

[...]

>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 5f498e8025cc..6a614c559ccf 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>>  	struct page *p;
>>
>>  	atomic_set(&folio->_entire_mapcount, 0);
>> -	atomic_set(&folio->_nr_pages_mapped, 0);
>> +	atomic_set(&folio->_total_mapcount, 0);
>
> Just checking this is definitely what you intended? _total_mapcount is -1 when
> it means "no pages mapped", so 0 means 1 page mapped?

I was blindly doing what _entire_mapcount is doing: zeroing out the values. ;)

But let's look into the details: in __destroy_compound_gigantic_folio(), we're
manually dissolving the whole compound page. So instead of actually returning a
compound page to the buddy (where we would make sure the mapcounts are -1, to
then zero them out !), we simply zero-out the fields we use and then dissolve
the compound page: to be left with a bunch of order-0 pages where the memmap is
in a clean state.

(the buddy doesn't handle that page order, so we have to do things manually to
get to order-0 pages we can reuse or free)
On Wed, Aug 09, 2023 at 08:07:43PM +0100, Ryan Roberts wrote:
> > +++ b/mm/hugetlb.c
> > @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
> >  	struct page *p;
> >
> >  	atomic_set(&folio->_entire_mapcount, 0);
> > -	atomic_set(&folio->_nr_pages_mapped, 0);
> > +	atomic_set(&folio->_total_mapcount, 0);
>
> Just checking this is definitely what you intended? _total_mapcount is -1 when
> it means "no pages mapped", so 0 means 1 page mapped?

We're destroying the page here, so rather than setting the meaning of
this, we're setting the contents of this memory to 0.

Other thoughts that ran through my mind ... can we wrap? I don't think
we can; we always increment total_mapcount by 1, no matter whether we're
incrementing entire_mapcount or an individual page's mapcount, and we
always call folio_get() first, so we can't increment total_mapcount
past 2^32 because folio_get() will die first. We might be able to
wrap past 2^31, but I don't think so.

I had some other thoughts, but I convinced myself they were all OK.
On 09.08.23 21:21, Matthew Wilcox wrote:
> On Wed, Aug 09, 2023 at 08:07:43PM +0100, Ryan Roberts wrote:
>>> +++ b/mm/hugetlb.c
>>> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>>>  	struct page *p;
>>>
>>>  	atomic_set(&folio->_entire_mapcount, 0);
>>> -	atomic_set(&folio->_nr_pages_mapped, 0);
>>> +	atomic_set(&folio->_total_mapcount, 0);
>>
>> Just checking this is definitely what you intended? _total_mapcount is -1 when
>> it means "no pages mapped", so 0 means 1 page mapped?
>
> We're destroying the page here, so rather than setting the meaning of
> this, we're setting the contents of this memory to 0.
>
> Other thoughts that ran through my mind ... can we wrap? I don't think
> we can; we always increment total_mapcount by 1, no matter whether we're
> incrementing entire_mapcount or an individual page's mapcount, and we
> always call folio_get() first, so we can't increment total_mapcount
> past 2^32 because folio_get() will die first. We might be able to
> wrap past 2^31, but I don't think so.

From my understanding, if we wrap the total mapcount, we already wrapped the
refcount -- as you say, grabbing a reference ahead of time for each mapping is
mandatory. Both are 31bit values. We could treat the total mapcount as an
unsigned int, but that's rather future work.

Also, even folio_mapcount() and total_mapcount() return an "int" as of now.

But yes, I also thought about that. In the future we might want (at least) for
bigger folios refcount+total_mapcount to be 64bit. Or we manage to decouple
both and only have the total_mapcount be 64bit only.
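[Editor's note: the invariant Matthew and David rely on in the exchange above is that every mapping is preceded by folio_get(), so the total mapcount is bounded by the 31-bit folio refcount. A hypothetical debug check, written against the _total_mapcount field this patch adds, could look like the sketch below; the helper name is invented for illustration and exists neither in the kernel nor in this series.]

```c
/* Hypothetical debug helper -- illustration only. */
static inline void folio_check_mapcount_vs_refcount(struct folio *folio)
{
	/*
	 * Each mapping holds a folio reference taken beforehand, so the
	 * total mapcount cannot exceed the refcount; a wrap of the former
	 * would imply an (earlier) wrap of the latter.
	 */
	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_ref_count(folio));
}
```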
Hi, David,

Some pure questions below..

On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
>
> The total mapcount is what we actually want to know in folio_mapcount()
> and it is also sufficient for implementing folio_mapped(). This also
> gets rid of any "raceiness" concerns as expressed in
> folio_total_mapcount().

Any more information for that "raciness" described here?

>
> With sub-PMD THP becoming more important and things looking promising
> that we will soon get support for such anon THP, we want to avoid looping
> over all pages of a folio just to calculate the total mapcount. Further,
> we may soon want to use the total mapcount in other context more
> frequently, so prepare for reading it efficiently and atomically.

Any (perhaps existing) discussion on reduced loops vs added atomic
field/ops?

I had a feeling that there's some discussion behind the proposal of this
patch, if that's the case it'll be great to attach the link in the commit
log.

Thanks,
On 8/10/23 03:26, David Hildenbrand wrote:
> On 09.08.23 21:21, Matthew Wilcox wrote:
>> [...]
>>
>> Other thoughts that ran through my mind ... can we wrap? I don't think
>> we can; we always increment total_mapcount by 1, no matter whether we're
>> incrementing entire_mapcount or an individual page's mapcount, and we
>> always call folio_get() first, so we can't increment total_mapcount
>> past 2^32 because folio_get() will die first. We might be able to
>> wrap past 2^31, but I don't think so.
>
> From my understanding, if we wrap the total mapcount, we already wrapped the
> refcount -- as you say, grabbing a reference ahead of time for each mapping
> is mandatory. Both are 31bit values. We could treat the total mapcount as an
> unsigned int, but that's rather future work.
>
> Also, even folio_mapcount() and total_mapcount() return an "int" as of now.
>
> But yes, I also thought about that. In the future we might want (at least)
> for bigger folios refcount+total_mapcount to be 64bit. Or we manage to
> decouple both and only have the total_mapcount be 64bit only.

This means pvmw may need to check more than 2^32 pte entries. :). I hope we
have other way to get bigger folio and keep mapcount not too big.

Regards
Yin, Fengwei
On 8/9/23 16:32, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
>
> [...]
>
> Cc: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

Regards
Yin, Fengwei

> ---
> [...]
On Wed, Aug 09, 2023 at 05:23:46PM -0400, Peter Xu wrote:
> Hi, David,
>
> Some pure questions below..
>
> On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote:
> > Let's track the total mapcount for all large folios in the first subpage.
> >
> > The total mapcount is what we actually want to know in folio_mapcount()
> > and it is also sufficient for implementing folio_mapped(). This also
> > gets rid of any "raceiness" concerns as expressed in
> > folio_total_mapcount().
>
> Any more information for that "raciness" described here?

UTSL.

	/*
	 * Add all the PTE mappings of those pages mapped by PTE.
	 * Limit the loop to folio_nr_pages_mapped()?
	 * Perhaps: given all the raciness, that may be a good or a bad idea.
	 */
On 10.08.23 05:25, Matthew Wilcox wrote:
> On Wed, Aug 09, 2023 at 05:23:46PM -0400, Peter Xu wrote:
>> Hi, David,
>>
>> Some pure questions below..
>>
>> On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote:
>>> Let's track the total mapcount for all large folios in the first subpage.
>>>
>>> The total mapcount is what we actually want to know in folio_mapcount()
>>> and it is also sufficient for implementing folio_mapped(). This also
>>> gets rid of any "raceiness" concerns as expressed in
>>> folio_total_mapcount().
>>
>> Any more information for that "raciness" described here?
>
> UTSL.
>
> 	/*
> 	 * Add all the PTE mappings of those pages mapped by PTE.
> 	 * Limit the loop to folio_nr_pages_mapped()?
> 	 * Perhaps: given all the raciness, that may be a good or a bad idea.
> 	 */

Yes, that comment from Hugh primarily discusses how we could possibly optimize
the loop, and if relying on folio_nr_pages_mapped() to reduce the iterations
would be racy. As far as I can see, there are cases where "it would be
certainly a bad idea" :)

In the other comment in that function, it's also made clear what the
traditional behavior with PMD-mappable THP was "In the common case, avoid the
loop when no pages mapped by PTE", which will no longer hold with sub-PMD THP.
On 09.08.23 23:23, Peter Xu wrote:
> Hi, David,
>
> Some pure questions below..

Hi Peter,

thanks for having a look!

[...]

>> With sub-PMD THP becoming more important and things looking promising
>> that we will soon get support for such anon THP, we want to avoid looping
>> over all pages of a folio just to calculate the total mapcount. Further,
>> we may soon want to use the total mapcount in other context more
>> frequently, so prepare for reading it efficiently and atomically.
>
> Any (perhaps existing) discussion on reduced loops vs added atomic
> field/ops?

So far it's not been raised as a concern, so no existing discussion.

For order-0 pages the behavior is unchanged.

For PMD-mapped THP and hugetlb it's most certainly noise compared to the other
activities when (un)mapping these large pages.

For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is
really significant (judging from my experience on managing PageAnonExclusive
using set_bit/test_bit/clear_bit when (un)mapping anon pages).

As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be
batching where possible (and Ryan is working on some more rmap batching).
There, managing the subpage mapcount dominates all other overhead
significantly.

>
> I had a feeling that there's some discussion behind the proposal of this
> patch, if that's the case it'll be great to attach the link in the commit
> log.

There were (mostly offline) discussions on how to sort out some other issues
that PTE-mapped THP are facing, and how to eventually get rid of the subpage
mapcounts (one consumer being _nr_pages_mapped, as spelled out in the patch
description). Having a proper total mapcount available was discussed as one
building block.

I don't think I have anything of value to link that would make sense for the
patch as is, as this patch is mostly independent from all that.
On 09/08/2023 20:17, David Hildenbrand wrote: > On 09.08.23 21:07, Ryan Roberts wrote: >> On 09/08/2023 09:32, David Hildenbrand wrote: >>> Let's track the total mapcount for all large folios in the first subpage. >>> >>> The total mapcount is what we actually want to know in folio_mapcount() >>> and it is also sufficient for implementing folio_mapped(). This also >>> gets rid of any "raceiness" concerns as expressed in >>> folio_total_mapcount(). >>> >>> With sub-PMD THP becoming more important and things looking promising >>> that we will soon get support for such anon THP, we want to avoid looping >>> over all pages of a folio just to calculate the total mapcount. Further, >>> we may soon want to use the total mapcount in other context more >>> frequently, so prepare for reading it efficiently and atomically. >>> >>> Make room for the total mapcount in page[1] of the folio by moving >>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>> -- and with the total mapcount in place probably also not desirable even >>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>> can overlay the hugetlb fields. >>> >>> Note that we currently don't expect any order-1 compound pages / THP in >>> rmap code, and that such support is not planned. If ever desired, we could >>> move the compound mapcount to another page, because it only applies to >>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>> more space elsewhere for now -- we might need more space for a different >>> purpose in some subpages soon. >>> >>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>> page_mapped(). >>> >>> We can now get rid of folio_total_mapcount() and >>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>> >>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>> externally where it might be used on arbitrary order-1 pages. The remaining >>> usage is: >>> >>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>> Â -> If we would account the total folio as mapped when mapping a >>> Â Â Â Â page (based on the total mapcount), we could remove that usage. >>> >>> (2) Detect when to add a folio to the deferred split queue >>> Â -> If we would apply a different heuristic, or scan using the rmap on >>> Â Â Â Â the memory reclaim path for partially mapped anon folios to >>> Â Â Â Â split them, we could remove that usage as well. >>> >>> So maybe, we can simplify things in the future and remove >>> _nr_pages_mapped. For now, leave these things as they are, they need more >>> thought. Hugh really did a nice job implementing that precise tracking >>> after all. >>> >>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>> Â Â Â Â Â Â now. For anon pages, they are certainly not required right now. >>> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Jonathan Corbet <corbet@lwn.net> >>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>> Cc: Hugh Dickins <hughd@google.com> >>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>> Cc: Yang Shi <shy828301@gmail.com> >>> Cc: Zi Yan <ziy@nvidia.com> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >> >> Other than the nits and query on zeroing _total_mapcount below, LGTM. 
If zeroing >> is correct: >> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> > > Thanks for the review! > > [...] > >>> static inline int total_mapcount(struct page *page) >> >> nit: couldn't total_mapcount() just be implemented as a wrapper around >> folio_mapcount()? What's the benefit of PageCompound() check instead of just >> getting the folio and checking if it's large? i.e: > > Good point, let me take a look tomorrow if the compiler can optimize in both > cases equally well. > > [...] > >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>> index 5f498e8025cc..6a614c559ccf 100644 >>> --- a/mm/hugetlb.c >>> +++ b/mm/hugetlb.c >>> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct >>> folio *folio, >>> struct page *p; >>> atomic_set(&folio->_entire_mapcount, 0); >>> - atomic_set(&folio->_nr_pages_mapped, 0); >>> + atomic_set(&folio->_total_mapcount, 0); >> >> Just checking this is definitely what you intended? _total_mapcount is -1 when >> it means "no pages mapped", so 0 means 1 page mapped? > > I was blindly doing what _entire_mapcount is doing: zeroing out the values. ;) > > But let's look into the details: in __destroy_compound_gigantic_folio(), we're > manually dissolving the whole compound page. So instead of actually returning a > compound page to the buddy (where we would make sure the mapcounts are -1, to > then zero them out !), we simply zero-out the fields we use and then dissolve > the compound page: to be left with a bunch of order-0 pages where the memmap is > in a clean state. > > (the buddy doesn't handle that page order, so we have to do things manually to > get to order-0 pages we can reuse or free) > Yeah fair enough, thanks for the explanation.
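To make Ryan's nit concrete, here is a minimal sketch of the suggested wrapper, assuming folio_mapcount() reads the new total mapcount for large folios as this patch makes it do (illustration only, not necessarily the form that lands):

static inline int total_mapcount(struct page *page)
{
	/* Let the folio helper handle the small vs. large folio distinction. */
	return folio_mapcount(page_folio(page));
}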
On 10/08/2023 09:59, David Hildenbrand wrote: > On 09.08.23 23:23, Peter Xu wrote: >> Hi, David, >> >> Some pure questions below.. > > Hi Peter, > > thanks for having a look! > > [...] > >>> With sub-PMD THP becoming more important and things looking promising >>> that we will soon get support for such anon THP, we want to avoid looping >>> over all pages of a folio just to calculate the total mapcount. Further, >>> we may soon want to use the total mapcount in other context more >>> frequently, so prepare for reading it efficiently and atomically. >> >> Any (perhaps existing) discussion on reduced loops vs added atomic >> field/ops? > > So far it's not been raised as a concern, so no existing discussion. > > For order-0 pages the behavior is unchanged. > > For PMD-mapped THP and hugetlb it's most certainly noise compared to the other > activities when (un)mapping these large pages. > > For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is > really significant (judging from my experience on managing PageAnonExclusive > using set_bit/test_bit/clear_bit when (un)mapping anon pages). > > As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be > batching where possible (and Ryan is working on some more rmap batching). Yes, I've just posted [1] which batches the rmap removal. That would allow you to convert the per-page atomic_dec() into a (usually) single per-large-folio atomic_sub(). [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ > There, > managing the subpage mapcount dominates all other overhead significantly. > >> >> I had a feeling that there's some discussion behind the proposal of this >> patch, if that's the case it'll be great to attach the link in the commit >> log. > > There were (mostly offline) discussions on how to sort out some other issues > that PTE-mapped THP are facing, and how to eventually get rid of the subpage > mapcounts (once consumer being _nr_pages_mapped as spelled out in the patch > description). Having a proper total mapcount available was discussed as one > building block. > > I don't think I have anything of value to link that would make sense for the > patch as is, as this patch is mostly independent from all that. >
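As an illustration of what the batching Ryan mentions buys for the counter added by this patch (field name _total_mapcount taken from the patch diff; this is a sketch, not code from either series):

static inline void remove_rmap_unbatched(struct folio *folio, int nr)
{
	int i;

	/* One atomic op per PTE-mapped page being unmapped. */
	for (i = 0; i < nr; i++)
		atomic_dec(&folio->_total_mapcount);
}

static inline void remove_rmap_batched(struct folio *folio, int nr)
{
	/* A single atomic op for the whole unmapped range. */
	atomic_sub(nr, &folio->_total_mapcount);
}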
On 09.08.23 21:17, David Hildenbrand wrote: > On 09.08.23 21:07, Ryan Roberts wrote: >> On 09/08/2023 09:32, David Hildenbrand wrote: >>> Let's track the total mapcount for all large folios in the first subpage. >>> >>> The total mapcount is what we actually want to know in folio_mapcount() >>> and it is also sufficient for implementing folio_mapped(). This also >>> gets rid of any "raceiness" concerns as expressed in >>> folio_total_mapcount(). >>> >>> With sub-PMD THP becoming more important and things looking promising >>> that we will soon get support for such anon THP, we want to avoid looping >>> over all pages of a folio just to calculate the total mapcount. Further, >>> we may soon want to use the total mapcount in other context more >>> frequently, so prepare for reading it efficiently and atomically. >>> >>> Make room for the total mapcount in page[1] of the folio by moving >>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>> -- and with the total mapcount in place probably also not desirable even >>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>> can overlay the hugetlb fields. >>> >>> Note that we currently don't expect any order-1 compound pages / THP in >>> rmap code, and that such support is not planned. If ever desired, we could >>> move the compound mapcount to another page, because it only applies to >>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>> more space elsewhere for now -- we might need more space for a different >>> purpose in some subpages soon. >>> >>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>> page_mapped(). >>> >>> We can now get rid of folio_total_mapcount() and >>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>> >>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>> externally where it might be used on arbitrary order-1 pages. The remaining >>> usage is: >>> >>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>> -> If we would account the total folio as mapped when mapping a >>> page (based on the total mapcount), we could remove that usage. >>> >>> (2) Detect when to add a folio to the deferred split queue >>> -> If we would apply a different heuristic, or scan using the rmap on >>> the memory reclaim path for partially mapped anon folios to >>> split them, we could remove that usage as well. >>> >>> So maybe, we can simplify things in the future and remove >>> _nr_pages_mapped. For now, leave these things as they are, they need more >>> thought. Hugh really did a nice job implementing that precise tracking >>> after all. >>> >>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>> now. For anon pages, they are certainly not required right now. >>> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Jonathan Corbet <corbet@lwn.net> >>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>> Cc: Hugh Dickins <hughd@google.com> >>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>> Cc: Yang Shi <shy828301@gmail.com> >>> Cc: Zi Yan <ziy@nvidia.com> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >> >> Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing >> is correct: >> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> > > Thanks for the review! 
> > [...] > >>> >>> static inline int total_mapcount(struct page *page) >> >> nit: couldn't total_mapcount() just be implemented as a wrapper around >> folio_mapcount()? What's the benefit of PageCompound() check instead of just >> getting the folio and checking if it's large? i.e: > > Good point, let me take a look tomorrow if the compiler can optimize in > both cases equally well. I checked by adjusting total_mapcount(): Before: if (PageTransHuge(page) && total_mapcount(page) > 1) ffffffff81411931: 4c 89 e7 mov %r12,%rdi ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> ffffffff81411939: 85 c0 test %eax,%eax ffffffff8141193b: 74 29 je ffffffff81411966 <migrate_misplaced_page+0x166> ffffffff8141193d: 49 8b 04 24 mov (%r12),%rax return test_bit(PG_head, &page->flags) || ffffffff81411941: a9 00 00 01 00 test $0x10000,%eax ffffffff81411946: 0f 85 1f 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> READ_ONCE(page->compound_head) & 1; ffffffff8141194c: 49 8b 44 24 08 mov 0x8(%r12),%rax return test_bit(PG_head, &page->flags) || ffffffff81411951: a8 01 test $0x1,%al ffffffff81411953: 0f 85 12 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> ffffffff81411959: 41 8b 44 24 30 mov 0x30(%r12),%eax return atomic_read(&page->_mapcount) + 1; ffffffff8141195e: 83 c0 01 add $0x1,%eax ffffffff81411961: 83 f8 01 cmp $0x1,%eax ffffffff81411964: 7f 77 jg ffffffff814119dd <migrate_misplaced_page+0x1dd> So a total of 10 instructions after handling the mov/call/test/je for PageTransHuge(). After: if (PageTransHuge(page) && total_mapcount(page) > 1) ffffffff81411931: 4c 89 e7 mov %r12,%rdi ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> ffffffff81411939: 85 c0 test %eax,%eax ffffffff8141193b: 74 2f je ffffffff8141196c <migrate_misplaced_page+0x16c> unsigned long head = READ_ONCE(page->compound_head); ffffffff8141193d: 49 8b 44 24 08 mov 0x8(%r12),%rax if (unlikely(head & 1)) ffffffff81411942: a8 01 test $0x1,%al ffffffff81411944: 0f 85 fc 05 00 00 jne ffffffff81411f46 <migrate_misplaced_page+0x746> ffffffff8141194a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) return page; ffffffff8141194f: 4c 89 e0 mov %r12,%rax ffffffff81411952: 48 8b 10 mov (%rax),%rdx if (likely(!folio_test_large(folio))) ffffffff81411955: f7 c2 00 00 01 00 test $0x10000,%edx ffffffff8141195b: 0f 85 da 05 00 00 jne ffffffff81411f3b <migrate_misplaced_page+0x73b> ffffffff81411961: 8b 40 30 mov 0x30(%rax),%eax return atomic_read(&folio->_mapcount) + 1; ffffffff81411964: 83 c0 01 add $0x1,%eax ffffffff81411967: 83 f8 01 cmp $0x1,%eax ffffffff8141196a: 7f 77 jg ffffffff814119e3 <migrate_misplaced_page+0x1e3> So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je for PageTransHuge(). Essentially one more MOV instruction. I guess nobody really cares :)
On 10.08.23 13:14, David Hildenbrand wrote: > On 09.08.23 21:17, David Hildenbrand wrote: >> On 09.08.23 21:07, Ryan Roberts wrote: >>> On 09/08/2023 09:32, David Hildenbrand wrote: >>>> Let's track the total mapcount for all large folios in the first subpage. >>>> >>>> The total mapcount is what we actually want to know in folio_mapcount() >>>> and it is also sufficient for implementing folio_mapped(). This also >>>> gets rid of any "raceiness" concerns as expressed in >>>> folio_total_mapcount(). >>>> >>>> With sub-PMD THP becoming more important and things looking promising >>>> that we will soon get support for such anon THP, we want to avoid looping >>>> over all pages of a folio just to calculate the total mapcount. Further, >>>> we may soon want to use the total mapcount in other context more >>>> frequently, so prepare for reading it efficiently and atomically. >>>> >>>> Make room for the total mapcount in page[1] of the folio by moving >>>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>>> -- and with the total mapcount in place probably also not desirable even >>>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>>> can overlay the hugetlb fields. >>>> >>>> Note that we currently don't expect any order-1 compound pages / THP in >>>> rmap code, and that such support is not planned. If ever desired, we could >>>> move the compound mapcount to another page, because it only applies to >>>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>>> more space elsewhere for now -- we might need more space for a different >>>> purpose in some subpages soon. >>>> >>>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>>> page_mapped(). >>>> >>>> We can now get rid of folio_total_mapcount() and >>>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>>> >>>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>>> externally where it might be used on arbitrary order-1 pages. The remaining >>>> usage is: >>>> >>>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>>> -> If we would account the total folio as mapped when mapping a >>>> page (based on the total mapcount), we could remove that usage. >>>> >>>> (2) Detect when to add a folio to the deferred split queue >>>> -> If we would apply a different heuristic, or scan using the rmap on >>>> the memory reclaim path for partially mapped anon folios to >>>> split them, we could remove that usage as well. >>>> >>>> So maybe, we can simplify things in the future and remove >>>> _nr_pages_mapped. For now, leave these things as they are, they need more >>>> thought. Hugh really did a nice job implementing that precise tracking >>>> after all. >>>> >>>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>>> now. For anon pages, they are certainly not required right now. >>>> >>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>>> Cc: Yang Shi <shy828301@gmail.com> >>>> Cc: Zi Yan <ziy@nvidia.com> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> >>> Other than the nits and query on zeroing _total_mapcount below, LGTM. 
If zeroing >>> is correct: >>> >>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> >> >> Thanks for the review! >> >> [...] >> >>>> >>>> static inline int total_mapcount(struct page *page) >>> >>> nit: couldn't total_mapcount() just be implemented as a wrapper around >>> folio_mapcount()? What's the benefit of PageCompound() check instead of just >>> getting the folio and checking if it's large? i.e: >> >> Good point, let me take a look tomorrow if the compiler can optimize in >> both cases equally well. > > I checked by adjusting total_mapcount(): > > Before: > > if (PageTransHuge(page) && total_mapcount(page) > 1) > ffffffff81411931: 4c 89 e7 mov %r12,%rdi > ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> > ffffffff81411939: 85 c0 test %eax,%eax > ffffffff8141193b: 74 29 je ffffffff81411966 <migrate_misplaced_page+0x166> > ffffffff8141193d: 49 8b 04 24 mov (%r12),%rax > return test_bit(PG_head, &page->flags) || > ffffffff81411941: a9 00 00 01 00 test $0x10000,%eax > ffffffff81411946: 0f 85 1f 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> > READ_ONCE(page->compound_head) & 1; > ffffffff8141194c: 49 8b 44 24 08 mov 0x8(%r12),%rax > return test_bit(PG_head, &page->flags) || > ffffffff81411951: a8 01 test $0x1,%al > ffffffff81411953: 0f 85 12 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> > ffffffff81411959: 41 8b 44 24 30 mov 0x30(%r12),%eax > return atomic_read(&page->_mapcount) + 1; > ffffffff8141195e: 83 c0 01 add $0x1,%eax > ffffffff81411961: 83 f8 01 cmp $0x1,%eax > ffffffff81411964: 7f 77 jg ffffffff814119dd <migrate_misplaced_page+0x1dd> > > So a total of 10 instructions after handling the mov/call/test/je for PageTransHuge(). > > After: > > if (PageTransHuge(page) && total_mapcount(page) > 1) > ffffffff81411931: 4c 89 e7 mov %r12,%rdi > ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> > ffffffff81411939: 85 c0 test %eax,%eax > ffffffff8141193b: 74 2f je ffffffff8141196c <migrate_misplaced_page+0x16c> > unsigned long head = READ_ONCE(page->compound_head); > ffffffff8141193d: 49 8b 44 24 08 mov 0x8(%r12),%rax > if (unlikely(head & 1)) > ffffffff81411942: a8 01 test $0x1,%al > ffffffff81411944: 0f 85 fc 05 00 00 jne ffffffff81411f46 <migrate_misplaced_page+0x746> > ffffffff8141194a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > return page; > ffffffff8141194f: 4c 89 e0 mov %r12,%rax > ffffffff81411952: 48 8b 10 mov (%rax),%rdx > if (likely(!folio_test_large(folio))) > ffffffff81411955: f7 c2 00 00 01 00 test $0x10000,%edx > ffffffff8141195b: 0f 85 da 05 00 00 jne ffffffff81411f3b <migrate_misplaced_page+0x73b> > ffffffff81411961: 8b 40 30 mov 0x30(%rax),%eax > return atomic_read(&folio->_mapcount) + 1; > ffffffff81411964: 83 c0 01 add $0x1,%eax > ffffffff81411967: 83 f8 01 cmp $0x1,%eax > ffffffff8141196a: 7f 77 jg ffffffff814119e3 <migrate_misplaced_page+0x1e3> > > So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je for PageTransHuge(). > > Essentially one more MOV instruction. > > I guess nobody really cares :) > Also, let's simply do: static inline bool folio_mapped(struct folio *folio) { folio_mapcount(folio) > 1; } So this all cleans up quite a bit.
On 10.08.23 13:27, David Hildenbrand wrote: > On 10.08.23 13:14, David Hildenbrand wrote: >> On 09.08.23 21:17, David Hildenbrand wrote: >>> On 09.08.23 21:07, Ryan Roberts wrote: >>>> On 09/08/2023 09:32, David Hildenbrand wrote: >>>>> Let's track the total mapcount for all large folios in the first subpage. >>>>> >>>>> The total mapcount is what we actually want to know in folio_mapcount() >>>>> and it is also sufficient for implementing folio_mapped(). This also >>>>> gets rid of any "raceiness" concerns as expressed in >>>>> folio_total_mapcount(). >>>>> >>>>> With sub-PMD THP becoming more important and things looking promising >>>>> that we will soon get support for such anon THP, we want to avoid looping >>>>> over all pages of a folio just to calculate the total mapcount. Further, >>>>> we may soon want to use the total mapcount in other context more >>>>> frequently, so prepare for reading it efficiently and atomically. >>>>> >>>>> Make room for the total mapcount in page[1] of the folio by moving >>>>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>>>> -- and with the total mapcount in place probably also not desirable even >>>>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>>>> can overlay the hugetlb fields. >>>>> >>>>> Note that we currently don't expect any order-1 compound pages / THP in >>>>> rmap code, and that such support is not planned. If ever desired, we could >>>>> move the compound mapcount to another page, because it only applies to >>>>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>>>> more space elsewhere for now -- we might need more space for a different >>>>> purpose in some subpages soon. >>>>> >>>>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>>>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>>>> page_mapped(). >>>>> >>>>> We can now get rid of folio_total_mapcount() and >>>>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>>>> >>>>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>>>> externally where it might be used on arbitrary order-1 pages. The remaining >>>>> usage is: >>>>> >>>>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>>>> -> If we would account the total folio as mapped when mapping a >>>>> page (based on the total mapcount), we could remove that usage. >>>>> >>>>> (2) Detect when to add a folio to the deferred split queue >>>>> -> If we would apply a different heuristic, or scan using the rmap on >>>>> the memory reclaim path for partially mapped anon folios to >>>>> split them, we could remove that usage as well. >>>>> >>>>> So maybe, we can simplify things in the future and remove >>>>> _nr_pages_mapped. For now, leave these things as they are, they need more >>>>> thought. Hugh really did a nice job implementing that precise tracking >>>>> after all. >>>>> >>>>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>>>> now. For anon pages, they are certainly not required right now. 
>>>>> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>>> Cc: Hugh Dickins <hughd@google.com> >>>>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>>>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>>>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>>>> Cc: Yang Shi <shy828301@gmail.com> >>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> >>>> Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing >>>> is correct: >>>> >>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> >>> >>> Thanks for the review! >>> >>> [...] >>> >>>>> >>>>> static inline int total_mapcount(struct page *page) >>>> >>>> nit: couldn't total_mapcount() just be implemented as a wrapper around >>>> folio_mapcount()? What's the benefit of PageCompound() check instead of just >>>> getting the folio and checking if it's large? i.e: >>> >>> Good point, let me take a look tomorrow if the compiler can optimize in >>> both cases equally well. >> >> I checked by adjusting total_mapcount(): >> >> Before: >> >> if (PageTransHuge(page) && total_mapcount(page) > 1) >> ffffffff81411931: 4c 89 e7 mov %r12,%rdi >> ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> >> ffffffff81411939: 85 c0 test %eax,%eax >> ffffffff8141193b: 74 29 je ffffffff81411966 <migrate_misplaced_page+0x166> >> ffffffff8141193d: 49 8b 04 24 mov (%r12),%rax >> return test_bit(PG_head, &page->flags) || >> ffffffff81411941: a9 00 00 01 00 test $0x10000,%eax >> ffffffff81411946: 0f 85 1f 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> >> READ_ONCE(page->compound_head) & 1; >> ffffffff8141194c: 49 8b 44 24 08 mov 0x8(%r12),%rax >> return test_bit(PG_head, &page->flags) || >> ffffffff81411951: a8 01 test $0x1,%al >> ffffffff81411953: 0f 85 12 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> >> ffffffff81411959: 41 8b 44 24 30 mov 0x30(%r12),%eax >> return atomic_read(&page->_mapcount) + 1; >> ffffffff8141195e: 83 c0 01 add $0x1,%eax >> ffffffff81411961: 83 f8 01 cmp $0x1,%eax >> ffffffff81411964: 7f 77 jg ffffffff814119dd <migrate_misplaced_page+0x1dd> >> >> So a total of 10 instructions after handling the mov/call/test/je for PageTransHuge(). 
>> >> After: >> >> if (PageTransHuge(page) && total_mapcount(page) > 1) >> ffffffff81411931: 4c 89 e7 mov %r12,%rdi >> ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> >> ffffffff81411939: 85 c0 test %eax,%eax >> ffffffff8141193b: 74 2f je ffffffff8141196c <migrate_misplaced_page+0x16c> >> unsigned long head = READ_ONCE(page->compound_head); >> ffffffff8141193d: 49 8b 44 24 08 mov 0x8(%r12),%rax >> if (unlikely(head & 1)) >> ffffffff81411942: a8 01 test $0x1,%al >> ffffffff81411944: 0f 85 fc 05 00 00 jne ffffffff81411f46 <migrate_misplaced_page+0x746> >> ffffffff8141194a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) >> return page; >> ffffffff8141194f: 4c 89 e0 mov %r12,%rax >> ffffffff81411952: 48 8b 10 mov (%rax),%rdx >> if (likely(!folio_test_large(folio))) >> ffffffff81411955: f7 c2 00 00 01 00 test $0x10000,%edx >> ffffffff8141195b: 0f 85 da 05 00 00 jne ffffffff81411f3b <migrate_misplaced_page+0x73b> >> ffffffff81411961: 8b 40 30 mov 0x30(%rax),%eax >> return atomic_read(&folio->_mapcount) + 1; >> ffffffff81411964: 83 c0 01 add $0x1,%eax >> ffffffff81411967: 83 f8 01 cmp $0x1,%eax >> ffffffff8141196a: 7f 77 jg ffffffff814119e3 <migrate_misplaced_page+0x1e3> >> >> So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je for PageTransHuge(). >> >> Essentially one more MOV instruction. >> >> I guess nobody really cares :) >> > > Also, let's simply do: > > static inline bool folio_mapped(struct folio *folio) > { > folio_mapcount(folio) > 1; > 0, obviously :)
On 10/08/2023 12:32, David Hildenbrand wrote: > On 10.08.23 13:27, David Hildenbrand wrote: >> On 10.08.23 13:14, David Hildenbrand wrote: >>> On 09.08.23 21:17, David Hildenbrand wrote: >>>> On 09.08.23 21:07, Ryan Roberts wrote: >>>>> On 09/08/2023 09:32, David Hildenbrand wrote: >>>>>> Let's track the total mapcount for all large folios in the first subpage. >>>>>> >>>>>> The total mapcount is what we actually want to know in folio_mapcount() >>>>>> and it is also sufficient for implementing folio_mapped(). This also >>>>>> gets rid of any "raceiness" concerns as expressed in >>>>>> folio_total_mapcount(). >>>>>> >>>>>> With sub-PMD THP becoming more important and things looking promising >>>>>> that we will soon get support for such anon THP, we want to avoid looping >>>>>> over all pages of a folio just to calculate the total mapcount. Further, >>>>>> we may soon want to use the total mapcount in other context more >>>>>> frequently, so prepare for reading it efficiently and atomically. >>>>>> >>>>>> Make room for the total mapcount in page[1] of the folio by moving >>>>>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>>>>> -- and with the total mapcount in place probably also not desirable even >>>>>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>>>>> can overlay the hugetlb fields. >>>>>> >>>>>> Note that we currently don't expect any order-1 compound pages / THP in >>>>>> rmap code, and that such support is not planned. If ever desired, we could >>>>>> move the compound mapcount to another page, because it only applies to >>>>>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>>>>> more space elsewhere for now -- we might need more space for a different >>>>>> purpose in some subpages soon. >>>>>> >>>>>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>>>>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>>>>> page_mapped(). >>>>>> >>>>>> We can now get rid of folio_total_mapcount() and >>>>>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>>>>> >>>>>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>>>>> externally where it might be used on arbitrary order-1 pages. The remaining >>>>>> usage is: >>>>>> >>>>>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>>>>>     -> If we would account the total folio as mapped when mapping a >>>>>>        page (based on the total mapcount), we could remove that usage. >>>>>> >>>>>> (2) Detect when to add a folio to the deferred split queue >>>>>>     -> If we would apply a different heuristic, or scan using the rmap on >>>>>>        the memory reclaim path for partially mapped anon folios to >>>>>>        split them, we could remove that usage as well. >>>>>> >>>>>> So maybe, we can simplify things in the future and remove >>>>>> _nr_pages_mapped. For now, leave these things as they are, they need more >>>>>> thought. Hugh really did a nice job implementing that precise tracking >>>>>> after all. >>>>>> >>>>>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>>>>>          now. For anon pages, they are certainly not required right now. 
>>>>>> >>>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>>>> Cc: Hugh Dickins <hughd@google.com> >>>>>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>>>>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>>>>> Cc: Yang Shi <shy828301@gmail.com> >>>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>>> >>>>> Other than the nits and query on zeroing _total_mapcount below, LGTM. If >>>>> zeroing >>>>> is correct: >>>>> >>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> >>>> >>>> Thanks for the review! >>>> >>>> [...] >>>> >>>>>>         static inline int total_mapcount(struct page *page) >>>>> >>>>> nit: couldn't total_mapcount() just be implemented as a wrapper around >>>>> folio_mapcount()? What's the benefit of PageCompound() check instead of just >>>>> getting the folio and checking if it's large? i.e: >>>> >>>> Good point, let me take a look tomorrow if the compiler can optimize in >>>> both cases equally well. >>> >>> I checked by adjusting total_mapcount(): >>> >>> Before: >>> >>>           if (PageTransHuge(page) && total_mapcount(page) > 1) >>> ffffffff81411931:      4c 89 e7               mov   %r12,%rdi >>> ffffffff81411934:      e8 f7 b1 ff ff         call  ffffffff8140cb30 >>> <PageTransHuge> >>> ffffffff81411939:      85 c0                  test  %eax,%eax >>> ffffffff8141193b:      74 29                  je    ffffffff81411966 >>> <migrate_misplaced_page+0x166> >>> ffffffff8141193d:      49 8b 04 24            mov   (%r12),%rax >>>           return test_bit(PG_head, &page->flags) || >>> ffffffff81411941:      a9 00 00 01 00         test  $0x10000,%eax >>> ffffffff81411946:      0f 85 1f 01 00 00      jne   ffffffff81411a6b >>> <migrate_misplaced_page+0x26b> >>>                  READ_ONCE(page->compound_head) & 1; >>> ffffffff8141194c:      49 8b 44 24 08         mov   0x8(%r12),%rax >>>           return test_bit(PG_head, &page->flags) || >>> ffffffff81411951:      a8 01                  test  $0x1,%al >>> ffffffff81411953:      0f 85 12 01 00 00      jne   ffffffff81411a6b >>> <migrate_misplaced_page+0x26b> >>> ffffffff81411959:      41 8b 44 24 30         mov   0x30(%r12),%eax >>>                   return atomic_read(&page->_mapcount) + 1; >>> ffffffff8141195e:      83 c0 01               add   $0x1,%eax >>> ffffffff81411961:      83 f8 01               cmp   $0x1,%eax >>> ffffffff81411964:      7f 77                  jg    ffffffff814119dd >>> <migrate_misplaced_page+0x1dd> >>> >>> So a total of 10 instructions after handling the mov/call/test/je for >>> PageTransHuge(). 
>>> >>> After: >>> >>>           if (PageTransHuge(page) && total_mapcount(page) > 1) >>> ffffffff81411931:      4c 89 e7               mov   %r12,%rdi >>> ffffffff81411934:      e8 f7 b1 ff ff         call  ffffffff8140cb30 >>> <PageTransHuge> >>> ffffffff81411939:      85 c0                  test  %eax,%eax >>> ffffffff8141193b:      74 2f                  je    ffffffff8141196c >>> <migrate_misplaced_page+0x16c> >>>           unsigned long head = READ_ONCE(page->compound_head); >>> ffffffff8141193d:      49 8b 44 24 08         mov   0x8(%r12),%rax >>>           if (unlikely(head & 1)) >>> ffffffff81411942:      a8 01                  test  $0x1,%al >>> ffffffff81411944:      0f 85 fc 05 00 00      jne   ffffffff81411f46 >>> <migrate_misplaced_page+0x746> >>> ffffffff8141194a:      0f 1f 44 00 00         nopl  0x0(%rax,%rax,1) >>>                   return page; >>> ffffffff8141194f:      4c 89 e0               mov   %r12,%rax >>> ffffffff81411952:      48 8b 10               mov   (%rax),%rdx >>>           if (likely(!folio_test_large(folio))) >>> ffffffff81411955:      f7 c2 00 00 01 00      test  $0x10000,%edx >>> ffffffff8141195b:      0f 85 da 05 00 00      jne   ffffffff81411f3b >>> <migrate_misplaced_page+0x73b> >>> ffffffff81411961:      8b 40 30               mov   0x30(%rax),%eax >>>                   return atomic_read(&folio->_mapcount) + 1; >>> ffffffff81411964:      83 c0 01               add   $0x1,%eax >>> ffffffff81411967:      83 f8 01               cmp   $0x1,%eax >>> ffffffff8141196a:      7f 77                  jg    ffffffff814119e3 >>> <migrate_misplaced_page+0x1e3> >>> >>> So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je >>> for PageTransHuge(). >>> >>> Essentially one more MOV instruction. >>> >>> I guess nobody really cares :) >>> >> >> Also, let's simply do: >> >> static inline bool folio_mapped(struct folio *folio) >> { >>     folio_mapcount(folio) > 1; > >> 0, obviously :) > ;-) All sounds good to me!
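Putting the two corrections together, the helper being converged on would look something like this sketch (the stored counter is -1-based, so folio_mapcount() returns 0 for an unmapped folio):

static inline bool folio_mapped(struct folio *folio)
{
	/* folio_mapcount() is 0 when nothing maps the folio. */
	return folio_mapcount(folio) > 0;
}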
On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: > > For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is > > really significant (judging from my experience on managing PageAnonExclusive > > using set_bit/test_bit/clear_bit when (un)mapping anon pages). > > > > As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be > > batching where possible (and Ryan is working on some more rmap batching). > > Yes, I've just posted [1] which batches the rmap removal. That would allow you > to convert the per-page atomic_dec() into a (usually) single per-large-folio > atomic_sub(). > > [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ Right, that'll definitely make more sense, thanks for the link; I'd be very happy to read more later (finally I got some free time recently..). But then does it mean David's patch can be attached at the end instead of proposed separately and early? I was asking mostly because I read it as a standalone patch first, and honestly I don't know the effect. It's based on not only the added atomic ops itself, but also the field changes. For example, this patch moves Hugh's _nr_pages_mapped into the 2nd tail page, I think it means for any rmap change of any small page of a huge one we'll need to start touching one more 64B cacheline on x86. I really have no idea what does it mean for especially a large SMP: see 292648ac5cf1 on why I had an impression of that. But I've no enough experience or clue to prove it a problem either, maybe would be interesting to measure the time needed for some pte-mapped loops? E.g., something like faulting in a thp, then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it takes before/after this patch. When looking at this, I actually found one thing that is slightly confusing, not directly relevant to your patch, but regarding the reuse of tail page 1 on offset 24 bytes. Current it's Hugh's _nr_pages_mapped, and you're proposing to replace it with the total mapcount: atomic_t _nr_pages_mapped; /* 88 4 */ Now my question is.. isn't byte 24 of tail page 1 used for keeping a poisoned mapping? See prep_compound_tail() where it has: p->mapping = TAIL_MAPPING; While here mapping is, afaict, also using offset 24 of the tail page 1: struct address_space * mapping; /* 24 8 */ I hope I did a wrong math somewhere, though.
On 10.08.23 19:15, Peter Xu wrote: > On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: >>> For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is >>> really significant (judging from my experience on managing PageAnonExclusive >>> using set_bit/test_bit/clear_bit when (un)mapping anon pages). >>> >>> As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be >>> batching where possible (and Ryan is working on some more rmap batching). >> >> Yes, I've just posted [1] which batches the rmap removal. That would allow you >> to convert the per-page atomic_dec() into a (usually) single per-large-folio >> atomic_sub(). >> >> [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ > > Right, that'll definitely make more sense, thanks for the link; I'd be very > happy to read more later (finally I got some free time recently..). But > then does it mean David's patch can be attached at the end instead of > proposed separately and early? Not in my opinion. Batching rmap makes sense even without this change, and this change makes sense even without batching. > > I was asking mostly because I read it as a standalone patch first, and > honestly I don't know the effect. It's based on not only the added atomic > ops itself, but also the field changes. > > For example, this patch moves Hugh's _nr_pages_mapped into the 2nd tail > page, I think it means for any rmap change of any small page of a huge one > we'll need to start touching one more 64B cacheline on x86. I really have > no idea what does it mean for especially a large SMP: see 292648ac5cf1 on > why I had an impression of that. But I've no enough experience or clue to > prove it a problem either, maybe would be interesting to measure the time > needed for some pte-mapped loops? E.g., something like faulting in a thp, Okay, so your speculation right now is: 1) The change in cacheline might be problematic. 2) The additional atomic operation might be problematic. > then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it > takes before/after this patch. I can certainly try getting some numbers on that. If you're aware of other micro-benchmarks that would likely notice slower pte-mapping of THPs, please let me know. > > When looking at this, I actually found one thing that is slightly > confusing, not directly relevant to your patch, but regarding the reuse of > tail page 1 on offset 24 bytes. Current it's Hugh's _nr_pages_mapped, > and you're proposing to replace it with the total mapcount: > > atomic_t _nr_pages_mapped; /* 88 4 */ > > Now my question is.. isn't byte 24 of tail page 1 used for keeping a > poisoned mapping? See prep_compound_tail() where it has: > > p->mapping = TAIL_MAPPING; > > While here mapping is, afaict, also using offset 24 of the tail page 1: > > struct address_space * mapping; /* 24 8 */ > > I hope I did a wrong math somewhere, though. > I think your math is correct. prep_compound_head() is called after prep_compound_tail(), so prep_compound_head() wins. In __split_huge_page_tail() there is a VM_BUG_ON_PAGE() that explains the situation: /* ->mapping in first and second tail page is replaced by other uses */ VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, page_tail); Thanks for raising that, I had to look into that myself.
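A condensed sketch of the ordering David describes here, paraphrasing the compound-page preparation flow rather than quoting the upstream code verbatim:

static void prep_compound_page_sketch(struct page *page, unsigned int order)
{
	unsigned int i, nr_pages = 1 << order;

	__SetPageHead(page);
	for (i = 1; i < nr_pages; i++)
		prep_compound_tail(page, i);	/* poisons page[i].mapping with TAIL_MAPPING */

	/* Runs last, so the folio fields it initializes in page[1]/page[2] win. */
	prep_compound_head(page, order);
}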
On 10/08/2023 18:47, David Hildenbrand wrote: > On 10.08.23 19:15, Peter Xu wrote: >> On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: >>>> For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is >>>> really significant (judging from my experience on managing PageAnonExclusive >>>> using set_bit/test_bit/clear_bit when (un)mapping anon pages). >>>> >>>> As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be >>>> batching where possible (and Ryan is working on some more rmap batching). >>> >>> Yes, I've just posted [1] which batches the rmap removal. That would allow you >>> to convert the per-page atomic_dec() into a (usually) single per-large-folio >>> atomic_sub(). >>> >>> [1] >>> https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ >> >> Right, that'll definitely make more sense, thanks for the link; I'd be very >> happy to read more later (finally I got some free time recently..). But >> then does it mean David's patch can be attached at the end instead of >> proposed separately and early? > > Not in my opinion. Batching rmap makes sense even without this change, and this > change makes sense even without batching. FWIW, I agree that my series and David's series should be treated independently. There is independent value in both. It's also worth pointing out that with my series, the amount of batching you see in practice still depends on large folios being mapped, which isn't quite the common case yet.
On Thu, Aug 10, 2023 at 07:47:35PM +0200, David Hildenbrand wrote: > On 10.08.23 19:15, Peter Xu wrote: > > On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: > > > > For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is > > > > really significant (judging from my experience on managing PageAnonExclusive > > > > using set_bit/test_bit/clear_bit when (un)mapping anon pages). > > > > > > > > As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be > > > > batching where possible (and Ryan is working on some more rmap batching). > > > > > > Yes, I've just posted [1] which batches the rmap removal. That would allow you > > > to convert the per-page atomic_dec() into a (usually) single per-large-folio > > > atomic_sub(). > > > > > > [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ > > > > Right, that'll definitely make more sense, thanks for the link; I'd be very > > happy to read more later (finally I got some free time recently..). But > > then does it mean David's patch can be attached at the end instead of > > proposed separately and early? > > Not in my opinion. Batching rmap makes sense even without this change, and > this change makes sense even without batching. > > > > > I was asking mostly because I read it as a standalone patch first, and > > honestly I don't know the effect. It's based on not only the added atomic > > ops itself, but also the field changes. > > > > For example, this patch moves Hugh's _nr_pages_mapped into the 2nd tail > > page, I think it means for any rmap change of any small page of a huge one > > we'll need to start touching one more 64B cacheline on x86. I really have > > no idea what does it mean for especially a large SMP: see 292648ac5cf1 on > > why I had an impression of that. But I've no enough experience or clue to > > prove it a problem either, maybe would be interesting to measure the time > > needed for some pte-mapped loops? E.g., something like faulting in a thp, > > Okay, so your speculation right now is: > > 1) The change in cacheline might be problematic. > > 2) The additional atomic operation might be problematic. > > > then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it > > takes before/after this patch. > > I can certainly try getting some numbers on that. If you're aware of other > micro-benchmarks that would likely notice slower pte-mapping of THPs, please > let me know. Thanks. > > > > > When looking at this, I actually found one thing that is slightly > > confusing, not directly relevant to your patch, but regarding the reuse of > > tail page 1 on offset 24 bytes. Current it's Hugh's _nr_pages_mapped, > > and you're proposing to replace it with the total mapcount: > > > > atomic_t _nr_pages_mapped; /* 88 4 */ > > > > Now my question is.. isn't byte 24 of tail page 1 used for keeping a > > poisoned mapping? See prep_compound_tail() where it has: > > > > p->mapping = TAIL_MAPPING; > > > > While here mapping is, afaict, also using offset 24 of the tail page 1: > > > > struct address_space * mapping; /* 24 8 */ > > > > I hope I did a wrong math somewhere, though. > > > > I think your math is correct. > > prep_compound_head() is called after prep_compound_tail(), so > prep_compound_head() wins. 
> > In __split_huge_page_tail() there is a VM_BUG_ON_PAGE() that explains the > situation: > > /* ->mapping in first and second tail page is replaced by other uses */ > VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, > page_tail); > > Thanks for raising that, I had to look into that myself. It's so confusing so I did try to document them a bit myself, then I found maybe I should just post a patch for it and I just did: https://lore.kernel.org/r/20230810204944.53471-1-peterx@redhat.com It'll conflict with yours, but I marked RFC so it's never anything urgent, but maybe that'll be helpful already for you to introduce any new fields like total_mapcounts. AFAICS if that patch was all correct (while I'm not yet sure..), you can actually fit your new total mapcount field into page 1 so even avoid the extra cacheline access. You can have a look: the trick is refcount for tail page 1 is still seems to be free on 32 bits (if that was your worry before). Then it'll be very nice if to keep Hugh's counter all in tail 1.
On Thu, Aug 10, 2023 at 10:37:04AM +0200, David Hildenbrand wrote: > On 10.08.23 05:25, Matthew Wilcox wrote: > > On Wed, Aug 09, 2023 at 05:23:46PM -0400, Peter Xu wrote: > > > Hi, David, > > > > > > Some pure questions below.. > > > > > > On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote: > > > > Let's track the total mapcount for all large folios in the first subpage. > > > > > > > > The total mapcount is what we actually want to know in folio_mapcount() > > > > and it is also sufficient for implementing folio_mapped(). This also > > > > gets rid of any "raceiness" concerns as expressed in > > > > folio_total_mapcount(). > > > > > > Any more information for that "raciness" described here? > > > > UTSL. > > > > /* > > * Add all the PTE mappings of those pages mapped by PTE. > > * Limit the loop to folio_nr_pages_mapped()? > > * Perhaps: given all the raciness, that may be a good or a bad idea. > > */ > > > > Yes, that comment from Hugh primarily discusses how we could possibly > optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the > iterations would be racy. As far as I can see, there are cases where "it > would be certainly a bad idea" :) Is the race described about mapcount being changed right after it's read? Are you aware of anything specific that will be broken, and will be fixed with this patch? I assume mapcount==1 will be very special in this case when e.g. holding a pgtable lock, other than that I won't be surprised if mapcount changes in parallel. But I must confess I don't really have any thorough digests on this whole matter. > > > In the other comment in that function, it's also made clear what the > traditional behavior with PMD-mappable THP was "In the common case, avoid > the loop when no pages mapped by PTE", which will no longer hold with > sub-PMD THP. Having a total mapcount does sound helpful if partial folio is common indeed. I'm curious whether that'll be so common after the large anon folio work - isn't it be sad if partial folio will be a norm? It sounds to me that's the case when small page sizes should be used.. and it's prone to waste?
On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > AFAICS if that patch was all correct (while I'm not yet sure..), you can > actually fit your new total mapcount field into page 1 so even avoid the > extra cacheline access. You can have a look: the trick is refcount for > tail page 1 is still seems to be free on 32 bits (if that was your worry > before). Then it'll be very nice if to keep Hugh's counter all in tail 1. No, refcount must be 0 on all tail pages. We rely on this in many places in the MM.
On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: > > Yes, that comment from Hugh primarily discusses how we could possibly > > optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the > > iterations would be racy. As far as I can see, there are cases where "it > > would be certainly a bad idea" :) > > Is the race described about mapcount being changed right after it's read? > Are you aware of anything specific that will be broken, and will be fixed > with this patch? The problem is that people check the mapcount while holding no locks; not the PTL, not the page lock. So it's an unfixable race. > Having a total mapcount does sound helpful if partial folio is common > indeed. > > I'm curious whether that'll be so common after the large anon folio work - > isn't it be sad if partial folio will be a norm? It sounds to me that's > the case when small page sizes should be used.. and it's prone to waste? The problem is that entire_mapcount isn't really entire_mapcount. It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, but it gets gnarly when people do partial unmaps. So the _usual_ case ends up touching every struct page. Which sucks. Also it's one of the things which stands in the way of shrinking struct page. But it's kind of annoying to explain all of this to you individually. There have been hundreds of emails about it over the last months on this mailing list. It would be nice if you could catch up instead of jumping in.
On 10.08.23 23:54, Matthew Wilcox wrote: > On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: >>> Yes, that comment from Hugh primarily discusses how we could possibly >>> optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the >>> iterations would be racy. As far as I can see, there are cases where "it >>> would be certainly a bad idea" :) >> >> Is the race described about mapcount being changed right after it's read? >> Are you aware of anything specific that will be broken, and will be fixed >> with this patch? > > The problem is that people check the mapcount while holding no locks; > not the PTL, not the page lock. So it's an unfixable race. > >> Having a total mapcount does sound helpful if partial folio is common >> indeed. >> >> I'm curious whether that'll be so common after the large anon folio work - >> isn't it be sad if partial folio will be a norm? It sounds to me that's >> the case when small page sizes should be used.. and it's prone to waste? > > The problem is that entire_mapcount isn't really entire_mapcount. > It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, > but it gets gnarly when people do partial unmaps. So the _usual_ case > ends up touching every struct page. Which sucks. Also it's one of the > things which stands in the way of shrinking struct page. Right, so one current idea is to have a single total_mapcount and look into removing the subpage mapcounts (which will require first removing _nr_pages_mapped, because that's still one of the important users). Until we get there, also rmap code has to do eventually "more tracking" and might, unfortunately, end up slower. > > But it's kind of annoying to explain all of this to you individually. > There have been hundreds of emails about it over the last months on > this mailing list. It would be nice if you could catch up instead of > jumping in. To be fair, a lot of the details are not readily available and in the heads of selected people :) Peter, if you're interested, we can discuss the current plans, issues and ideas offline!
>> Okay, so your speculation right now is: >> >> 1) The change in cacheline might be problematic. >> >> 2) The additional atomic operation might be problematic. >> >>> then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it >>> takes before/after this patch. >> >> I can certainly try getting some numbers on that. If you're aware of other >> micro-benchmarks that would likely notice slower pte-mapping of THPs, please >> let me know. > > Thanks. If I effectively only measure the real PMD->PTE remapping (only measure the for loop that mprotects() one 4k page inside each of 512 THPs) without any of the mmap+populate+munmap, I can certainly measure a real difference. I briefly looked at some perf data across the overall benchmark runtime. For page_remove_rmap(), the new atomic_dec() doesn't seem to be significant. Data indicates that it's significantly less relevant than a later atomic_add_negative(). For page_add_anon_rmap(), it's a bit fuzzy. Definitely, the atomic_inc_return_relaxed(mapped) seems to stick out, but I cannot rule out that the atomic_add() also plays a role. The PMD->PTE remapping effectively does (__split_huge_pmd_locked()) for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) { ... page_add_anon_rmap(page + i, vma, addr, RMAP_NONE); ... } ... page_remove_rmap(page, vma, true); Inside that loop we're repeatedly accessing the total_mapcount and _nr_pages_mapped. So my best guess would have been that both are already hot in the cache. RMAP batching certainly sounds like a good idea for __split_huge_pmd_locked(), independent of this patch. What would probably also be interesting is observing what happens when we unmap a single PTE of a THP and cannot batch, to see if page_remove_rmap() matters at a bigger scale. I'll do some more digging tomorrow to clarify some details. Running some kernel compile tests with thp=always at least didn't reveal any surprises so far.
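For anyone wanting to reproduce the measurement, here is a userspace sketch of the split micro-benchmark discussed above, assuming 2 MiB THPs and reusing the 512-THP / 1 MiB-offset numbers mentioned in the thread; it only times the PMD->PTE split loop and is not meant to be rigorous:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define THP_SIZE	(2UL << 20)
#define NR_THPS		512UL

int main(void)
{
	size_t len = NR_THPS * THP_SIZE;
	char *raw = mmap(NULL, len + THP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Align to the THP size so each 2 MiB chunk can be PMD-mapped. */
	char *buf = (char *)(((unsigned long)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise");	/* best effort; THP might be disabled */
	memset(buf, 1, len);		/* populate, ideally as PMD-mapped THPs */

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned long i = 0; i < NR_THPS; i++) {
		/* mprotect()ing one 4k page forces the PMD mapping to be split. */
		char *page = buf + i * THP_SIZE + (1UL << 20);
		if (mprotect(page, 4096, PROT_READ)) {
			perror("mprotect");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("split loop: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
	munmap(raw, len + THP_SIZE);
	return 0;
}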
On 10.08.23 23:48, Matthew Wilcox wrote: > On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >> AFAICS if that patch was all correct (while I'm not yet sure..), you can >> actually fit your new total mapcount field into page 1 so even avoid the >> extra cacheline access. You can have a look: the trick is refcount for >> tail page 1 is still seems to be free on 32 bits (if that was your worry >> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > > No, refcount must be 0 on all tail pages. We rely on this in many places > in the MM. Very right. One could theoretically 1) Move the compound/entire mapcount to page[2] 2) Make hugetlb stop using the entire mapcount and only the total mapcount. 3) Then leave total_mapcount and nr_pages_mapped in page[1] 4) Make page_mapcount() use total_mapcount for hugetlb. When (un)mapping a PMD-mapped THP, we would go to page[2]. Otherwise, only page[1]. The entire mapcount, similarly to nr_pages_mapped, primarily serves to get memory stats right; well, and to implement page_mapcount() for THP. But I'm not 100% sure yet if the overhead from having nr_pages_mapped in page[2] is significant enough at this point.
On Thu, Aug 10, 2023 at 11:59:25PM +0200, David Hildenbrand wrote: > On 10.08.23 23:54, Matthew Wilcox wrote: > > On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: > > > > Yes, that comment from Hugh primarily discusses how we could possibly > > > > optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the > > > > iterations would be racy. As far as I can see, there are cases where "it > > > > would be certainly a bad idea" :) > > > > > > Is the race described about mapcount being changed right after it's read? > > > Are you aware of anything specific that will be broken, and will be fixed > > > with this patch? > > > > The problem is that people check the mapcount while holding no locks; > > not the PTL, not the page lock. So it's an unfixable race. > > > > > Having a total mapcount does sound helpful if partial folio is common > > > indeed. > > > > > > I'm curious whether that'll be so common after the large anon folio work - > > > isn't it be sad if partial folio will be a norm? It sounds to me that's > > > the case when small page sizes should be used.. and it's prone to waste? > > > > The problem is that entire_mapcount isn't really entire_mapcount. > > It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, > > but it gets gnarly when people do partial unmaps. So the _usual_ case > > ends up touching every struct page. Which sucks. Also it's one of the > > things which stands in the way of shrinking struct page. > > Right, so one current idea is to have a single total_mapcount and look into > removing the subpage mapcounts (which will require first removing > _nr_pages_mapped, because that's still one of the important users). > > Until we get there, also rmap code has to do eventually "more tracking" and > might, unfortunately, end up slower. > > > > > But it's kind of annoying to explain all of this to you individually. > > There have been hundreds of emails about it over the last months on > > this mailing list. It would be nice if you could catch up instead of > > jumping in. > > To be fair, a lot of the details are not readily available and in the heads > of selected people :) > > Peter, if you're interested, we can discuss the current plans, issues and > ideas offline! Thanks for offering help, David. Personally I still am unclear yet on why entire_mapcount cannot be used as full-folio mapcounts, and why "partial unmap" can happen a lot (I don't expect), but yeah I can try to catch up to educate myself first. The only issue regarding an offline sync-up is that even if David will help Peter on catching up the bits, it'll not scale when another Peter2 had the same question.. So David, rather than I waste your time on helping one person, let me try to catch up with the public threads - I'm not sure how far I can go myself; otoh thread links will definitely be helpful to be replied here, so anyone else can reference too. 
I collected a list (which can be enriched) of few threads that might be related, just in case helpful to anyone besides myself: [PATCH 0/2] don't use mapcount() to check large folio sharing https://lore.kernel.org/r/20230728161356.1784568-1-fengwei.yin@intel.com [PATCH v1-v2 0/3] support large folio for mlock https://lore.kernel.org/r/20230728070929.2487065-1-fengwei.yin@intel.com https://lore.kernel.org/r/20230809061105.3369958-1-fengwei.yin@intel.com [PATCH v1 0/4] Optimize mmap_exit for large folios https://lore.kernel.org/r/20230810103332.3062143-1-ryan.roberts@arm.com [PATCH v4-v5 0/5] variable-order, large folios for anonymous memory https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/ https://lore.kernel.org/r/20230810142942.3169679-1-ryan.roberts@arm.com [PATCH v3-v4 0/3] Optimize large folio interaction with deferred split (I assumed Ryan's this one goes into the previous set v5 finally, so just the discussions as reference) https://lore.kernel.org/r/20230720112955.643283-1-ryan.roberts@arm.com https://lore.kernel.org/r/20230727141837.3386072-1-ryan.roberts@arm.com [RFC PATCH v2 0/4] fix large folio for madvise_cold_or_pageout() https://lore.kernel.org/r/20230721094043.2506691-1-fengwei.yin@intel.com I'm not sure how far I'll go; maybe I'll start working on something else before I finish all of them. I'll see.. Not allowing people to jump in will definitely cause less interactions and less involvement/open-ness for the mm community, as sometimes people can't easily judge when it's proper to jump in. IMHO the ideal solution is always keep all discussions public (either meetings with recordings, or shared online documents, always use on-list discussions, etc.), then share the links.
On 11 Aug 2023, at 11:03, Peter Xu wrote: > On Thu, Aug 10, 2023 at 11:59:25PM +0200, David Hildenbrand wrote: >> On 10.08.23 23:54, Matthew Wilcox wrote: >>> On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: >>>>> Yes, that comment from Hugh primarily discusses how we could possibly >>>>> optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the >>>>> iterations would be racy. As far as I can see, there are cases where "it >>>>> would be certainly a bad idea" :) >>>> >>>> Is the race described about mapcount being changed right after it's read? >>>> Are you aware of anything specific that will be broken, and will be fixed >>>> with this patch? >>> >>> The problem is that people check the mapcount while holding no locks; >>> not the PTL, not the page lock. So it's an unfixable race. >>> >>>> Having a total mapcount does sound helpful if partial folio is common >>>> indeed. >>>> >>>> I'm curious whether that'll be so common after the large anon folio work - >>>> isn't it be sad if partial folio will be a norm? It sounds to me that's >>>> the case when small page sizes should be used.. and it's prone to waste? >>> >>> The problem is that entire_mapcount isn't really entire_mapcount. >>> It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, >>> but it gets gnarly when people do partial unmaps. So the _usual_ case >>> ends up touching every struct page. Which sucks. Also it's one of the >>> things which stands in the way of shrinking struct page. >> >> Right, so one current idea is to have a single total_mapcount and look into >> removing the subpage mapcounts (which will require first removing >> _nr_pages_mapped, because that's still one of the important users). >> >> Until we get there, also rmap code has to do eventually "more tracking" and >> might, unfortunately, end up slower. >> >>> >>> But it's kind of annoying to explain all of this to you individually. >>> There have been hundreds of emails about it over the last months on >>> this mailing list. It would be nice if you could catch up instead of >>> jumping in. >> >> To be fair, a lot of the details are not readily available and in the heads >> of selected people :) >> >> Peter, if you're interested, we can discuss the current plans, issues and >> ideas offline! > > Thanks for offering help, David. > > Personally I still am unclear yet on why entire_mapcount cannot be used as > full-folio mapcounts, and why "partial unmap" can happen a lot (I don't > expect), but yeah I can try to catch up to educate myself first. Separate entire_mapcount and per-page mapcount are needed to maintain precise NR_{ANON,FILE}_MAPPED and NR_ANON_THPS. I wrote some explanation (third paragraph) at: https://lore.kernel.org/linux-mm/A28053D6-E158-4726-8BE1-B9F4960AD570@nvidia.com/. Let me know if it helps. > > The only issue regarding an offline sync-up is that even if David will help > Peter on catching up the bits, it'll not scale when another Peter2 had the > same question.. So David, rather than I waste your time on helping one > person, let me try to catch up with the public threads - I'm not sure how > far I can go myself; otoh thread links will definitely be helpful to be > replied here, so anyone else can reference too. 
I collected a list (which > can be enriched) of few threads that might be related, just in case helpful > to anyone besides myself: > > [PATCH 0/2] don't use mapcount() to check large folio sharing > https://lore.kernel.org/r/20230728161356.1784568-1-fengwei.yin@intel.com > > [PATCH v1-v2 0/3] support large folio for mlock > https://lore.kernel.org/r/20230728070929.2487065-1-fengwei.yin@intel.com > https://lore.kernel.org/r/20230809061105.3369958-1-fengwei.yin@intel.com > > [PATCH v1 0/4] Optimize mmap_exit for large folios > https://lore.kernel.org/r/20230810103332.3062143-1-ryan.roberts@arm.com > > [PATCH v4-v5 0/5] variable-order, large folios for anonymous memory > https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/ > https://lore.kernel.org/r/20230810142942.3169679-1-ryan.roberts@arm.com > > [PATCH v3-v4 0/3] Optimize large folio interaction with deferred split > (I assumed Ryan's this one goes into the previous set v5 finally, so just > the discussions as reference) > https://lore.kernel.org/r/20230720112955.643283-1-ryan.roberts@arm.com > https://lore.kernel.org/r/20230727141837.3386072-1-ryan.roberts@arm.com > > [RFC PATCH v2 0/4] fix large folio for madvise_cold_or_pageout() > https://lore.kernel.org/r/20230721094043.2506691-1-fengwei.yin@intel.com > > I'm not sure how far I'll go; maybe I'll start working on something else > before I finish all of them. I'll see.. > > Not allowing people to jump in will definitely cause less interactions and > less involvement/open-ness for the mm community, as sometimes people can't > easily judge when it's proper to jump in. > > IMHO the ideal solution is always keep all discussions public (either > meetings with recordings, or shared online documents, always use on-list > discussions, etc.), then share the links. > > -- > Peter Xu -- Best Regards, Yan, Zi
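Zi Yan's point above about precise NR_{ANON,FILE}_MAPPED accounting can be made concrete with a deliberately simplified sketch (this is not the real mm/rmap.c logic, and it ignores the mixed PMD-plus-PTE cases that _nr_pages_mapped exists to handle): the per-node counters track how many pages became mapped, and for a PTE mapping that can only be decided from the per-page _mapcount transition, not from a folio-wide total.

#include <linux/mm.h>

/*
 * Simplified sketch, not the real rmap code: how many pages become
 * newly mapped by this map operation?  This is the number that feeds
 * NR_ANON_MAPPED / NR_FILE_MAPPED (and NR_ANON_THPS for PMD maps).
 */
static int pages_newly_mapped(struct folio *folio, struct page *page,
			      bool compound)
{
	if (compound)
		/* First PMD mapping: all pages of the folio become mapped. */
		return atomic_inc_and_test(&folio->_entire_mapcount) ?
		       folio_nr_pages(folio) : 0;

	/*
	 * PTE mapping: only this page, and only on its -1 -> 0 transition.
	 * A folio-wide total mapcount cannot tell whether *this* page was
	 * already mapped somewhere else.
	 */
	return atomic_inc_and_test(&page->_mapcount) ? 1 : 0;
}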
On 11.08.23 17:03, Peter Xu wrote: > On Thu, Aug 10, 2023 at 11:59:25PM +0200, David Hildenbrand wrote: >> On 10.08.23 23:54, Matthew Wilcox wrote: >>> On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: >>>>> Yes, that comment from Hugh primarily discusses how we could possibly >>>>> optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the >>>>> iterations would be racy. As far as I can see, there are cases where "it >>>>> would be certainly a bad idea" :) >>>> >>>> Is the race described about mapcount being changed right after it's read? >>>> Are you aware of anything specific that will be broken, and will be fixed >>>> with this patch? >>> >>> The problem is that people check the mapcount while holding no locks; >>> not the PTL, not the page lock. So it's an unfixable race. >>> >>>> Having a total mapcount does sound helpful if partial folio is common >>>> indeed. >>>> >>>> I'm curious whether that'll be so common after the large anon folio work - >>>> isn't it be sad if partial folio will be a norm? It sounds to me that's >>>> the case when small page sizes should be used.. and it's prone to waste? >>> >>> The problem is that entire_mapcount isn't really entire_mapcount. >>> It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, >>> but it gets gnarly when people do partial unmaps. So the _usual_ case >>> ends up touching every struct page. Which sucks. Also it's one of the >>> things which stands in the way of shrinking struct page. >> >> Right, so one current idea is to have a single total_mapcount and look into >> removing the subpage mapcounts (which will require first removing >> _nr_pages_mapped, because that's still one of the important users). >> >> Until we get there, also rmap code has to do eventually "more tracking" and >> might, unfortunately, end up slower. >> >>> >>> But it's kind of annoying to explain all of this to you individually. >>> There have been hundreds of emails about it over the last months on >>> this mailing list. It would be nice if you could catch up instead of >>> jumping in. >> >> To be fair, a lot of the details are not readily available and in the heads >> of selected people :) >> >> Peter, if you're interested, we can discuss the current plans, issues and >> ideas offline! > > Thanks for offering help, David. > > Personally I still am unclear yet on why entire_mapcount cannot be used as > full-folio mapcounts, and why "partial unmap" can happen a lot (I don't > expect), but yeah I can try to catch up to educate myself first. Using fork() is the easiest way. mremap(), MADV_DONTNEED, munmap, ... You might end up having to scan page tables and/or the rmap to figure out which mapcount to adjust, which we should absolutely avoid. > The only issue regarding an offline sync-up is that even if David will help > Peter on catching up the bits, it'll not scale when another Peter2 had the > same question.. So David, rather than I waste your time on helping one > person, let me try to catch up with the public threads - I'm not sure how > far I can go myself; Sure. But note that it's a moving target, and some discussions have been going on for a long time. I recall there were various discussions, including LSF/MM, mm biweekly meeting, and more. So even if you scan through all that, you might either get outdated or incomplete information.
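To make the "partial unmap is not exotic" point concrete, here is a minimal userspace sketch (assumptions: THP is enabled, and MADV_HUGEPAGE is only a hint, so backing by a large folio is not guaranteed) of how ordinary operations end up mapping only part of a large folio; fork() followed by the child unmapping or MADV_DONTNEED'ing part of the range reaches the same state:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M (2UL << 20)

int main(void)
{
	/* Over-allocate so a 2 MiB-aligned chunk can be carved out. */
	char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (raw == MAP_FAILED)
		return 1;
	p = (char *)(((uintptr_t)raw + SZ_2M - 1) & ~(SZ_2M - 1));

	madvise(p, SZ_2M, MADV_HUGEPAGE);	/* hint only */
	memset(p, 1, SZ_2M);	/* fault in, possibly as one PMD-mapped folio */

	/*
	 * Partial unmap: drop the second half.  If the folio was PMD-mapped,
	 * the kernel has to split the PMD mapping and keep the first half
	 * mapped by PTEs, so a single "entire" mapcount can no longer
	 * describe how often the folio is mapped.
	 */
	munmap(p + SZ_2M / 2, SZ_2M / 2);
	return 0;
}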
On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: > On 10.08.23 23:48, Matthew Wilcox wrote: > > On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > > > AFAICS if that patch was all correct (while I'm not yet sure..), you can > > > actually fit your new total mapcount field into page 1 so even avoid the > > > extra cacheline access. You can have a look: the trick is refcount for > > > tail page 1 is still seems to be free on 32 bits (if that was your worry > > > before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > > > > No, refcount must be 0 on all tail pages. We rely on this in many places > > in the MM. > > Very right. Obviously I could have missed this in the past.. can I ask for an example explaining why refcount will be referenced before knowing it's a head? > > One could theoretically > > 1) Move the compound/entire mapcount to page[2] > 2) Make hugetlb stop using the entire mapcount and only the total > mapcount. > 3) Then leave total_mapcount and nr_pages_mapped in page[1] > 4) Make page_mapcount() use total_mapcount for hugetlb. > > When (un)mapping a PMD-mapped THP, we would go to page[2]. Otherwise, only > page[1]. > > The entire mapcount, similarly to nr_pages_mapped, primarily serves to get > memory stats right; well, and to implement page_mapcount() for THP. > > But I'm not 100% sure yet if the overhead from having nr_pages_mapped in > page[2] is significant enough at this point. Worth trying. Besides the cachelines, cmpxchg should also lock the memory bus and meanwhile affect the whole processor pipelines in some way. Maybe that effect is amplified too when it runs in a loop with one more counter.
On 11.08.23 17:18, Peter Xu wrote: > On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: >> On 10.08.23 23:48, Matthew Wilcox wrote: >>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can >>>> actually fit your new total mapcount field into page 1 so even avoid the >>>> extra cacheline access. You can have a look: the trick is refcount for >>>> tail page 1 is still seems to be free on 32 bits (if that was your worry >>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. >>> >>> No, refcount must be 0 on all tail pages. We rely on this in many places >>> in the MM. >> >> Very right. > > Obviously I could have missed this in the past.. can I ask for an example > explaining why refcount will be referenced before knowing it's a head? I think the issue is, when coming from a PFN walker (or GUP-fast), you might see "oh, this is a folio, let's lookup the head page". And you do that. Then, you try taking a reference on that head page. (see try_get_folio()). But as you didn't hold a reference on the folio yet, it can happily get freed + repurposed in the meantime, so maybe it's not a head page anymore. So if the field would get reused for something else, grabbing a reference would corrupt whatever is now stored in there.
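A rough sketch of the pattern being described, loosely modelled on try_get_folio() in mm/gup.c (simplified; the retry loop and devmap handling are omitted, and the function name is just for illustration):

#include <linux/mm.h>
#include <linux/page_ref.h>

static struct folio *speculative_get_folio(struct page *page, int refs)
{
	struct folio *folio = page_folio(page);

	/*
	 * No reference is held yet, so the memory backing "folio" can be
	 * freed and reused at any time.  The speculative increment below
	 * therefore has to hit a field that is a refcount in every possible
	 * reuse of that page.  Keeping tail pages at refcount 0 also makes
	 * the attempt fail cleanly here, since no reference can be taken
	 * on a zero refcount.
	 */
	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
		return NULL;

	/* The folio is stable now; recheck that the page still belongs to it. */
	if (unlikely(page_folio(page) != folio)) {
		folio_put_refs(folio, refs);
		return NULL;
	}
	return folio;
}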
On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: > On 11.08.23 17:18, Peter Xu wrote: > > On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: > > > On 10.08.23 23:48, Matthew Wilcox wrote: > > > > On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > > > > > AFAICS if that patch was all correct (while I'm not yet sure..), you can > > > > > actually fit your new total mapcount field into page 1 so even avoid the > > > > > extra cacheline access. You can have a look: the trick is refcount for > > > > > tail page 1 is still seems to be free on 32 bits (if that was your worry > > > > > before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > > > > > > > > No, refcount must be 0 on all tail pages. We rely on this in many places > > > > in the MM. > > > > > > Very right. > > > > Obviously I could have missed this in the past.. can I ask for an example > > explaining why refcount will be referenced before knowing it's a head? > > I think the issue is, when coming from a PFN walker (or GUP-fast), you might > see "oh, this is a folio, let's lookup the head page". And you do that. > > Then, you try taking a reference on that head page. (see try_get_folio()). > > But as you didn't hold a reference on the folio yet, it can happily get > freed + repurposed in the meantime, so maybe it's not a head page anymore. > > So if the field would get reused for something else, grabbing a reference > would corrupt whatever is now stored in there. Not an issue before large folios, am I right? Because having a head page reused as tail cannot happen iiuc with current thps if only pmd-sized, because the head page is guaranteed to be pmd aligned physically. I don't really know, where a hugetlb 2M head can be reused by a 1G huge later right during the window of fast-gup walking. But obviously that's not common either if that could ever happen. Maybe Matthew was referring to something else (per "in many places")? Thanks,
On 11.08.23 17:58, Peter Xu wrote: > On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: >> On 11.08.23 17:18, Peter Xu wrote: >>> On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: >>>> On 10.08.23 23:48, Matthew Wilcox wrote: >>>>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >>>>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can >>>>>> actually fit your new total mapcount field into page 1 so even avoid the >>>>>> extra cacheline access. You can have a look: the trick is refcount for >>>>>> tail page 1 is still seems to be free on 32 bits (if that was your worry >>>>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. >>>>> >>>>> No, refcount must be 0 on all tail pages. We rely on this in many places >>>>> in the MM. >>>> >>>> Very right. >>> >>> Obviously I could have missed this in the past.. can I ask for an example >>> explaining why refcount will be referenced before knowing it's a head? >> >> I think the issue is, when coming from a PFN walker (or GUP-fast), you might >> see "oh, this is a folio, let's lookup the head page". And you do that. >> >> Then, you try taking a reference on that head page. (see try_get_folio()). >> >> But as you didn't hold a reference on the folio yet, it can happily get >> freed + repurposed in the meantime, so maybe it's not a head page anymore. >> >> So if the field would get reused for something else, grabbing a reference >> would corrupt whatever is now stored in there. > > Not an issue before large folios, am I right? Because having a head page > reused as tail cannot happen iiuc with current thps if only pmd-sized, > because the head page is guaranteed to be pmd aligned physically. There are other users of compound pages, no? THP and hugetlb are just two examples I think. For example, I can spot __GFP_COMP in slab code. Must such compound pages would not be applicable to GUP, though, but to PFN walkers could end up trying to grab them. > > I don't really know, where a hugetlb 2M head can be reused by a 1G huge > later right during the window of fast-gup walking. But obviously that's not > common either if that could ever happen. > > Maybe Matthew was referring to something else (per "in many places")? There are some other cases where PFN walkers want to identify tail pages to skip over them. See the comment in has_unmovable_pages().
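The PFN-walker pattern referred to here looks roughly like this (a sketch of the pattern only, not the actual has_unmovable_pages() code); such walkers need to recognize head vs. tail pages without holding any reference:

#include <linux/mm.h>
#include <linux/page-flags.h>

static void walk_pfn_range_sketch(unsigned long start_pfn,
				  unsigned long end_pfn)
{
	unsigned long pfn = start_pfn;

	while (pfn < end_pfn) {
		struct page *page = pfn_to_page(pfn);

		if (PageHuge(page) || PageTransCompound(page)) {
			struct folio *folio = page_folio(page);

			/* Skip the remaining pages of this compound page. */
			pfn += folio_nr_pages(folio) -
			       folio_page_idx(folio, page);
			continue;
		}
		/* ... per-base-page checks would go here ... */
		pfn++;
	}
}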
On 11 Aug 2023, at 12:08, David Hildenbrand wrote: > On 11.08.23 17:58, Peter Xu wrote: >> On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: >>> On 11.08.23 17:18, Peter Xu wrote: >>>> On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: >>>>> On 10.08.23 23:48, Matthew Wilcox wrote: >>>>>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >>>>>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can >>>>>>> actually fit your new total mapcount field into page 1 so even avoid the >>>>>>> extra cacheline access. You can have a look: the trick is refcount for >>>>>>> tail page 1 is still seems to be free on 32 bits (if that was your worry >>>>>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. >>>>>> >>>>>> No, refcount must be 0 on all tail pages. We rely on this in many places >>>>>> in the MM. >>>>> >>>>> Very right. >>>> >>>> Obviously I could have missed this in the past.. can I ask for an example >>>> explaining why refcount will be referenced before knowing it's a head? >>> >>> I think the issue is, when coming from a PFN walker (or GUP-fast), you might >>> see "oh, this is a folio, let's lookup the head page". And you do that. >>> >>> Then, you try taking a reference on that head page. (see try_get_folio()). >>> >>> But as you didn't hold a reference on the folio yet, it can happily get >>> freed + repurposed in the meantime, so maybe it's not a head page anymore. >>> >>> So if the field would get reused for something else, grabbing a reference >>> would corrupt whatever is now stored in there. >> >> Not an issue before large folios, am I right? Because having a head page >> reused as tail cannot happen iiuc with current thps if only pmd-sized, >> because the head page is guaranteed to be pmd aligned physically. > > There are other users of compound pages, no? THP and hugetlb are just two examples I think. For example, I can spot __GFP_COMP in slab code. > > Must such compound pages would not be applicable to GUP, though, but to PFN walkers could end up trying to grab them. > For FS supporting large folios, their page cache pages can be any order <= PMD_ORDER. See page_cache_ra_order() in mm/readahead.c >> >> I don't really know, where a hugetlb 2M head can be reused by a 1G huge >> later right during the window of fast-gup walking. But obviously that's not >> common either if that could ever happen. >> >> Maybe Matthew was referring to something else (per "in many places")? > > There are some other cases where PFN walkers want to identify tail pages to skip over them. See the comment in has_unmovable_pages(). > > -- > Cheers, > > David / dhildenb -- Best Regards, Yan, Zi
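As a concrete example of how such non-PMD-order folios appear in the first place: a filesystem only has to opt in, and readahead may then allocate folios of any order up to PMD_ORDER via page_cache_ra_order(). A minimal sketch with a made-up filesystem name (XFS, AFS and EROFS do the equivalent in their inode setup paths):

#include <linux/fs.h>
#include <linux/pagemap.h>

static void myfs_setup_inode(struct inode *inode)
{
	/* Allow arbitrary-order folios in this file's page cache. */
	mapping_set_large_folios(inode->i_mapping);
}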
On Fri, Aug 11, 2023 at 12:11:55PM -0400, Zi Yan wrote: > On 11 Aug 2023, at 12:08, David Hildenbrand wrote: > > > On 11.08.23 17:58, Peter Xu wrote: > >> On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: > >>> On 11.08.23 17:18, Peter Xu wrote: > >>>> On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: > >>>>> On 10.08.23 23:48, Matthew Wilcox wrote: > >>>>>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > >>>>>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can > >>>>>>> actually fit your new total mapcount field into page 1 so even avoid the > >>>>>>> extra cacheline access. You can have a look: the trick is refcount for > >>>>>>> tail page 1 is still seems to be free on 32 bits (if that was your worry > >>>>>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > >>>>>> > >>>>>> No, refcount must be 0 on all tail pages. We rely on this in many places > >>>>>> in the MM. > >>>>> > >>>>> Very right. > >>>> > >>>> Obviously I could have missed this in the past.. can I ask for an example > >>>> explaining why refcount will be referenced before knowing it's a head? > >>> > >>> I think the issue is, when coming from a PFN walker (or GUP-fast), you might > >>> see "oh, this is a folio, let's lookup the head page". And you do that. > >>> > >>> Then, you try taking a reference on that head page. (see try_get_folio()). > >>> > >>> But as you didn't hold a reference on the folio yet, it can happily get > >>> freed + repurposed in the meantime, so maybe it's not a head page anymore. > >>> > >>> So if the field would get reused for something else, grabbing a reference > >>> would corrupt whatever is now stored in there. > >> > >> Not an issue before large folios, am I right? Because having a head page > >> reused as tail cannot happen iiuc with current thps if only pmd-sized, > >> because the head page is guaranteed to be pmd aligned physically. > > > > There are other users of compound pages, no? THP and hugetlb are just two examples I think. For example, I can spot __GFP_COMP in slab code. > > > > Must such compound pages would not be applicable to GUP, though, but to PFN walkers could end up trying to grab them. > > > For FS supporting large folios, their page cache pages can be any order <= PMD_ORDER. > See page_cache_ra_order() in mm/readahead.c Ah yes.. > > >> > >> I don't really know, where a hugetlb 2M head can be reused by a 1G huge > >> later right during the window of fast-gup walking. But obviously that's not > >> common either if that could ever happen. > >> > >> Maybe Matthew was referring to something else (per "in many places")? > > > > There are some other cases where PFN walkers want to identify tail pages to skip over them. See the comment in has_unmovable_pages(). Indeed. Thanks!
diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 9a607059ea11..b0d3b1d3e8ea 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -116,14 +116,14 @@ pages: succeeds on tail pages. - map/unmap of a PMD entry for the whole THP increment/decrement - folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount - goes from -1 to 0 or 0 to -1. + folio->_entire_mapcount, increment/decrement folio->_total_mapcount + and also increment/decrement folio->_nr_pages_mapped by COMPOUND_MAPPED + when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement - page->_mapcount and also increment/decrement folio->_nr_pages_mapped - when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts - the number of pages mapped by PTE. + page->_mapcount, increment/decrement folio->_total_mapcount and also + increment/decrement folio->_nr_pages_mapped when page->_mapcount goes + from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE. split_huge_page internally has to distribute the refcounts in the head page to the tail pages before clearing all PG_head/tail bits from the page diff --git a/include/linux/mm.h b/include/linux/mm.h index 6a95dfed4957..30ac004fa0ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1187,41 +1187,24 @@ static inline int page_mapcount(struct page *page) return mapcount; } -int folio_total_mapcount(struct folio *folio); - /** - * folio_mapcount() - Calculate the number of mappings of this folio. + * folio_mapcount() - Return the total number of mappings of this folio. * @folio: The folio. * - * A large folio tracks both how many times the entire folio is mapped, - * and how many times each individual page in the folio is mapped. - * This function calculates the total number of times the folio is - * mapped. - * - * Return: The number of times this folio is mapped. + * Return: The total number of times this folio is mapped. */ static inline int folio_mapcount(struct folio *folio) { if (likely(!folio_test_large(folio))) return atomic_read(&folio->_mapcount) + 1; - return folio_total_mapcount(folio); + return atomic_read(&folio->_total_mapcount) + 1; } static inline int total_mapcount(struct page *page) { if (likely(!PageCompound(page))) return atomic_read(&page->_mapcount) + 1; - return folio_total_mapcount(page_folio(page)); -} - -static inline bool folio_large_is_mapped(struct folio *folio) -{ - /* - * Reading _entire_mapcount below could be omitted if hugetlb - * participated in incrementing nr_pages_mapped when compound mapped. 
- */ - return atomic_read(&folio->_nr_pages_mapped) > 0 || - atomic_read(&folio->_entire_mapcount) >= 0; + return atomic_read(&page_folio(page)->_total_mapcount) + 1; } /** @@ -1234,7 +1217,7 @@ static inline bool folio_mapped(struct folio *folio) { if (likely(!folio_test_large(folio))) return atomic_read(&folio->_mapcount) >= 0; - return folio_large_is_mapped(folio); + return atomic_read(&folio->_total_mapcount) >= 0; } /* @@ -1246,7 +1229,7 @@ static inline bool page_mapped(struct page *page) { if (likely(!PageCompound(page))) return atomic_read(&page->_mapcount) >= 0; - return folio_large_is_mapped(page_folio(page)); + return atomic_read(&page_folio(page)->_total_mapcount) >= 0; } static inline struct page *virt_to_head_page(const void *x) @@ -2148,7 +2131,7 @@ static inline size_t folio_size(struct folio *folio) * looking at the precise mapcount of the first subpage in the folio, and * assuming the other subpages are the same. This may not be true for large * folios. If you want exact mapcounts for exact calculations, look at - * page_mapcount() or folio_total_mapcount(). + * page_mapcount() or folio_mapcount(). * * Return: The estimated number of processes sharing a folio. */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 291c05cacd48..16e66b3332c6 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -267,7 +267,8 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page) * @_folio_dtor: Which destructor to use for this folio. * @_folio_order: Do not use directly, call folio_order(). * @_entire_mapcount: Do not use directly, call folio_entire_mapcount(). - * @_nr_pages_mapped: Do not use directly, call folio_mapcount(). + * @_total_mapcount: Do not use directly, call folio_mapcount(). + * @_nr_pages_mapped: Do not use outside of rmap code (and not for hugetlb). * @_pincount: Do not use directly, call folio_maybe_dma_pinned(). * @_folio_nr_pages: Do not use directly, call folio_nr_pages(). * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h. 
@@ -321,7 +322,7 @@ struct folio { unsigned char _folio_dtor; unsigned char _folio_order; atomic_t _entire_mapcount; - atomic_t _nr_pages_mapped; + atomic_t _total_mapcount; atomic_t _pincount; #ifdef CONFIG_64BIT unsigned int _folio_nr_pages; @@ -346,6 +347,7 @@ struct folio { unsigned long _head_2a; /* public: */ struct list_head _deferred_list; + atomic_t _nr_pages_mapped; /* private: the union with struct page is transitional */ }; struct page __page_2; diff --git a/include/linux/rmap.h b/include/linux/rmap.h index a3825ce81102..a7b1c005e0c9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -210,14 +210,17 @@ void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *, static inline void __page_dup_rmap(struct page *page, bool compound) { - if (compound) { - struct folio *folio = (struct folio *)page; + struct folio *folio = page_folio(page); + if (compound) { VM_BUG_ON_PAGE(compound && !PageHead(page), page); atomic_inc(&folio->_entire_mapcount); } else { atomic_inc(&page->_mapcount); } + + if (folio_test_large(folio)) + atomic_inc(&folio->_total_mapcount); } static inline void page_dup_file_rmap(struct page *page, bool compound) diff --git a/mm/debug.c b/mm/debug.c index ee533a5ceb79..97f6f6b32ae7 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -99,10 +99,10 @@ static void __dump_page(struct page *page) page, page_ref_count(head), mapcount, mapping, page_to_pgoff(page), page_to_pfn(page)); if (compound) { - pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n", + pr_warn("head:%p order:%u entire_mapcount:%d total_mapcount:%d pincount:%d\n", head, compound_order(head), folio_entire_mapcount(folio), - folio_nr_pages_mapped(folio), + folio_mapcount(folio), atomic_read(&folio->_pincount)); } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 154c210892a1..43a2150e41e3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -583,6 +583,7 @@ void prep_transhuge_page(struct page *page) VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); INIT_LIST_HEAD(&folio->_deferred_list); + atomic_set(&folio->_nr_pages_mapped, 0); folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR); } @@ -2795,6 +2796,7 @@ void free_transhuge_page(struct page *page) } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } + VM_WARN_ON_ONCE(atomic_read(&folio->_nr_pages_mapped)); free_compound_page(page); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f498e8025cc..6a614c559ccf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio, struct page *p; atomic_set(&folio->_entire_mapcount, 0); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, 0); atomic_set(&folio->_pincount, 0); for (i = 1; i < nr_pages; i++) { @@ -2027,7 +2027,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, /* we rely on prep_new_hugetlb_folio to set the destructor */ folio_set_order(folio, order); atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_pincount, 0); return true; diff --git a/mm/internal.h b/mm/internal.h index 02a6fd36f46e..e4646fed44a5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -53,10 +53,10 @@ struct folio_batch; void page_writeback_init(void); /* - * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages, + * If a 16GB folio were mapped by PTEs of all of its 4kB pages, * its nr_pages_mapped would be 0x400000: choose the COMPOUND_MAPPED bit - * 
above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Hugetlb currently - * leaves nr_pages_mapped at 0, but avoid surprise if it participates later. + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Not applicable to + * hugetlb. */ #define COMPOUND_MAPPED 0x800000 #define FOLIO_PAGES_MAPPED (COMPOUND_MAPPED - 1) @@ -67,15 +67,6 @@ void page_writeback_init(void); */ #define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ -/* - * How many individual pages have an elevated _mapcount. Excludes - * the folio's entire_mapcount. - */ -static inline int folio_nr_pages_mapped(struct folio *folio) -{ - return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED; -} - static inline void *folio_raw_mapping(struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; @@ -420,7 +411,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order) folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); folio_set_order(folio, order); atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_pincount, 0); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 96b7c1a7d1f2..5bbd5782b543 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1009,8 +1009,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) bad_page(page, "nonzero entire_mapcount"); goto out; } - if (unlikely(atomic_read(&folio->_nr_pages_mapped))) { - bad_page(page, "nonzero nr_pages_mapped"); + if (unlikely(atomic_read(&folio->_total_mapcount) + 1)) { + bad_page(page, "nonzero total_mapcount"); goto out; } if (unlikely(atomic_read(&folio->_pincount))) { diff --git a/mm/rmap.c b/mm/rmap.c index 1f04debdc87a..a7dcae48245b 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1070,29 +1070,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, return page_vma_mkclean_one(&pvmw); } -int folio_total_mapcount(struct folio *folio) -{ - int mapcount = folio_entire_mapcount(folio); - int nr_pages; - int i; - - /* In the common case, avoid the loop when no pages mapped by PTE */ - if (folio_nr_pages_mapped(folio) == 0) - return mapcount; - /* - * Add all the PTE mappings of those pages mapped by PTE. - * Limit the loop to folio_nr_pages_mapped()? - * Perhaps: given all the raciness, that may be a good or a bad idea. 
- */ - nr_pages = folio_nr_pages(folio); - for (i = 0; i < nr_pages; i++) - mapcount += atomic_read(&folio_page(folio, i)->_mapcount); - - /* But each of those _mapcounts was based on -1 */ - mapcount += nr_pages; - return mapcount; -} - /** * page_move_anon_rmap - move a page to our anon_vma * @page: the page to move to our anon_vma @@ -1236,6 +1213,9 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, } } + if (folio_test_large(folio)) + atomic_inc(&folio->_total_mapcount); + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); @@ -1289,6 +1269,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); } + if (folio_test_large(folio)) + /* increment count (starts at -1) */ + atomic_set(&folio->_total_mapcount, 0); + __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); __page_set_anon_rmap(folio, &folio->page, vma, address, 1); } @@ -1310,14 +1294,14 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, bool compound) { atomic_t *mapped = &folio->_nr_pages_mapped; - unsigned int nr_pmdmapped = 0, first; + unsigned int nr_pmdmapped = 0, i, first; int nr = 0; VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); /* Is page being mapped by PTE? Is this its first map to be added? */ if (likely(!compound)) { - do { + for (i = 0; i < nr_pages; i++, page++) { first = atomic_inc_and_test(&page->_mapcount); if (first && folio_test_large(folio)) { first = atomic_inc_return_relaxed(mapped); @@ -1326,7 +1310,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, if (first) nr++; - } while (page++, --nr_pages > 0); + } } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1346,6 +1330,9 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, } } + if (folio_test_large(folio)) + atomic_add(compound ? 1 : nr_pages, &folio->_total_mapcount); + if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ? NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped); @@ -1398,6 +1385,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(compound && !PageHead(page), page); + if (folio_test_large(folio)) + atomic_dec(&folio->_total_mapcount); + /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ @@ -2538,6 +2528,7 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, BUG_ON(!anon_vma); /* address might be in next vma when migration races vma_merge */ first = atomic_inc_and_test(&folio->_entire_mapcount); + atomic_inc(&folio->_total_mapcount); VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) @@ -2551,6 +2542,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio, BUG_ON(address < vma->vm_start || address >= vma->vm_end); /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); + atomic_set(&folio->_total_mapcount, 0); folio_clear_hugetlb_restore_reserve(folio); __page_set_anon_rmap(folio, &folio->page, vma, address, 1); }
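For completeness, this is the kind of lockless caller discussed earlier in the thread; with the patch applied it boils down to a single atomic read of _total_mapcount even for a large folio (an illustrative helper, not part of the patch):

#include <linux/mm.h>

static bool folio_probably_shared(struct folio *folio)
{
	/*
	 * Callers often hold neither the PTL nor the folio lock, so this
	 * is inherently a snapshot; but it is now one atomic read instead
	 * of a loop over every subpage of the folio.
	 */
	return folio_mapcount(folio) > 1;
}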