| Message ID | 20230809083256.699513-1-david@redhat.com (mailing list archive) |
|---|---|
| State | New |
| Series | [mm-unstable,v1] mm: add a total mapcount for large folios |
On 9 Aug 2023, at 4:32, David Hildenbrand wrote:

> Let's track the total mapcount for all large folios in the first subpage.
>
> The total mapcount is what we actually want to know in folio_mapcount()
> and it is also sufficient for implementing folio_mapped(). This also
> gets rid of any "raceiness" concerns as expressed in
> folio_total_mapcount().
>
> With sub-PMD THP becoming more important and things looking promising
> that we will soon get support for such anon THP, we want to avoid looping
> over all pages of a folio just to calculate the total mapcount. Further,
> we may soon want to use the total mapcount in other context more
> frequently, so prepare for reading it efficiently and atomically.
>
> Make room for the total mapcount in page[1] of the folio by moving
> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb
> -- and with the total mapcount in place probably also not desirable even
> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we
> can overlay the hugetlb fields.
>
> Note that we currently don't expect any order-1 compound pages / THP in
> rmap code, and that such support is not planned. If ever desired, we could
> move the compound mapcount to another page, because it only applies to
> PMD-mappable folios that are definitely larger. Let's avoid consuming
> more space elsewhere for now -- we might need more space for a different
> purpose in some subpages soon.
>
> Maintain the total mapcount also for hugetlb pages. Use the total mapcount
> to implement folio_mapcount(), total_mapcount(), folio_mapped() and
> page_mapped().
>
> We can now get rid of folio_total_mapcount() and
> folio_large_is_mapped(), by just inlining reading of the total mapcount.
>
> _nr_pages_mapped is now only used in rmap code, so not accidentially
> externally where it might be used on arbitrary order-1 pages. The remaining
> usage is:
>
> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED
>  -> If we would account the total folio as mapped when mapping a
>     page (based on the total mapcount), we could remove that usage.
>
> (2) Detect when to add a folio to the deferred split queue
>  -> If we would apply a different heuristic, or scan using the rmap on
>     the memory reclaim path for partially mapped anon folios to
>     split them, we could remove that usage as well.
>
> So maybe, we can simplify things in the future and remove
> _nr_pages_mapped. For now, leave these things as they are, they need more
> thought. Hugh really did a nice job implementing that precise tracking
> after all.
>
> Note: Not adding order-1 sanity checks to the file_rmap functions for
>       now. For anon pages, they are certainly not required right now.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Yin Fengwei <fengwei.yin@intel.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  Documentation/mm/transhuge.rst | 12 +++++-----
>  include/linux/mm.h             | 31 ++++++------------------
>  include/linux/mm_types.h       |  6 +++--
>  include/linux/rmap.h           |  7 ++++--
>  mm/debug.c                     |  4 ++--
>  mm/huge_memory.c               |  2 ++
>  mm/hugetlb.c                   |  4 ++--
>  mm/internal.h                  | 17 ++++---------
>  mm/page_alloc.c                |  4 ++--
>  mm/rmap.c                      | 44 ++++++++++++++--------------------
>  10 files changed, 52 insertions(+), 79 deletions(-)

LGTM. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
On 09/08/2023 09:32, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
>
> [...]
>
> Cc: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing
is correct:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  10 files changed, 52 insertions(+), 79 deletions(-)
>
> [...]
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6a95dfed4957..30ac004fa0ef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> [...]
>  static inline int folio_mapcount(struct folio *folio)
>  {
>  	if (likely(!folio_test_large(folio)))
>  		return atomic_read(&folio->_mapcount) + 1;
> -	return folio_total_mapcount(folio);
> +	return atomic_read(&folio->_total_mapcount) + 1;
>  }
>
>  static inline int total_mapcount(struct page *page)

nit: couldn't total_mapcount() just be implemented as a wrapper around
folio_mapcount()? What's the benefit of PageCompound() check instead of just
getting the folio and checking if it's large? i.e:

static inline int total_mapcount(struct page *page)
{
	return folio_mapcount(page_folio(page));
}

>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) + 1;
> -	return folio_total_mapcount(page_folio(page));
> -}
> -
> -static inline bool folio_large_is_mapped(struct folio *folio)
> -{
> -	/*
> -	 * Reading _entire_mapcount below could be omitted if hugetlb
> -	 * participated in incrementing nr_pages_mapped when compound mapped.
> -	 */
> -	return atomic_read(&folio->_nr_pages_mapped) > 0 ||
> -	       atomic_read(&folio->_entire_mapcount) >= 0;
> +	return atomic_read(&page_folio(page)->_total_mapcount) + 1;
>  }
>
> [...]
>
> @@ -1246,7 +1229,7 @@ static inline bool page_mapped(struct page *page)
>  {
>  	if (likely(!PageCompound(page)))
>  		return atomic_read(&page->_mapcount) >= 0;
> -	return folio_large_is_mapped(page_folio(page));
> +	return atomic_read(&page_folio(page)->_total_mapcount) >= 0;

nit: same here; couldn't this be a wrapper around folio_mapped()?

>  }
>
> [...]
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f498e8025cc..6a614c559ccf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>  	struct page *p;
>
>  	atomic_set(&folio->_entire_mapcount, 0);
> -	atomic_set(&folio->_nr_pages_mapped, 0);
> +	atomic_set(&folio->_total_mapcount, 0);

Just checking this is definitely what you intended? _total_mapcount is -1 when
it means "no pages mapped", so 0 means 1 page mapped?

>  	atomic_set(&folio->_pincount, 0);
>
> [...]
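[Editor's note: a minimal sketch of the wrapper approach Ryan suggests above, assuming the folio_mapcount() and folio_mapped() definitions introduced by this patch. It is not part of the posted series; it only spells out the two one-liners the review hints at.]

```c
/* Sketch only -- not from the posted patch. */
static inline int total_mapcount(struct page *page)
{
	/* page_folio() also handles order-0 pages: folio == page. */
	return folio_mapcount(page_folio(page));
}

static inline bool page_mapped(struct page *page)
{
	return folio_mapped(page_folio(page));
}
```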
On 09.08.23 21:07, Ryan Roberts wrote:
> On 09/08/2023 09:32, David Hildenbrand wrote:
>> Let's track the total mapcount for all large folios in the first subpage.
>>
>> [...]
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing
> is correct:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

Thanks for the review!

[...]

>>  static inline int total_mapcount(struct page *page)
>
> nit: couldn't total_mapcount() just be implemented as a wrapper around
> folio_mapcount()? What's the benefit of PageCompound() check instead of just
> getting the folio and checking if it's large? i.e:

Good point, let me take a look tomorrow if the compiler can optimize in both
cases equally well.

[...]

>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 5f498e8025cc..6a614c559ccf 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>>  	struct page *p;
>>
>>  	atomic_set(&folio->_entire_mapcount, 0);
>> -	atomic_set(&folio->_nr_pages_mapped, 0);
>> +	atomic_set(&folio->_total_mapcount, 0);
>
> Just checking this is definitely what you intended? _total_mapcount is -1 when
> it means "no pages mapped", so 0 means 1 page mapped?

I was blindly doing what _entire_mapcount is doing: zeroing out the values. ;)

But let's look into the details: in __destroy_compound_gigantic_folio(), we're
manually dissolving the whole compound page. So instead of actually returning a
compound page to the buddy (where we would make sure the mapcounts are -1, to
then zero them out !), we simply zero-out the fields we use and then dissolve
the compound page: to be left with a bunch of order-0 pages where the memmap is
in a clean state.

(the buddy doesn't handle that page order, so we have to do things manually to
get to order-0 pages we can reuse or free)
On Wed, Aug 09, 2023 at 08:07:43PM +0100, Ryan Roberts wrote:
> > +++ b/mm/hugetlb.c
> > @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
> >  	struct page *p;
> >
> >  	atomic_set(&folio->_entire_mapcount, 0);
> > -	atomic_set(&folio->_nr_pages_mapped, 0);
> > +	atomic_set(&folio->_total_mapcount, 0);
>
> Just checking this is definitely what you intended? _total_mapcount is -1 when
> it means "no pages mapped", so 0 means 1 page mapped?

We're destroying the page here, so rather than setting the meaning of
this, we're setting the contents of this memory to 0.

Other thoughts that ran through my mind ... can we wrap? I don't think
we can; we always increment total_mapcount by 1, no matter whether we're
incrementing entire_mapcount or an individual page's mapcount, and we
always call folio_get() first, so we can't increment total_mapcount
past 2^32 because folio_get() will die first. We might be able to
wrap past 2^31, but I don't think so.

I had some other thoughts, but I convinced myself they were all OK.
On 09.08.23 21:21, Matthew Wilcox wrote:
> On Wed, Aug 09, 2023 at 08:07:43PM +0100, Ryan Roberts wrote:
>>> +++ b/mm/hugetlb.c
>>> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>>>  	struct page *p;
>>>
>>>  	atomic_set(&folio->_entire_mapcount, 0);
>>> -	atomic_set(&folio->_nr_pages_mapped, 0);
>>> +	atomic_set(&folio->_total_mapcount, 0);
>>
>> Just checking this is definitely what you intended? _total_mapcount is -1 when
>> it means "no pages mapped", so 0 means 1 page mapped?
>
> We're destroying the page here, so rather than setting the meaning of
> this, we're setting the contents of this memory to 0.
>
> Other thoughts that ran through my mind ... can we wrap? I don't think
> we can; we always increment total_mapcount by 1, no matter whether we're
> incrementing entire_mapcount or an individual page's mapcount, and we
> always call folio_get() first, so we can't increment total_mapcount
> past 2^32 because folio_get() will die first. We might be able to
> wrap past 2^31, but I don't think so.

From my understanding, if we wrap the total mapcount, we already wrapped the
refcount -- as you say, grabbing a reference ahead of time for each mapping is
mandatory. Both are 31bit values. We could treat the total mapcount as an
unsigned int, but that's rather future work.

Also, even folio_mapcount() and total_mapcount() return an "int" as of now.

But yes, I also thought about that. In the future we might want (at least) for
bigger folios refcount+total_mapcount to be 64bit. Or we manage to decouple
both and only have the total_mapcount be 64bit only.
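[Editor's note: the invariant Matthew and David rely on in the exchange above is that every mapping is preceded by folio_get(), so the total mapcount is bounded by the 31-bit folio refcount. A hypothetical debug check, written against the _total_mapcount field this patch adds, could look like the sketch below; the helper name is invented for illustration and exists neither in the kernel nor in this series.]

```c
/* Hypothetical debug helper -- illustration only. */
static inline void folio_check_mapcount_vs_refcount(struct folio *folio)
{
	/*
	 * Each mapping holds a folio reference taken beforehand, so the
	 * total mapcount cannot exceed the refcount; a wrap of the former
	 * would imply an (earlier) wrap of the latter.
	 */
	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_ref_count(folio));
}
```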
Hi, David,

Some pure questions below..

On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
>
> The total mapcount is what we actually want to know in folio_mapcount()
> and it is also sufficient for implementing folio_mapped(). This also
> gets rid of any "raceiness" concerns as expressed in
> folio_total_mapcount().

Any more information for that "raciness" described here?

>
> With sub-PMD THP becoming more important and things looking promising
> that we will soon get support for such anon THP, we want to avoid looping
> over all pages of a folio just to calculate the total mapcount. Further,
> we may soon want to use the total mapcount in other context more
> frequently, so prepare for reading it efficiently and atomically.

Any (perhaps existing) discussion on reduced loops vs added atomic
field/ops?

I had a feeling that there's some discussion behind the proposal of this
patch, if that's the case it'll be great to attach the link in the commit
log.

Thanks,
On 8/10/23 03:26, David Hildenbrand wrote:
> On 09.08.23 21:21, Matthew Wilcox wrote:
>> [...]
>>
>> Other thoughts that ran through my mind ... can we wrap? I don't think
>> we can; we always increment total_mapcount by 1, no matter whether we're
>> incrementing entire_mapcount or an individual page's mapcount, and we
>> always call folio_get() first, so we can't increment total_mapcount
>> past 2^32 because folio_get() will die first. We might be able to
>> wrap past 2^31, but I don't think so.
>
> From my understanding, if we wrap the total mapcount, we already wrapped the
> refcount -- as you say, grabbing a reference ahead of time for each mapping
> is mandatory. Both are 31bit values. We could treat the total mapcount as an
> unsigned int, but that's rather future work.
>
> Also, even folio_mapcount() and total_mapcount() return an "int" as of now.
>
> But yes, I also thought about that. In the future we might want (at least)
> for bigger folios refcount+total_mapcount to be 64bit. Or we manage to
> decouple both and only have the total_mapcount be 64bit only.

This means pvmw may need to check more than 2^32 pte entries. :). I hope we
have other way to get bigger folio and keep mapcount not too big.

Regards
Yin, Fengwei
On 8/9/23 16:32, David Hildenbrand wrote:
> Let's track the total mapcount for all large folios in the first subpage.
>
> [...]
>
> Cc: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

Regards
Yin, Fengwei

> ---
> [...]
On Wed, Aug 09, 2023 at 05:23:46PM -0400, Peter Xu wrote:
> Hi, David,
>
> Some pure questions below..
>
> On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote:
> > Let's track the total mapcount for all large folios in the first subpage.
> >
> > The total mapcount is what we actually want to know in folio_mapcount()
> > and it is also sufficient for implementing folio_mapped(). This also
> > gets rid of any "raceiness" concerns as expressed in
> > folio_total_mapcount().
>
> Any more information for that "raciness" described here?

UTSL.

	/*
	 * Add all the PTE mappings of those pages mapped by PTE.
	 * Limit the loop to folio_nr_pages_mapped()?
	 * Perhaps: given all the raciness, that may be a good or a bad idea.
	 */
On 10.08.23 05:25, Matthew Wilcox wrote:
> On Wed, Aug 09, 2023 at 05:23:46PM -0400, Peter Xu wrote:
>> Hi, David,
>>
>> Some pure questions below..
>>
>> On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote:
>>> Let's track the total mapcount for all large folios in the first subpage.
>>>
>>> The total mapcount is what we actually want to know in folio_mapcount()
>>> and it is also sufficient for implementing folio_mapped(). This also
>>> gets rid of any "raceiness" concerns as expressed in
>>> folio_total_mapcount().
>>
>> Any more information for that "raciness" described here?
>
> UTSL.
>
> 	/*
> 	 * Add all the PTE mappings of those pages mapped by PTE.
> 	 * Limit the loop to folio_nr_pages_mapped()?
> 	 * Perhaps: given all the raciness, that may be a good or a bad idea.
> 	 */

Yes, that comment from Hugh primarily discusses how we could possibly optimize
the loop, and if relying on folio_nr_pages_mapped() to reduce the iterations
would be racy. As far as I can see, there are cases where "it would be
certainly a bad idea" :)

In the other comment in that function, it's also made clear what the
traditional behavior with PMD-mappable THP was "In the common case, avoid the
loop when no pages mapped by PTE", which will no longer hold with sub-PMD THP.
On 09.08.23 23:23, Peter Xu wrote:
> Hi, David,
>
> Some pure questions below..

Hi Peter,

thanks for having a look!

[...]

>> With sub-PMD THP becoming more important and things looking promising
>> that we will soon get support for such anon THP, we want to avoid looping
>> over all pages of a folio just to calculate the total mapcount. Further,
>> we may soon want to use the total mapcount in other context more
>> frequently, so prepare for reading it efficiently and atomically.
>
> Any (perhaps existing) discussion on reduced loops vs added atomic
> field/ops?

So far it's not been raised as a concern, so no existing discussion.

For order-0 pages the behavior is unchanged.

For PMD-mapped THP and hugetlb it's most certainly noise compared to the other
activities when (un)mapping these large pages.

For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is
really significant (judging from my experience on managing PageAnonExclusive
using set_bit/test_bit/clear_bit when (un)mapping anon pages).

As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be
batching where possible (and Ryan is working on some more rmap batching).
There, managing the subpage mapcount dominates all other overhead
significantly.

>
> I had a feeling that there's some discussion behind the proposal of this
> patch, if that's the case it'll be great to attach the link in the commit
> log.

There were (mostly offline) discussions on how to sort out some other issues
that PTE-mapped THP are facing, and how to eventually get rid of the subpage
mapcounts (one consumer being _nr_pages_mapped, as spelled out in the patch
description). Having a proper total mapcount available was discussed as one
building block.

I don't think I have anything of value to link that would make sense for the
patch as is, as this patch is mostly independent from all that.
On 09/08/2023 20:17, David Hildenbrand wrote: > On 09.08.23 21:07, Ryan Roberts wrote: >> On 09/08/2023 09:32, David Hildenbrand wrote: >>> Let's track the total mapcount for all large folios in the first subpage. >>> >>> The total mapcount is what we actually want to know in folio_mapcount() >>> and it is also sufficient for implementing folio_mapped(). This also >>> gets rid of any "raceiness" concerns as expressed in >>> folio_total_mapcount(). >>> >>> With sub-PMD THP becoming more important and things looking promising >>> that we will soon get support for such anon THP, we want to avoid looping >>> over all pages of a folio just to calculate the total mapcount. Further, >>> we may soon want to use the total mapcount in other context more >>> frequently, so prepare for reading it efficiently and atomically. >>> >>> Make room for the total mapcount in page[1] of the folio by moving >>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>> -- and with the total mapcount in place probably also not desirable even >>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>> can overlay the hugetlb fields. >>> >>> Note that we currently don't expect any order-1 compound pages / THP in >>> rmap code, and that such support is not planned. If ever desired, we could >>> move the compound mapcount to another page, because it only applies to >>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>> more space elsewhere for now -- we might need more space for a different >>> purpose in some subpages soon. >>> >>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>> page_mapped(). >>> >>> We can now get rid of folio_total_mapcount() and >>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>> >>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>> externally where it might be used on arbitrary order-1 pages. The remaining >>> usage is: >>> >>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>> Â -> If we would account the total folio as mapped when mapping a >>> Â Â Â Â page (based on the total mapcount), we could remove that usage. >>> >>> (2) Detect when to add a folio to the deferred split queue >>> Â -> If we would apply a different heuristic, or scan using the rmap on >>> Â Â Â Â the memory reclaim path for partially mapped anon folios to >>> Â Â Â Â split them, we could remove that usage as well. >>> >>> So maybe, we can simplify things in the future and remove >>> _nr_pages_mapped. For now, leave these things as they are, they need more >>> thought. Hugh really did a nice job implementing that precise tracking >>> after all. >>> >>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>> Â Â Â Â Â Â now. For anon pages, they are certainly not required right now. >>> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Jonathan Corbet <corbet@lwn.net> >>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>> Cc: Hugh Dickins <hughd@google.com> >>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>> Cc: Yang Shi <shy828301@gmail.com> >>> Cc: Zi Yan <ziy@nvidia.com> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >> >> Other than the nits and query on zeroing _total_mapcount below, LGTM. 
If zeroing >> is correct: >> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> > > Thanks for the review! > > [...] > >>> static inline int total_mapcount(struct page *page) >> >> nit: couldn't total_mapcount() just be implemented as a wrapper around >> folio_mapcount()? What's the benefit of PageCompound() check instead of just >> getting the folio and checking if it's large? i.e: > > Good point, let me take a look tomorrow if the compiler can optimize in both > cases equally well. > > [...] > >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>> index 5f498e8025cc..6a614c559ccf 100644 >>> --- a/mm/hugetlb.c >>> +++ b/mm/hugetlb.c >>> @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct >>> folio *folio, >>> struct page *p; >>> atomic_set(&folio->_entire_mapcount, 0); >>> - atomic_set(&folio->_nr_pages_mapped, 0); >>> + atomic_set(&folio->_total_mapcount, 0); >> >> Just checking this is definitely what you intended? _total_mapcount is -1 when >> it means "no pages mapped", so 0 means 1 page mapped? > > I was blindly doing what _entire_mapcount is doing: zeroing out the values. ;) > > But let's look into the details: in __destroy_compound_gigantic_folio(), we're > manually dissolving the whole compound page. So instead of actually returning a > compound page to the buddy (where we would make sure the mapcounts are -1, to > then zero them out !), we simply zero-out the fields we use and then dissolve > the compound page: to be left with a bunch of order-0 pages where the memmap is > in a clean state. > > (the buddy doesn't handle that page order, so we have to do things manually to > get to order-0 pages we can reuse or free) > Yeah fair enough, thanks for the explanation.
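To make Ryan's nit concrete, here is a minimal sketch of the suggested wrapper, assuming folio_mapcount() reads the new total mapcount for large folios as this patch makes it do (illustration only, not necessarily the form that lands):

static inline int total_mapcount(struct page *page)
{
	/* Let the folio helper handle the small vs. large folio distinction. */
	return folio_mapcount(page_folio(page));
}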
On 10/08/2023 09:59, David Hildenbrand wrote: > On 09.08.23 23:23, Peter Xu wrote: >> Hi, David, >> >> Some pure questions below.. > > Hi Peter, > > thanks for having a look! > > [...] > >>> With sub-PMD THP becoming more important and things looking promising >>> that we will soon get support for such anon THP, we want to avoid looping >>> over all pages of a folio just to calculate the total mapcount. Further, >>> we may soon want to use the total mapcount in other context more >>> frequently, so prepare for reading it efficiently and atomically. >> >> Any (perhaps existing) discussion on reduced loops vs added atomic >> field/ops? > > So far it's not been raised as a concern, so no existing discussion. > > For order-0 pages the behavior is unchanged. > > For PMD-mapped THP and hugetlb it's most certainly noise compared to the other > activities when (un)mapping these large pages. > > For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is > really significant (judging from my experience on managing PageAnonExclusive > using set_bit/test_bit/clear_bit when (un)mapping anon pages). > > As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be > batching where possible (and Ryan is working on some more rmap batching). Yes, I've just posted [1] which batches the rmap removal. That would allow you to convert the per-page atomic_dec() into a (usually) single per-large-folio atomic_sub(). [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ > There, > managing the subpage mapcount dominates all other overhead significantly. > >> >> I had a feeling that there's some discussion behind the proposal of this >> patch, if that's the case it'll be great to attach the link in the commit >> log. > > There were (mostly offline) discussions on how to sort out some other issues > that PTE-mapped THP are facing, and how to eventually get rid of the subpage > mapcounts (once consumer being _nr_pages_mapped as spelled out in the patch > description). Having a proper total mapcount available was discussed as one > building block. > > I don't think I have anything of value to link that would make sense for the > patch as is, as this patch is mostly independent from all that. >
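As an illustration of what the batching Ryan mentions buys for the counter added by this patch (field name _total_mapcount taken from the patch diff; this is a sketch, not code from either series):

static inline void remove_rmap_unbatched(struct folio *folio, int nr)
{
	int i;

	/* One atomic op per PTE-mapped page being unmapped. */
	for (i = 0; i < nr; i++)
		atomic_dec(&folio->_total_mapcount);
}

static inline void remove_rmap_batched(struct folio *folio, int nr)
{
	/* A single atomic op for the whole unmapped range. */
	atomic_sub(nr, &folio->_total_mapcount);
}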
On 09.08.23 21:17, David Hildenbrand wrote: > On 09.08.23 21:07, Ryan Roberts wrote: >> On 09/08/2023 09:32, David Hildenbrand wrote: >>> Let's track the total mapcount for all large folios in the first subpage. >>> >>> The total mapcount is what we actually want to know in folio_mapcount() >>> and it is also sufficient for implementing folio_mapped(). This also >>> gets rid of any "raceiness" concerns as expressed in >>> folio_total_mapcount(). >>> >>> With sub-PMD THP becoming more important and things looking promising >>> that we will soon get support for such anon THP, we want to avoid looping >>> over all pages of a folio just to calculate the total mapcount. Further, >>> we may soon want to use the total mapcount in other context more >>> frequently, so prepare for reading it efficiently and atomically. >>> >>> Make room for the total mapcount in page[1] of the folio by moving >>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>> -- and with the total mapcount in place probably also not desirable even >>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>> can overlay the hugetlb fields. >>> >>> Note that we currently don't expect any order-1 compound pages / THP in >>> rmap code, and that such support is not planned. If ever desired, we could >>> move the compound mapcount to another page, because it only applies to >>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>> more space elsewhere for now -- we might need more space for a different >>> purpose in some subpages soon. >>> >>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>> page_mapped(). >>> >>> We can now get rid of folio_total_mapcount() and >>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>> >>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>> externally where it might be used on arbitrary order-1 pages. The remaining >>> usage is: >>> >>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>> -> If we would account the total folio as mapped when mapping a >>> page (based on the total mapcount), we could remove that usage. >>> >>> (2) Detect when to add a folio to the deferred split queue >>> -> If we would apply a different heuristic, or scan using the rmap on >>> the memory reclaim path for partially mapped anon folios to >>> split them, we could remove that usage as well. >>> >>> So maybe, we can simplify things in the future and remove >>> _nr_pages_mapped. For now, leave these things as they are, they need more >>> thought. Hugh really did a nice job implementing that precise tracking >>> after all. >>> >>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>> now. For anon pages, they are certainly not required right now. >>> >>> Cc: Andrew Morton <akpm@linux-foundation.org> >>> Cc: Jonathan Corbet <corbet@lwn.net> >>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>> Cc: Hugh Dickins <hughd@google.com> >>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>> Cc: Yang Shi <shy828301@gmail.com> >>> Cc: Zi Yan <ziy@nvidia.com> >>> Signed-off-by: David Hildenbrand <david@redhat.com> >> >> Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing >> is correct: >> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> > > Thanks for the review! 
> > [...] > >>> >>> static inline int total_mapcount(struct page *page) >> >> nit: couldn't total_mapcount() just be implemented as a wrapper around >> folio_mapcount()? What's the benefit of PageCompound() check instead of just >> getting the folio and checking if it's large? i.e: > > Good point, let me take a look tomorrow if the compiler can optimize in > both cases equally well. I checked by adjusting total_mapcount(): Before: if (PageTransHuge(page) && total_mapcount(page) > 1) ffffffff81411931: 4c 89 e7 mov %r12,%rdi ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> ffffffff81411939: 85 c0 test %eax,%eax ffffffff8141193b: 74 29 je ffffffff81411966 <migrate_misplaced_page+0x166> ffffffff8141193d: 49 8b 04 24 mov (%r12),%rax return test_bit(PG_head, &page->flags) || ffffffff81411941: a9 00 00 01 00 test $0x10000,%eax ffffffff81411946: 0f 85 1f 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> READ_ONCE(page->compound_head) & 1; ffffffff8141194c: 49 8b 44 24 08 mov 0x8(%r12),%rax return test_bit(PG_head, &page->flags) || ffffffff81411951: a8 01 test $0x1,%al ffffffff81411953: 0f 85 12 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> ffffffff81411959: 41 8b 44 24 30 mov 0x30(%r12),%eax return atomic_read(&page->_mapcount) + 1; ffffffff8141195e: 83 c0 01 add $0x1,%eax ffffffff81411961: 83 f8 01 cmp $0x1,%eax ffffffff81411964: 7f 77 jg ffffffff814119dd <migrate_misplaced_page+0x1dd> So a total of 10 instructions after handling the mov/call/test/je for PageTransHuge(). After: if (PageTransHuge(page) && total_mapcount(page) > 1) ffffffff81411931: 4c 89 e7 mov %r12,%rdi ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> ffffffff81411939: 85 c0 test %eax,%eax ffffffff8141193b: 74 2f je ffffffff8141196c <migrate_misplaced_page+0x16c> unsigned long head = READ_ONCE(page->compound_head); ffffffff8141193d: 49 8b 44 24 08 mov 0x8(%r12),%rax if (unlikely(head & 1)) ffffffff81411942: a8 01 test $0x1,%al ffffffff81411944: 0f 85 fc 05 00 00 jne ffffffff81411f46 <migrate_misplaced_page+0x746> ffffffff8141194a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) return page; ffffffff8141194f: 4c 89 e0 mov %r12,%rax ffffffff81411952: 48 8b 10 mov (%rax),%rdx if (likely(!folio_test_large(folio))) ffffffff81411955: f7 c2 00 00 01 00 test $0x10000,%edx ffffffff8141195b: 0f 85 da 05 00 00 jne ffffffff81411f3b <migrate_misplaced_page+0x73b> ffffffff81411961: 8b 40 30 mov 0x30(%rax),%eax return atomic_read(&folio->_mapcount) + 1; ffffffff81411964: 83 c0 01 add $0x1,%eax ffffffff81411967: 83 f8 01 cmp $0x1,%eax ffffffff8141196a: 7f 77 jg ffffffff814119e3 <migrate_misplaced_page+0x1e3> So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je for PageTransHuge(). Essentially one more MOV instruction. I guess nobody really cares :)
On 10.08.23 13:14, David Hildenbrand wrote: > On 09.08.23 21:17, David Hildenbrand wrote: >> On 09.08.23 21:07, Ryan Roberts wrote: >>> On 09/08/2023 09:32, David Hildenbrand wrote: >>>> Let's track the total mapcount for all large folios in the first subpage. >>>> >>>> The total mapcount is what we actually want to know in folio_mapcount() >>>> and it is also sufficient for implementing folio_mapped(). This also >>>> gets rid of any "raceiness" concerns as expressed in >>>> folio_total_mapcount(). >>>> >>>> With sub-PMD THP becoming more important and things looking promising >>>> that we will soon get support for such anon THP, we want to avoid looping >>>> over all pages of a folio just to calculate the total mapcount. Further, >>>> we may soon want to use the total mapcount in other context more >>>> frequently, so prepare for reading it efficiently and atomically. >>>> >>>> Make room for the total mapcount in page[1] of the folio by moving >>>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>>> -- and with the total mapcount in place probably also not desirable even >>>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>>> can overlay the hugetlb fields. >>>> >>>> Note that we currently don't expect any order-1 compound pages / THP in >>>> rmap code, and that such support is not planned. If ever desired, we could >>>> move the compound mapcount to another page, because it only applies to >>>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>>> more space elsewhere for now -- we might need more space for a different >>>> purpose in some subpages soon. >>>> >>>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>>> page_mapped(). >>>> >>>> We can now get rid of folio_total_mapcount() and >>>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>>> >>>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>>> externally where it might be used on arbitrary order-1 pages. The remaining >>>> usage is: >>>> >>>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>>> -> If we would account the total folio as mapped when mapping a >>>> page (based on the total mapcount), we could remove that usage. >>>> >>>> (2) Detect when to add a folio to the deferred split queue >>>> -> If we would apply a different heuristic, or scan using the rmap on >>>> the memory reclaim path for partially mapped anon folios to >>>> split them, we could remove that usage as well. >>>> >>>> So maybe, we can simplify things in the future and remove >>>> _nr_pages_mapped. For now, leave these things as they are, they need more >>>> thought. Hugh really did a nice job implementing that precise tracking >>>> after all. >>>> >>>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>>> now. For anon pages, they are certainly not required right now. >>>> >>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>> Cc: Hugh Dickins <hughd@google.com> >>>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>>> Cc: Yang Shi <shy828301@gmail.com> >>>> Cc: Zi Yan <ziy@nvidia.com> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> >>> Other than the nits and query on zeroing _total_mapcount below, LGTM. 
If zeroing >>> is correct: >>> >>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> >> >> Thanks for the review! >> >> [...] >> >>>> >>>> static inline int total_mapcount(struct page *page) >>> >>> nit: couldn't total_mapcount() just be implemented as a wrapper around >>> folio_mapcount()? What's the benefit of PageCompound() check instead of just >>> getting the folio and checking if it's large? i.e: >> >> Good point, let me take a look tomorrow if the compiler can optimize in >> both cases equally well. > > I checked by adjusting total_mapcount(): > > Before: > > if (PageTransHuge(page) && total_mapcount(page) > 1) > ffffffff81411931: 4c 89 e7 mov %r12,%rdi > ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> > ffffffff81411939: 85 c0 test %eax,%eax > ffffffff8141193b: 74 29 je ffffffff81411966 <migrate_misplaced_page+0x166> > ffffffff8141193d: 49 8b 04 24 mov (%r12),%rax > return test_bit(PG_head, &page->flags) || > ffffffff81411941: a9 00 00 01 00 test $0x10000,%eax > ffffffff81411946: 0f 85 1f 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> > READ_ONCE(page->compound_head) & 1; > ffffffff8141194c: 49 8b 44 24 08 mov 0x8(%r12),%rax > return test_bit(PG_head, &page->flags) || > ffffffff81411951: a8 01 test $0x1,%al > ffffffff81411953: 0f 85 12 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> > ffffffff81411959: 41 8b 44 24 30 mov 0x30(%r12),%eax > return atomic_read(&page->_mapcount) + 1; > ffffffff8141195e: 83 c0 01 add $0x1,%eax > ffffffff81411961: 83 f8 01 cmp $0x1,%eax > ffffffff81411964: 7f 77 jg ffffffff814119dd <migrate_misplaced_page+0x1dd> > > So a total of 10 instructions after handling the mov/call/test/je for PageTransHuge(). > > After: > > if (PageTransHuge(page) && total_mapcount(page) > 1) > ffffffff81411931: 4c 89 e7 mov %r12,%rdi > ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> > ffffffff81411939: 85 c0 test %eax,%eax > ffffffff8141193b: 74 2f je ffffffff8141196c <migrate_misplaced_page+0x16c> > unsigned long head = READ_ONCE(page->compound_head); > ffffffff8141193d: 49 8b 44 24 08 mov 0x8(%r12),%rax > if (unlikely(head & 1)) > ffffffff81411942: a8 01 test $0x1,%al > ffffffff81411944: 0f 85 fc 05 00 00 jne ffffffff81411f46 <migrate_misplaced_page+0x746> > ffffffff8141194a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > return page; > ffffffff8141194f: 4c 89 e0 mov %r12,%rax > ffffffff81411952: 48 8b 10 mov (%rax),%rdx > if (likely(!folio_test_large(folio))) > ffffffff81411955: f7 c2 00 00 01 00 test $0x10000,%edx > ffffffff8141195b: 0f 85 da 05 00 00 jne ffffffff81411f3b <migrate_misplaced_page+0x73b> > ffffffff81411961: 8b 40 30 mov 0x30(%rax),%eax > return atomic_read(&folio->_mapcount) + 1; > ffffffff81411964: 83 c0 01 add $0x1,%eax > ffffffff81411967: 83 f8 01 cmp $0x1,%eax > ffffffff8141196a: 7f 77 jg ffffffff814119e3 <migrate_misplaced_page+0x1e3> > > So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je for PageTransHuge(). > > Essentially one more MOV instruction. > > I guess nobody really cares :) > Also, let's simply do: static inline bool folio_mapped(struct folio *folio) { folio_mapcount(folio) > 1; } So this all cleans up quite a bit.
On 10.08.23 13:27, David Hildenbrand wrote: > On 10.08.23 13:14, David Hildenbrand wrote: >> On 09.08.23 21:17, David Hildenbrand wrote: >>> On 09.08.23 21:07, Ryan Roberts wrote: >>>> On 09/08/2023 09:32, David Hildenbrand wrote: >>>>> Let's track the total mapcount for all large folios in the first subpage. >>>>> >>>>> The total mapcount is what we actually want to know in folio_mapcount() >>>>> and it is also sufficient for implementing folio_mapped(). This also >>>>> gets rid of any "raceiness" concerns as expressed in >>>>> folio_total_mapcount(). >>>>> >>>>> With sub-PMD THP becoming more important and things looking promising >>>>> that we will soon get support for such anon THP, we want to avoid looping >>>>> over all pages of a folio just to calculate the total mapcount. Further, >>>>> we may soon want to use the total mapcount in other context more >>>>> frequently, so prepare for reading it efficiently and atomically. >>>>> >>>>> Make room for the total mapcount in page[1] of the folio by moving >>>>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>>>> -- and with the total mapcount in place probably also not desirable even >>>>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>>>> can overlay the hugetlb fields. >>>>> >>>>> Note that we currently don't expect any order-1 compound pages / THP in >>>>> rmap code, and that such support is not planned. If ever desired, we could >>>>> move the compound mapcount to another page, because it only applies to >>>>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>>>> more space elsewhere for now -- we might need more space for a different >>>>> purpose in some subpages soon. >>>>> >>>>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>>>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>>>> page_mapped(). >>>>> >>>>> We can now get rid of folio_total_mapcount() and >>>>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>>>> >>>>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>>>> externally where it might be used on arbitrary order-1 pages. The remaining >>>>> usage is: >>>>> >>>>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>>>> -> If we would account the total folio as mapped when mapping a >>>>> page (based on the total mapcount), we could remove that usage. >>>>> >>>>> (2) Detect when to add a folio to the deferred split queue >>>>> -> If we would apply a different heuristic, or scan using the rmap on >>>>> the memory reclaim path for partially mapped anon folios to >>>>> split them, we could remove that usage as well. >>>>> >>>>> So maybe, we can simplify things in the future and remove >>>>> _nr_pages_mapped. For now, leave these things as they are, they need more >>>>> thought. Hugh really did a nice job implementing that precise tracking >>>>> after all. >>>>> >>>>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>>>> now. For anon pages, they are certainly not required right now. 
>>>>> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>>> Cc: Hugh Dickins <hughd@google.com> >>>>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>>>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>>>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>>>> Cc: Yang Shi <shy828301@gmail.com> >>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> >>>> Other than the nits and query on zeroing _total_mapcount below, LGTM. If zeroing >>>> is correct: >>>> >>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> >>> >>> Thanks for the review! >>> >>> [...] >>> >>>>> >>>>> static inline int total_mapcount(struct page *page) >>>> >>>> nit: couldn't total_mapcount() just be implemented as a wrapper around >>>> folio_mapcount()? What's the benefit of PageCompound() check instead of just >>>> getting the folio and checking if it's large? i.e: >>> >>> Good point, let me take a look tomorrow if the compiler can optimize in >>> both cases equally well. >> >> I checked by adjusting total_mapcount(): >> >> Before: >> >> if (PageTransHuge(page) && total_mapcount(page) > 1) >> ffffffff81411931: 4c 89 e7 mov %r12,%rdi >> ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> >> ffffffff81411939: 85 c0 test %eax,%eax >> ffffffff8141193b: 74 29 je ffffffff81411966 <migrate_misplaced_page+0x166> >> ffffffff8141193d: 49 8b 04 24 mov (%r12),%rax >> return test_bit(PG_head, &page->flags) || >> ffffffff81411941: a9 00 00 01 00 test $0x10000,%eax >> ffffffff81411946: 0f 85 1f 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> >> READ_ONCE(page->compound_head) & 1; >> ffffffff8141194c: 49 8b 44 24 08 mov 0x8(%r12),%rax >> return test_bit(PG_head, &page->flags) || >> ffffffff81411951: a8 01 test $0x1,%al >> ffffffff81411953: 0f 85 12 01 00 00 jne ffffffff81411a6b <migrate_misplaced_page+0x26b> >> ffffffff81411959: 41 8b 44 24 30 mov 0x30(%r12),%eax >> return atomic_read(&page->_mapcount) + 1; >> ffffffff8141195e: 83 c0 01 add $0x1,%eax >> ffffffff81411961: 83 f8 01 cmp $0x1,%eax >> ffffffff81411964: 7f 77 jg ffffffff814119dd <migrate_misplaced_page+0x1dd> >> >> So a total of 10 instructions after handling the mov/call/test/je for PageTransHuge(). 
>> >> After: >> >> if (PageTransHuge(page) && total_mapcount(page) > 1) >> ffffffff81411931: 4c 89 e7 mov %r12,%rdi >> ffffffff81411934: e8 f7 b1 ff ff call ffffffff8140cb30 <PageTransHuge> >> ffffffff81411939: 85 c0 test %eax,%eax >> ffffffff8141193b: 74 2f je ffffffff8141196c <migrate_misplaced_page+0x16c> >> unsigned long head = READ_ONCE(page->compound_head); >> ffffffff8141193d: 49 8b 44 24 08 mov 0x8(%r12),%rax >> if (unlikely(head & 1)) >> ffffffff81411942: a8 01 test $0x1,%al >> ffffffff81411944: 0f 85 fc 05 00 00 jne ffffffff81411f46 <migrate_misplaced_page+0x746> >> ffffffff8141194a: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) >> return page; >> ffffffff8141194f: 4c 89 e0 mov %r12,%rax >> ffffffff81411952: 48 8b 10 mov (%rax),%rdx >> if (likely(!folio_test_large(folio))) >> ffffffff81411955: f7 c2 00 00 01 00 test $0x10000,%edx >> ffffffff8141195b: 0f 85 da 05 00 00 jne ffffffff81411f3b <migrate_misplaced_page+0x73b> >> ffffffff81411961: 8b 40 30 mov 0x30(%rax),%eax >> return atomic_read(&folio->_mapcount) + 1; >> ffffffff81411964: 83 c0 01 add $0x1,%eax >> ffffffff81411967: 83 f8 01 cmp $0x1,%eax >> ffffffff8141196a: 7f 77 jg ffffffff814119e3 <migrate_misplaced_page+0x1e3> >> >> So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je for PageTransHuge(). >> >> Essentially one more MOV instruction. >> >> I guess nobody really cares :) >> > > Also, let's simply do: > > static inline bool folio_mapped(struct folio *folio) > { > folio_mapcount(folio) > 1; > 0, obviously :)
On 10/08/2023 12:32, David Hildenbrand wrote: > On 10.08.23 13:27, David Hildenbrand wrote: >> On 10.08.23 13:14, David Hildenbrand wrote: >>> On 09.08.23 21:17, David Hildenbrand wrote: >>>> On 09.08.23 21:07, Ryan Roberts wrote: >>>>> On 09/08/2023 09:32, David Hildenbrand wrote: >>>>>> Let's track the total mapcount for all large folios in the first subpage. >>>>>> >>>>>> The total mapcount is what we actually want to know in folio_mapcount() >>>>>> and it is also sufficient for implementing folio_mapped(). This also >>>>>> gets rid of any "raceiness" concerns as expressed in >>>>>> folio_total_mapcount(). >>>>>> >>>>>> With sub-PMD THP becoming more important and things looking promising >>>>>> that we will soon get support for such anon THP, we want to avoid looping >>>>>> over all pages of a folio just to calculate the total mapcount. Further, >>>>>> we may soon want to use the total mapcount in other context more >>>>>> frequently, so prepare for reading it efficiently and atomically. >>>>>> >>>>>> Make room for the total mapcount in page[1] of the folio by moving >>>>>> _nr_pages_mapped to page[2] of the folio: it is not applicable to hugetlb >>>>>> -- and with the total mapcount in place probably also not desirable even >>>>>> if PMD-mappable hugetlb pages could get PTE-mapped at some point -- so we >>>>>> can overlay the hugetlb fields. >>>>>> >>>>>> Note that we currently don't expect any order-1 compound pages / THP in >>>>>> rmap code, and that such support is not planned. If ever desired, we could >>>>>> move the compound mapcount to another page, because it only applies to >>>>>> PMD-mappable folios that are definitely larger. Let's avoid consuming >>>>>> more space elsewhere for now -- we might need more space for a different >>>>>> purpose in some subpages soon. >>>>>> >>>>>> Maintain the total mapcount also for hugetlb pages. Use the total mapcount >>>>>> to implement folio_mapcount(), total_mapcount(), folio_mapped() and >>>>>> page_mapped(). >>>>>> >>>>>> We can now get rid of folio_total_mapcount() and >>>>>> folio_large_is_mapped(), by just inlining reading of the total mapcount. >>>>>> >>>>>> _nr_pages_mapped is now only used in rmap code, so not accidentially >>>>>> externally where it might be used on arbitrary order-1 pages. The remaining >>>>>> usage is: >>>>>> >>>>>> (1) Detect how to adjust stats: NR_ANON_MAPPED and NR_FILE_MAPPED >>>>>>     -> If we would account the total folio as mapped when mapping a >>>>>>        page (based on the total mapcount), we could remove that usage. >>>>>> >>>>>> (2) Detect when to add a folio to the deferred split queue >>>>>>     -> If we would apply a different heuristic, or scan using the rmap on >>>>>>        the memory reclaim path for partially mapped anon folios to >>>>>>        split them, we could remove that usage as well. >>>>>> >>>>>> So maybe, we can simplify things in the future and remove >>>>>> _nr_pages_mapped. For now, leave these things as they are, they need more >>>>>> thought. Hugh really did a nice job implementing that precise tracking >>>>>> after all. >>>>>> >>>>>> Note: Not adding order-1 sanity checks to the file_rmap functions for >>>>>>          now. For anon pages, they are certainly not required right now. 
>>>>>> >>>>>> Cc: Andrew Morton <akpm@linux-foundation.org> >>>>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com> >>>>>> Cc: Hugh Dickins <hughd@google.com> >>>>>> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> >>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com> >>>>>> Cc: Yin Fengwei <fengwei.yin@intel.com> >>>>>> Cc: Yang Shi <shy828301@gmail.com> >>>>>> Cc: Zi Yan <ziy@nvidia.com> >>>>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>>> >>>>> Other than the nits and query on zeroing _total_mapcount below, LGTM. If >>>>> zeroing >>>>> is correct: >>>>> >>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> >>>> >>>> Thanks for the review! >>>> >>>> [...] >>>> >>>>>>         static inline int total_mapcount(struct page *page) >>>>> >>>>> nit: couldn't total_mapcount() just be implemented as a wrapper around >>>>> folio_mapcount()? What's the benefit of PageCompound() check instead of just >>>>> getting the folio and checking if it's large? i.e: >>>> >>>> Good point, let me take a look tomorrow if the compiler can optimize in >>>> both cases equally well. >>> >>> I checked by adjusting total_mapcount(): >>> >>> Before: >>> >>>           if (PageTransHuge(page) && total_mapcount(page) > 1) >>> ffffffff81411931:      4c 89 e7               mov   %r12,%rdi >>> ffffffff81411934:      e8 f7 b1 ff ff         call  ffffffff8140cb30 >>> <PageTransHuge> >>> ffffffff81411939:      85 c0                  test  %eax,%eax >>> ffffffff8141193b:      74 29                  je    ffffffff81411966 >>> <migrate_misplaced_page+0x166> >>> ffffffff8141193d:      49 8b 04 24            mov   (%r12),%rax >>>           return test_bit(PG_head, &page->flags) || >>> ffffffff81411941:      a9 00 00 01 00         test  $0x10000,%eax >>> ffffffff81411946:      0f 85 1f 01 00 00      jne   ffffffff81411a6b >>> <migrate_misplaced_page+0x26b> >>>                  READ_ONCE(page->compound_head) & 1; >>> ffffffff8141194c:      49 8b 44 24 08         mov   0x8(%r12),%rax >>>           return test_bit(PG_head, &page->flags) || >>> ffffffff81411951:      a8 01                  test  $0x1,%al >>> ffffffff81411953:      0f 85 12 01 00 00      jne   ffffffff81411a6b >>> <migrate_misplaced_page+0x26b> >>> ffffffff81411959:      41 8b 44 24 30         mov   0x30(%r12),%eax >>>                   return atomic_read(&page->_mapcount) + 1; >>> ffffffff8141195e:      83 c0 01               add   $0x1,%eax >>> ffffffff81411961:      83 f8 01               cmp   $0x1,%eax >>> ffffffff81411964:      7f 77                  jg    ffffffff814119dd >>> <migrate_misplaced_page+0x1dd> >>> >>> So a total of 10 instructions after handling the mov/call/test/je for >>> PageTransHuge(). 
>>> >>> After: >>> >>>           if (PageTransHuge(page) && total_mapcount(page) > 1) >>> ffffffff81411931:      4c 89 e7               mov   %r12,%rdi >>> ffffffff81411934:      e8 f7 b1 ff ff         call  ffffffff8140cb30 >>> <PageTransHuge> >>> ffffffff81411939:      85 c0                  test  %eax,%eax >>> ffffffff8141193b:      74 2f                  je    ffffffff8141196c >>> <migrate_misplaced_page+0x16c> >>>           unsigned long head = READ_ONCE(page->compound_head); >>> ffffffff8141193d:      49 8b 44 24 08         mov   0x8(%r12),%rax >>>           if (unlikely(head & 1)) >>> ffffffff81411942:      a8 01                  test  $0x1,%al >>> ffffffff81411944:      0f 85 fc 05 00 00      jne   ffffffff81411f46 >>> <migrate_misplaced_page+0x746> >>> ffffffff8141194a:      0f 1f 44 00 00         nopl  0x0(%rax,%rax,1) >>>                   return page; >>> ffffffff8141194f:      4c 89 e0               mov   %r12,%rax >>> ffffffff81411952:      48 8b 10               mov   (%rax),%rdx >>>           if (likely(!folio_test_large(folio))) >>> ffffffff81411955:      f7 c2 00 00 01 00      test  $0x10000,%edx >>> ffffffff8141195b:      0f 85 da 05 00 00      jne   ffffffff81411f3b >>> <migrate_misplaced_page+0x73b> >>> ffffffff81411961:      8b 40 30               mov   0x30(%rax),%eax >>>                   return atomic_read(&folio->_mapcount) + 1; >>> ffffffff81411964:      83 c0 01               add   $0x1,%eax >>> ffffffff81411967:      83 f8 01               cmp   $0x1,%eax >>> ffffffff8141196a:      7f 77                  jg    ffffffff814119e3 >>> <migrate_misplaced_page+0x1e3> >>> >>> So a total of 11 (+ 1 NOP) instructions after handling the mov/call/test/je >>> for PageTransHuge(). >>> >>> Essentially one more MOV instruction. >>> >>> I guess nobody really cares :) >>> >> >> Also, let's simply do: >> >> static inline bool folio_mapped(struct folio *folio) >> { >>     folio_mapcount(folio) > 1; > >> 0, obviously :) > ;-) All sounds good to me!
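Putting the two corrections together, the helper being converged on would look something like this sketch (the stored counter is -1-based, so folio_mapcount() returns 0 for an unmapped folio):

static inline bool folio_mapped(struct folio *folio)
{
	/* folio_mapcount() is 0 when nothing maps the folio. */
	return folio_mapcount(folio) > 0;
}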
On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: > > For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is > > really significant (judging from my experience on managing PageAnonExclusive > > using set_bit/test_bit/clear_bit when (un)mapping anon pages). > > > > As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be > > batching where possible (and Ryan is working on some more rmap batching). > > Yes, I've just posted [1] which batches the rmap removal. That would allow you > to convert the per-page atomic_dec() into a (usually) single per-large-folio > atomic_sub(). > > [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ Right, that'll definitely make more sense, thanks for the link; I'd be very happy to read more later (finally I got some free time recently..). But then does it mean David's patch can be attached at the end instead of proposed separately and early? I was asking mostly because I read it as a standalone patch first, and honestly I don't know the effect. It's based on not only the added atomic ops itself, but also the field changes. For example, this patch moves Hugh's _nr_pages_mapped into the 2nd tail page, I think it means for any rmap change of any small page of a huge one we'll need to start touching one more 64B cacheline on x86. I really have no idea what does it mean for especially a large SMP: see 292648ac5cf1 on why I had an impression of that. But I've no enough experience or clue to prove it a problem either, maybe would be interesting to measure the time needed for some pte-mapped loops? E.g., something like faulting in a thp, then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it takes before/after this patch. When looking at this, I actually found one thing that is slightly confusing, not directly relevant to your patch, but regarding the reuse of tail page 1 on offset 24 bytes. Current it's Hugh's _nr_pages_mapped, and you're proposing to replace it with the total mapcount: atomic_t _nr_pages_mapped; /* 88 4 */ Now my question is.. isn't byte 24 of tail page 1 used for keeping a poisoned mapping? See prep_compound_tail() where it has: p->mapping = TAIL_MAPPING; While here mapping is, afaict, also using offset 24 of the tail page 1: struct address_space * mapping; /* 24 8 */ I hope I did a wrong math somewhere, though.
On 10.08.23 19:15, Peter Xu wrote: > On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: >>> For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is >>> really significant (judging from my experience on managing PageAnonExclusive >>> using set_bit/test_bit/clear_bit when (un)mapping anon pages). >>> >>> As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be >>> batching where possible (and Ryan is working on some more rmap batching). >> >> Yes, I've just posted [1] which batches the rmap removal. That would allow you >> to convert the per-page atomic_dec() into a (usually) single per-large-folio >> atomic_sub(). >> >> [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ > > Right, that'll definitely make more sense, thanks for the link; I'd be very > happy to read more later (finally I got some free time recently..). But > then does it mean David's patch can be attached at the end instead of > proposed separately and early? Not in my opinion. Batching rmap makes sense even without this change, and this change makes sense even without batching. > > I was asking mostly because I read it as a standalone patch first, and > honestly I don't know the effect. It's based on not only the added atomic > ops itself, but also the field changes. > > For example, this patch moves Hugh's _nr_pages_mapped into the 2nd tail > page, I think it means for any rmap change of any small page of a huge one > we'll need to start touching one more 64B cacheline on x86. I really have > no idea what does it mean for especially a large SMP: see 292648ac5cf1 on > why I had an impression of that. But I've no enough experience or clue to > prove it a problem either, maybe would be interesting to measure the time > needed for some pte-mapped loops? E.g., something like faulting in a thp, Okay, so your speculation right now is: 1) The change in cacheline might be problematic. 2) The additional atomic operation might be problematic. > then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it > takes before/after this patch. I can certainly try getting some numbers on that. If you're aware of other micro-benchmarks that would likely notice slower pte-mapping of THPs, please let me know. > > When looking at this, I actually found one thing that is slightly > confusing, not directly relevant to your patch, but regarding the reuse of > tail page 1 on offset 24 bytes. Current it's Hugh's _nr_pages_mapped, > and you're proposing to replace it with the total mapcount: > > atomic_t _nr_pages_mapped; /* 88 4 */ > > Now my question is.. isn't byte 24 of tail page 1 used for keeping a > poisoned mapping? See prep_compound_tail() where it has: > > p->mapping = TAIL_MAPPING; > > While here mapping is, afaict, also using offset 24 of the tail page 1: > > struct address_space * mapping; /* 24 8 */ > > I hope I did a wrong math somewhere, though. > I think your math is correct. prep_compound_head() is called after prep_compound_tail(), so prep_compound_head() wins. In __split_huge_page_tail() there is a VM_BUG_ON_PAGE() that explains the situation: /* ->mapping in first and second tail page is replaced by other uses */ VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, page_tail); Thanks for raising that, I had to look into that myself.
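A condensed sketch of the ordering David describes here, paraphrasing the compound-page preparation flow rather than quoting the upstream code verbatim:

static void prep_compound_page_sketch(struct page *page, unsigned int order)
{
	unsigned int i, nr_pages = 1 << order;

	__SetPageHead(page);
	for (i = 1; i < nr_pages; i++)
		prep_compound_tail(page, i);	/* poisons page[i].mapping with TAIL_MAPPING */

	/* Runs last, so the folio fields it initializes in page[1]/page[2] win. */
	prep_compound_head(page, order);
}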
On 10/08/2023 18:47, David Hildenbrand wrote: > On 10.08.23 19:15, Peter Xu wrote: >> On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: >>>> For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is >>>> really significant (judging from my experience on managing PageAnonExclusive >>>> using set_bit/test_bit/clear_bit when (un)mapping anon pages). >>>> >>>> As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be >>>> batching where possible (and Ryan is working on some more rmap batching). >>> >>> Yes, I've just posted [1] which batches the rmap removal. That would allow you >>> to convert the per-page atomic_dec() into a (usually) single per-large-folio >>> atomic_sub(). >>> >>> [1] >>> https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ >> >> Right, that'll definitely make more sense, thanks for the link; I'd be very >> happy to read more later (finally I got some free time recently..). But >> then does it mean David's patch can be attached at the end instead of >> proposed separately and early? > > Not in my opinion. Batching rmap makes sense even without this change, and this > change makes sense even without batching. FWIW, I agree that my series and David's series should be treated independently. There is independent value in both. It's also worth pointing out that with my series, the amount of batching you see in practice still depends on large folios being mapped, which isn't quite the common case yet.
On Thu, Aug 10, 2023 at 07:47:35PM +0200, David Hildenbrand wrote: > On 10.08.23 19:15, Peter Xu wrote: > > On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote: > > > > For PTE-mapped THP, it might be a bit bigger noise, although I doubt it is > > > > really significant (judging from my experience on managing PageAnonExclusive > > > > using set_bit/test_bit/clear_bit when (un)mapping anon pages). > > > > > > > > As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be > > > > batching where possible (and Ryan is working on some more rmap batching). > > > > > > Yes, I've just posted [1] which batches the rmap removal. That would allow you > > > to convert the per-page atomic_dec() into a (usually) single per-large-folio > > > atomic_sub(). > > > > > > [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@arm.com/ > > > > Right, that'll definitely make more sense, thanks for the link; I'd be very > > happy to read more later (finally I got some free time recently..). But > > then does it mean David's patch can be attached at the end instead of > > proposed separately and early? > > Not in my opinion. Batching rmap makes sense even without this change, and > this change makes sense even without batching. > > > > > I was asking mostly because I read it as a standalone patch first, and > > honestly I don't know the effect. It's based on not only the added atomic > > ops itself, but also the field changes. > > > > For example, this patch moves Hugh's _nr_pages_mapped into the 2nd tail > > page, I think it means for any rmap change of any small page of a huge one > > we'll need to start touching one more 64B cacheline on x86. I really have > > no idea what does it mean for especially a large SMP: see 292648ac5cf1 on > > why I had an impression of that. But I've no enough experience or clue to > > prove it a problem either, maybe would be interesting to measure the time > > needed for some pte-mapped loops? E.g., something like faulting in a thp, > > Okay, so your speculation right now is: > > 1) The change in cacheline might be problematic. > > 2) The additional atomic operation might be problematic. > > > then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it > > takes before/after this patch. > > I can certainly try getting some numbers on that. If you're aware of other > micro-benchmarks that would likely notice slower pte-mapping of THPs, please > let me know. Thanks. > > > > > When looking at this, I actually found one thing that is slightly > > confusing, not directly relevant to your patch, but regarding the reuse of > > tail page 1 on offset 24 bytes. Current it's Hugh's _nr_pages_mapped, > > and you're proposing to replace it with the total mapcount: > > > > atomic_t _nr_pages_mapped; /* 88 4 */ > > > > Now my question is.. isn't byte 24 of tail page 1 used for keeping a > > poisoned mapping? See prep_compound_tail() where it has: > > > > p->mapping = TAIL_MAPPING; > > > > While here mapping is, afaict, also using offset 24 of the tail page 1: > > > > struct address_space * mapping; /* 24 8 */ > > > > I hope I did a wrong math somewhere, though. > > > > I think your math is correct. > > prep_compound_head() is called after prep_compound_tail(), so > prep_compound_head() wins. 
> > In __split_huge_page_tail() there is a VM_BUG_ON_PAGE() that explains the > situation: > > /* ->mapping in first and second tail page is replaced by other uses */ > VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, > page_tail); > > Thanks for raising that, I had to look into that myself. It's so confusing so I did try to document them a bit myself, then I found maybe I should just post a patch for it and I just did: https://lore.kernel.org/r/20230810204944.53471-1-peterx@redhat.com It'll conflict with yours, but I marked RFC so it's never anything urgent, but maybe that'll be helpful already for you to introduce any new fields like total_mapcounts. AFAICS if that patch was all correct (while I'm not yet sure..), you can actually fit your new total mapcount field into page 1 so even avoid the extra cacheline access. You can have a look: the trick is refcount for tail page 1 is still seems to be free on 32 bits (if that was your worry before). Then it'll be very nice if to keep Hugh's counter all in tail 1.
On Thu, Aug 10, 2023 at 10:37:04AM +0200, David Hildenbrand wrote: > On 10.08.23 05:25, Matthew Wilcox wrote: > > On Wed, Aug 09, 2023 at 05:23:46PM -0400, Peter Xu wrote: > > > Hi, David, > > > > > > Some pure questions below.. > > > > > > On Wed, Aug 09, 2023 at 10:32:56AM +0200, David Hildenbrand wrote: > > > > Let's track the total mapcount for all large folios in the first subpage. > > > > > > > > The total mapcount is what we actually want to know in folio_mapcount() > > > > and it is also sufficient for implementing folio_mapped(). This also > > > > gets rid of any "raceiness" concerns as expressed in > > > > folio_total_mapcount(). > > > > > > Any more information for that "raciness" described here? > > > > UTSL. > > > > /* > > * Add all the PTE mappings of those pages mapped by PTE. > > * Limit the loop to folio_nr_pages_mapped()? > > * Perhaps: given all the raciness, that may be a good or a bad idea. > > */ > > > > Yes, that comment from Hugh primarily discusses how we could possibly > optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the > iterations would be racy. As far as I can see, there are cases where "it > would be certainly a bad idea" :) Is the race described about mapcount being changed right after it's read? Are you aware of anything specific that will be broken, and will be fixed with this patch? I assume mapcount==1 will be very special in this case when e.g. holding a pgtable lock, other than that I won't be surprised if mapcount changes in parallel. But I must confess I don't really have any thorough digests on this whole matter. > > > In the other comment in that function, it's also made clear what the > traditional behavior with PMD-mappable THP was "In the common case, avoid > the loop when no pages mapped by PTE", which will no longer hold with > sub-PMD THP. Having a total mapcount does sound helpful if partial folio is common indeed. I'm curious whether that'll be so common after the large anon folio work - isn't it be sad if partial folio will be a norm? It sounds to me that's the case when small page sizes should be used.. and it's prone to waste?
On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > AFAICS if that patch was all correct (while I'm not yet sure..), you can > actually fit your new total mapcount field into page 1 so even avoid the > extra cacheline access. You can have a look: the trick is refcount for > tail page 1 is still seems to be free on 32 bits (if that was your worry > before). Then it'll be very nice if to keep Hugh's counter all in tail 1. No, refcount must be 0 on all tail pages. We rely on this in many places in the MM.
On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: > > Yes, that comment from Hugh primarily discusses how we could possibly > > optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the > > iterations would be racy. As far as I can see, there are cases where "it > > would be certainly a bad idea" :) > > Is the race described about mapcount being changed right after it's read? > Are you aware of anything specific that will be broken, and will be fixed > with this patch? The problem is that people check the mapcount while holding no locks; not the PTL, not the page lock. So it's an unfixable race. > Having a total mapcount does sound helpful if partial folio is common > indeed. > > I'm curious whether that'll be so common after the large anon folio work - > isn't it be sad if partial folio will be a norm? It sounds to me that's > the case when small page sizes should be used.. and it's prone to waste? The problem is that entire_mapcount isn't really entire_mapcount. It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, but it gets gnarly when people do partial unmaps. So the _usual_ case ends up touching every struct page. Which sucks. Also it's one of the things which stands in the way of shrinking struct page. But it's kind of annoying to explain all of this to you individually. There have been hundreds of emails about it over the last months on this mailing list. It would be nice if you could catch up instead of jumping in.
On 10.08.23 23:54, Matthew Wilcox wrote: > On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: >>> Yes, that comment from Hugh primarily discusses how we could possibly >>> optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the >>> iterations would be racy. As far as I can see, there are cases where "it >>> would be certainly a bad idea" :) >> >> Is the race described about mapcount being changed right after it's read? >> Are you aware of anything specific that will be broken, and will be fixed >> with this patch? > > The problem is that people check the mapcount while holding no locks; > not the PTL, not the page lock. So it's an unfixable race. > >> Having a total mapcount does sound helpful if partial folio is common >> indeed. >> >> I'm curious whether that'll be so common after the large anon folio work - >> isn't it be sad if partial folio will be a norm? It sounds to me that's >> the case when small page sizes should be used.. and it's prone to waste? > > The problem is that entire_mapcount isn't really entire_mapcount. > It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, > but it gets gnarly when people do partial unmaps. So the _usual_ case > ends up touching every struct page. Which sucks. Also it's one of the > things which stands in the way of shrinking struct page. Right, so one current idea is to have a single total_mapcount and look into removing the subpage mapcounts (which will require first removing _nr_pages_mapped, because that's still one of the important users). Until we get there, also rmap code has to do eventually "more tracking" and might, unfortunately, end up slower. > > But it's kind of annoying to explain all of this to you individually. > There have been hundreds of emails about it over the last months on > this mailing list. It would be nice if you could catch up instead of > jumping in. To be fair, a lot of the details are not readily available and in the heads of selected people :) Peter, if you're interested, we can discuss the current plans, issues and ideas offline!
>> Okay, so your speculation right now is: >> >> 1) The change in cacheline might be problematic. >> >> 2) The additional atomic operation might be problematic. >> >>> then measure the split (by e.g. mprotect() at offset 1M on a 4K?) time it >>> takes before/after this patch. >> >> I can certainly try getting some numbers on that. If you're aware of other >> micro-benchmarks that would likely notice slower pte-mapping of THPs, please >> let me know. > > Thanks. If I effectively only measure the real PMD->PTE remapping (only measure the for loop that mprotects() one 4k page inside each of 512 THPs) without any of the mmap+populate+munmap, I can certainly measure a real difference. I briefly looked at some perf data across the overall benchmark runtime. For page_remove_rmap(), the new atomic_dec() doesn't seem to be significant. Data indicates that it's significantly less relevant than a later atomic_add_negative(). For page_add_anon_rmap(), it's a bit fuzzy. Definitely, the atomic_inc_return_relaxed(mapped) seems to stick out, but I cannot rule out that the atomic_add() also plays a role. The PMD->PTE remapping effectively does (__split_huge_pmd_locked()) for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) { ... page_add_anon_rmap(page + i, vma, addr, RMAP_NONE); ... } ... page_remove_rmap(page, vma, true); Inside that loop we're repeatedly accessing the total_mapcount and _nr_pages_mapped. So my best guess would have been that both are already hot in the cache. RMAP batching certainly sounds like a good idea for __split_huge_pmd_locked(), independent of this patch. What would probably also be interesting is observing what happens when we unmap a single PTE of a THP and cannot batch, to see if page_remove_rmap() matters at a bigger scale. I'll do some more digging tomorrow to clarify some details. Running some kernel compile tests with thp=always at least didn't reveal any surprises so far.
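For anyone wanting to reproduce the measurement, here is a userspace sketch of the split micro-benchmark discussed above, assuming 2 MiB THPs and reusing the 512-THP / 1 MiB-offset numbers mentioned in the thread; it only times the PMD->PTE split loop and is not meant to be rigorous:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define THP_SIZE	(2UL << 20)
#define NR_THPS		512UL

int main(void)
{
	size_t len = NR_THPS * THP_SIZE;
	char *raw = mmap(NULL, len + THP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Align to the THP size so each 2 MiB chunk can be PMD-mapped. */
	char *buf = (char *)(((unsigned long)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise");	/* best effort; THP might be disabled */
	memset(buf, 1, len);		/* populate, ideally as PMD-mapped THPs */

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned long i = 0; i < NR_THPS; i++) {
		/* mprotect()ing one 4k page forces the PMD mapping to be split. */
		char *page = buf + i * THP_SIZE + (1UL << 20);
		if (mprotect(page, 4096, PROT_READ)) {
			perror("mprotect");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("split loop: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
	munmap(raw, len + THP_SIZE);
	return 0;
}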
On 10.08.23 23:48, Matthew Wilcox wrote: > On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >> AFAICS if that patch was all correct (while I'm not yet sure..), you can >> actually fit your new total mapcount field into page 1 so even avoid the >> extra cacheline access. You can have a look: the trick is refcount for >> tail page 1 is still seems to be free on 32 bits (if that was your worry >> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > > No, refcount must be 0 on all tail pages. We rely on this in many places > in the MM. Very right. One could theoretically 1) Move the compound/entire mapcount to page[2] 2) Make hugetlb stop using the entire mapcount and only the total mapcount. 3) Then leave total_mapcount and nr_pages_mapped in page[1] 4) Make page_mapcount() use total_mapcount for hugetlb. When (un)mapping a PMD-mapped THP, we would go to page[2]. Otherwise, only page[1]. The entire mapcount, similarly to nr_pages_mapped, primarily serves to get memory stats right; well, and to implement page_mapcount() for THP. But I'm not 100% sure yet if the overhead from having nr_pages_mapped in page[2] is significant enough at this point.
On Thu, Aug 10, 2023 at 11:59:25PM +0200, David Hildenbrand wrote: > On 10.08.23 23:54, Matthew Wilcox wrote: > > On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: > > > > Yes, that comment from Hugh primarily discusses how we could possibly > > > > optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the > > > > iterations would be racy. As far as I can see, there are cases where "it > > > > would be certainly a bad idea" :) > > > > > > Is the race described about mapcount being changed right after it's read? > > > Are you aware of anything specific that will be broken, and will be fixed > > > with this patch? > > > > The problem is that people check the mapcount while holding no locks; > > not the PTL, not the page lock. So it's an unfixable race. > > > > > Having a total mapcount does sound helpful if partial folio is common > > > indeed. > > > > > > I'm curious whether that'll be so common after the large anon folio work - > > > isn't it be sad if partial folio will be a norm? It sounds to me that's > > > the case when small page sizes should be used.. and it's prone to waste? > > > > The problem is that entire_mapcount isn't really entire_mapcount. > > It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, > > but it gets gnarly when people do partial unmaps. So the _usual_ case > > ends up touching every struct page. Which sucks. Also it's one of the > > things which stands in the way of shrinking struct page. > > Right, so one current idea is to have a single total_mapcount and look into > removing the subpage mapcounts (which will require first removing > _nr_pages_mapped, because that's still one of the important users). > > Until we get there, also rmap code has to do eventually "more tracking" and > might, unfortunately, end up slower. > > > > > But it's kind of annoying to explain all of this to you individually. > > There have been hundreds of emails about it over the last months on > > this mailing list. It would be nice if you could catch up instead of > > jumping in. > > To be fair, a lot of the details are not readily available and in the heads > of selected people :) > > Peter, if you're interested, we can discuss the current plans, issues and > ideas offline! Thanks for offering help, David. Personally I still am unclear yet on why entire_mapcount cannot be used as full-folio mapcounts, and why "partial unmap" can happen a lot (I don't expect), but yeah I can try to catch up to educate myself first. The only issue regarding an offline sync-up is that even if David will help Peter on catching up the bits, it'll not scale when another Peter2 had the same question.. So David, rather than I waste your time on helping one person, let me try to catch up with the public threads - I'm not sure how far I can go myself; otoh thread links will definitely be helpful to be replied here, so anyone else can reference too. 
I collected a list (which can be enriched) of few threads that might be related, just in case helpful to anyone besides myself: [PATCH 0/2] don't use mapcount() to check large folio sharing https://lore.kernel.org/r/20230728161356.1784568-1-fengwei.yin@intel.com [PATCH v1-v2 0/3] support large folio for mlock https://lore.kernel.org/r/20230728070929.2487065-1-fengwei.yin@intel.com https://lore.kernel.org/r/20230809061105.3369958-1-fengwei.yin@intel.com [PATCH v1 0/4] Optimize mmap_exit for large folios https://lore.kernel.org/r/20230810103332.3062143-1-ryan.roberts@arm.com [PATCH v4-v5 0/5] variable-order, large folios for anonymous memory https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/ https://lore.kernel.org/r/20230810142942.3169679-1-ryan.roberts@arm.com [PATCH v3-v4 0/3] Optimize large folio interaction with deferred split (I assumed Ryan's this one goes into the previous set v5 finally, so just the discussions as reference) https://lore.kernel.org/r/20230720112955.643283-1-ryan.roberts@arm.com https://lore.kernel.org/r/20230727141837.3386072-1-ryan.roberts@arm.com [RFC PATCH v2 0/4] fix large folio for madvise_cold_or_pageout() https://lore.kernel.org/r/20230721094043.2506691-1-fengwei.yin@intel.com I'm not sure how far I'll go; maybe I'll start working on something else before I finish all of them. I'll see.. Not allowing people to jump in will definitely cause less interactions and less involvement/open-ness for the mm community, as sometimes people can't easily judge when it's proper to jump in. IMHO the ideal solution is always keep all discussions public (either meetings with recordings, or shared online documents, always use on-list discussions, etc.), then share the links.
On 11 Aug 2023, at 11:03, Peter Xu wrote: > On Thu, Aug 10, 2023 at 11:59:25PM +0200, David Hildenbrand wrote: >> On 10.08.23 23:54, Matthew Wilcox wrote: >>> On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: >>>>> Yes, that comment from Hugh primarily discusses how we could possibly >>>>> optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the >>>>> iterations would be racy. As far as I can see, there are cases where "it >>>>> would be certainly a bad idea" :) >>>> >>>> Is the race described about mapcount being changed right after it's read? >>>> Are you aware of anything specific that will be broken, and will be fixed >>>> with this patch? >>> >>> The problem is that people check the mapcount while holding no locks; >>> not the PTL, not the page lock. So it's an unfixable race. >>> >>>> Having a total mapcount does sound helpful if partial folio is common >>>> indeed. >>>> >>>> I'm curious whether that'll be so common after the large anon folio work - >>>> isn't it be sad if partial folio will be a norm? It sounds to me that's >>>> the case when small page sizes should be used.. and it's prone to waste? >>> >>> The problem is that entire_mapcount isn't really entire_mapcount. >>> It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, >>> but it gets gnarly when people do partial unmaps. So the _usual_ case >>> ends up touching every struct page. Which sucks. Also it's one of the >>> things which stands in the way of shrinking struct page. >> >> Right, so one current idea is to have a single total_mapcount and look into >> removing the subpage mapcounts (which will require first removing >> _nr_pages_mapped, because that's still one of the important users). >> >> Until we get there, also rmap code has to do eventually "more tracking" and >> might, unfortunately, end up slower. >> >>> >>> But it's kind of annoying to explain all of this to you individually. >>> There have been hundreds of emails about it over the last months on >>> this mailing list. It would be nice if you could catch up instead of >>> jumping in. >> >> To be fair, a lot of the details are not readily available and in the heads >> of selected people :) >> >> Peter, if you're interested, we can discuss the current plans, issues and >> ideas offline! > > Thanks for offering help, David. > > Personally I still am unclear yet on why entire_mapcount cannot be used as > full-folio mapcounts, and why "partial unmap" can happen a lot (I don't > expect), but yeah I can try to catch up to educate myself first. Separate entire_mapcount and per-page mapcount are needed to maintain precise NR_{ANON,FILE}_MAPPED and NR_ANON_THPS. I wrote some explanation (third paragraph) at: https://lore.kernel.org/linux-mm/A28053D6-E158-4726-8BE1-B9F4960AD570@nvidia.com/. Let me know if it helps. > > The only issue regarding an offline sync-up is that even if David will help > Peter on catching up the bits, it'll not scale when another Peter2 had the > same question.. So David, rather than I waste your time on helping one > person, let me try to catch up with the public threads - I'm not sure how > far I can go myself; otoh thread links will definitely be helpful to be > replied here, so anyone else can reference too. 
I collected a list (which > can be enriched) of few threads that might be related, just in case helpful > to anyone besides myself: > > [PATCH 0/2] don't use mapcount() to check large folio sharing > https://lore.kernel.org/r/20230728161356.1784568-1-fengwei.yin@intel.com > > [PATCH v1-v2 0/3] support large folio for mlock > https://lore.kernel.org/r/20230728070929.2487065-1-fengwei.yin@intel.com > https://lore.kernel.org/r/20230809061105.3369958-1-fengwei.yin@intel.com > > [PATCH v1 0/4] Optimize mmap_exit for large folios > https://lore.kernel.org/r/20230810103332.3062143-1-ryan.roberts@arm.com > > [PATCH v4-v5 0/5] variable-order, large folios for anonymous memory > https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/ > https://lore.kernel.org/r/20230810142942.3169679-1-ryan.roberts@arm.com > > [PATCH v3-v4 0/3] Optimize large folio interaction with deferred split > (I assumed Ryan's this one goes into the previous set v5 finally, so just > the discussions as reference) > https://lore.kernel.org/r/20230720112955.643283-1-ryan.roberts@arm.com > https://lore.kernel.org/r/20230727141837.3386072-1-ryan.roberts@arm.com > > [RFC PATCH v2 0/4] fix large folio for madvise_cold_or_pageout() > https://lore.kernel.org/r/20230721094043.2506691-1-fengwei.yin@intel.com > > I'm not sure how far I'll go; maybe I'll start working on something else > before I finish all of them. I'll see.. > > Not allowing people to jump in will definitely cause less interactions and > less involvement/open-ness for the mm community, as sometimes people can't > easily judge when it's proper to jump in. > > IMHO the ideal solution is always keep all discussions public (either > meetings with recordings, or shared online documents, always use on-list > discussions, etc.), then share the links. > > -- > Peter Xu -- Best Regards, Yan, Zi
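Zi Yan's point above about precise NR_{ANON,FILE}_MAPPED accounting can be made concrete with a deliberately simplified sketch (this is not the real mm/rmap.c logic, and it ignores the mixed PMD-plus-PTE cases that _nr_pages_mapped exists to handle): the per-node counters track how many pages became mapped, and for a PTE mapping that can only be decided from the per-page _mapcount transition, not from a folio-wide total.

#include <linux/mm.h>

/*
 * Simplified sketch, not the real rmap code: how many pages become
 * newly mapped by this map operation?  This is the number that feeds
 * NR_ANON_MAPPED / NR_FILE_MAPPED (and NR_ANON_THPS for PMD maps).
 */
static int pages_newly_mapped(struct folio *folio, struct page *page,
			      bool compound)
{
	if (compound)
		/* First PMD mapping: all pages of the folio become mapped. */
		return atomic_inc_and_test(&folio->_entire_mapcount) ?
		       folio_nr_pages(folio) : 0;

	/*
	 * PTE mapping: only this page, and only on its -1 -> 0 transition.
	 * A folio-wide total mapcount cannot tell whether *this* page was
	 * already mapped somewhere else.
	 */
	return atomic_inc_and_test(&page->_mapcount) ? 1 : 0;
}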
On 11.08.23 17:03, Peter Xu wrote: > On Thu, Aug 10, 2023 at 11:59:25PM +0200, David Hildenbrand wrote: >> On 10.08.23 23:54, Matthew Wilcox wrote: >>> On Thu, Aug 10, 2023 at 05:48:19PM -0400, Peter Xu wrote: >>>>> Yes, that comment from Hugh primarily discusses how we could possibly >>>>> optimize the loop, and if relying on folio_nr_pages_mapped() to reduce the >>>>> iterations would be racy. As far as I can see, there are cases where "it >>>>> would be certainly a bad idea" :) >>>> >>>> Is the race described about mapcount being changed right after it's read? >>>> Are you aware of anything specific that will be broken, and will be fixed >>>> with this patch? >>> >>> The problem is that people check the mapcount while holding no locks; >>> not the PTL, not the page lock. So it's an unfixable race. >>> >>>> Having a total mapcount does sound helpful if partial folio is common >>>> indeed. >>>> >>>> I'm curious whether that'll be so common after the large anon folio work - >>>> isn't it be sad if partial folio will be a norm? It sounds to me that's >>>> the case when small page sizes should be used.. and it's prone to waste? >>> >>> The problem is that entire_mapcount isn't really entire_mapcount. >>> It's pmd_mapcount. I have had thoughts about using it as entire_mapcount, >>> but it gets gnarly when people do partial unmaps. So the _usual_ case >>> ends up touching every struct page. Which sucks. Also it's one of the >>> things which stands in the way of shrinking struct page. >> >> Right, so one current idea is to have a single total_mapcount and look into >> removing the subpage mapcounts (which will require first removing >> _nr_pages_mapped, because that's still one of the important users). >> >> Until we get there, also rmap code has to do eventually "more tracking" and >> might, unfortunately, end up slower. >> >>> >>> But it's kind of annoying to explain all of this to you individually. >>> There have been hundreds of emails about it over the last months on >>> this mailing list. It would be nice if you could catch up instead of >>> jumping in. >> >> To be fair, a lot of the details are not readily available and in the heads >> of selected people :) >> >> Peter, if you're interested, we can discuss the current plans, issues and >> ideas offline! > > Thanks for offering help, David. > > Personally I still am unclear yet on why entire_mapcount cannot be used as > full-folio mapcounts, and why "partial unmap" can happen a lot (I don't > expect), but yeah I can try to catch up to educate myself first. Using fork() is the easiest way. mremap(), MADV_DONTNEED, munmap, ... You might end up having to scan page tables and/or the rmap to figure out which mapcount to adjust, which we should absolutely avoid. > The only issue regarding an offline sync-up is that even if David will help > Peter on catching up the bits, it'll not scale when another Peter2 had the > same question.. So David, rather than I waste your time on helping one > person, let me try to catch up with the public threads - I'm not sure how > far I can go myself; Sure. But note that it's a moving target, and some discussions have been going on for a long time. I recall there were various discussions, including LSF/MM, mm biweekly meeting, and more. So even if you scan through all that, you might either get outdated or incomplete information.
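To make the "partial unmap is not exotic" point concrete, here is a minimal userspace sketch (assumptions: THP is enabled, and MADV_HUGEPAGE is only a hint, so backing by a large folio is not guaranteed) of how ordinary operations end up mapping only part of a large folio; fork() followed by the child unmapping or MADV_DONTNEED'ing part of the range reaches the same state:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M (2UL << 20)

int main(void)
{
	/* Over-allocate so a 2 MiB-aligned chunk can be carved out. */
	char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (raw == MAP_FAILED)
		return 1;
	p = (char *)(((uintptr_t)raw + SZ_2M - 1) & ~(SZ_2M - 1));

	madvise(p, SZ_2M, MADV_HUGEPAGE);	/* hint only */
	memset(p, 1, SZ_2M);	/* fault in, possibly as one PMD-mapped folio */

	/*
	 * Partial unmap: drop the second half.  If the folio was PMD-mapped,
	 * the kernel has to split the PMD mapping and keep the first half
	 * mapped by PTEs, so a single "entire" mapcount can no longer
	 * describe how often the folio is mapped.
	 */
	munmap(p + SZ_2M / 2, SZ_2M / 2);
	return 0;
}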
On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: > On 10.08.23 23:48, Matthew Wilcox wrote: > > On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > > > AFAICS if that patch was all correct (while I'm not yet sure..), you can > > > actually fit your new total mapcount field into page 1 so even avoid the > > > extra cacheline access. You can have a look: the trick is refcount for > > > tail page 1 is still seems to be free on 32 bits (if that was your worry > > > before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > > > > No, refcount must be 0 on all tail pages. We rely on this in many places > > in the MM. > > Very right. Obviously I could have missed this in the past.. can I ask for an example explaining why refcount will be referenced before knowing it's a head? > > One could theoretically > > 1) Move the compound/entire mapcount to page[2] > 2) Make hugetlb stop using the entire mapcount and only the total > mapcount. > 3) Then leave total_mapcount and nr_pages_mapped in page[1] > 4) Make page_mapcount() use total_mapcount for hugetlb. > > When (un)mapping a PMD-mapped THP, we would go to page[2]. Otherwise, only > page[1]. > > The entire mapcount, similarly to nr_pages_mapped, primarily serves to get > memory stats right; well, and to implement page_mapcount() for THP. > > But I'm not 100% sure yet if the overhead from having nr_pages_mapped in > page[2] is significant enough at this point. Worth trying. Besides the cachelines, cmpxchg should also lock the memory bus and meanwhile affect the whole processor pipelines in some way. Maybe that effect is amplified too when it runs in a loop with one more counter.
On 11.08.23 17:18, Peter Xu wrote: > On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: >> On 10.08.23 23:48, Matthew Wilcox wrote: >>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can >>>> actually fit your new total mapcount field into page 1 so even avoid the >>>> extra cacheline access. You can have a look: the trick is refcount for >>>> tail page 1 is still seems to be free on 32 bits (if that was your worry >>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. >>> >>> No, refcount must be 0 on all tail pages. We rely on this in many places >>> in the MM. >> >> Very right. > > Obviously I could have missed this in the past.. can I ask for an example > explaining why refcount will be referenced before knowing it's a head? I think the issue is, when coming from a PFN walker (or GUP-fast), you might see "oh, this is a folio, let's lookup the head page". And you do that. Then, you try taking a reference on that head page. (see try_get_folio()). But as you didn't hold a reference on the folio yet, it can happily get freed + repurposed in the meantime, so maybe it's not a head page anymore. So if the field would get reused for something else, grabbing a reference would corrupt whatever is now stored in there.
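A rough sketch of the pattern being described, loosely modelled on try_get_folio() in mm/gup.c (simplified; the retry loop and devmap handling are omitted, and the function name is just for illustration):

#include <linux/mm.h>
#include <linux/page_ref.h>

static struct folio *speculative_get_folio(struct page *page, int refs)
{
	struct folio *folio = page_folio(page);

	/*
	 * No reference is held yet, so the memory backing "folio" can be
	 * freed and reused at any time.  The speculative increment below
	 * therefore has to hit a field that is a refcount in every possible
	 * reuse of that page.  Keeping tail pages at refcount 0 also makes
	 * the attempt fail cleanly here, since no reference can be taken
	 * on a zero refcount.
	 */
	if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
		return NULL;

	/* The folio is stable now; recheck that the page still belongs to it. */
	if (unlikely(page_folio(page) != folio)) {
		folio_put_refs(folio, refs);
		return NULL;
	}
	return folio;
}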
On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: > On 11.08.23 17:18, Peter Xu wrote: > > On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: > > > On 10.08.23 23:48, Matthew Wilcox wrote: > > > > On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > > > > > AFAICS if that patch was all correct (while I'm not yet sure..), you can > > > > > actually fit your new total mapcount field into page 1 so even avoid the > > > > > extra cacheline access. You can have a look: the trick is refcount for > > > > > tail page 1 is still seems to be free on 32 bits (if that was your worry > > > > > before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > > > > > > > > No, refcount must be 0 on all tail pages. We rely on this in many places > > > > in the MM. > > > > > > Very right. > > > > Obviously I could have missed this in the past.. can I ask for an example > > explaining why refcount will be referenced before knowing it's a head? > > I think the issue is, when coming from a PFN walker (or GUP-fast), you might > see "oh, this is a folio, let's lookup the head page". And you do that. > > Then, you try taking a reference on that head page. (see try_get_folio()). > > But as you didn't hold a reference on the folio yet, it can happily get > freed + repurposed in the meantime, so maybe it's not a head page anymore. > > So if the field would get reused for something else, grabbing a reference > would corrupt whatever is now stored in there. Not an issue before large folios, am I right? Because having a head page reused as tail cannot happen iiuc with current thps if only pmd-sized, because the head page is guaranteed to be pmd aligned physically. I don't really know, where a hugetlb 2M head can be reused by a 1G huge later right during the window of fast-gup walking. But obviously that's not common either if that could ever happen. Maybe Matthew was referring to something else (per "in many places")? Thanks,
On 11.08.23 17:58, Peter Xu wrote: > On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: >> On 11.08.23 17:18, Peter Xu wrote: >>> On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: >>>> On 10.08.23 23:48, Matthew Wilcox wrote: >>>>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >>>>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can >>>>>> actually fit your new total mapcount field into page 1 so even avoid the >>>>>> extra cacheline access. You can have a look: the trick is refcount for >>>>>> tail page 1 is still seems to be free on 32 bits (if that was your worry >>>>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. >>>>> >>>>> No, refcount must be 0 on all tail pages. We rely on this in many places >>>>> in the MM. >>>> >>>> Very right. >>> >>> Obviously I could have missed this in the past.. can I ask for an example >>> explaining why refcount will be referenced before knowing it's a head? >> >> I think the issue is, when coming from a PFN walker (or GUP-fast), you might >> see "oh, this is a folio, let's lookup the head page". And you do that. >> >> Then, you try taking a reference on that head page. (see try_get_folio()). >> >> But as you didn't hold a reference on the folio yet, it can happily get >> freed + repurposed in the meantime, so maybe it's not a head page anymore. >> >> So if the field would get reused for something else, grabbing a reference >> would corrupt whatever is now stored in there. > > Not an issue before large folios, am I right? Because having a head page > reused as tail cannot happen iiuc with current thps if only pmd-sized, > because the head page is guaranteed to be pmd aligned physically. There are other users of compound pages, no? THP and hugetlb are just two examples I think. For example, I can spot __GFP_COMP in slab code. Must such compound pages would not be applicable to GUP, though, but to PFN walkers could end up trying to grab them. > > I don't really know, where a hugetlb 2M head can be reused by a 1G huge > later right during the window of fast-gup walking. But obviously that's not > common either if that could ever happen. > > Maybe Matthew was referring to something else (per "in many places")? There are some other cases where PFN walkers want to identify tail pages to skip over them. See the comment in has_unmovable_pages().
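The PFN-walker pattern referred to here looks roughly like this (a sketch of the pattern only, not the actual has_unmovable_pages() code); such walkers need to recognize head vs. tail pages without holding any reference:

#include <linux/mm.h>
#include <linux/page-flags.h>

static void walk_pfn_range_sketch(unsigned long start_pfn,
				  unsigned long end_pfn)
{
	unsigned long pfn = start_pfn;

	while (pfn < end_pfn) {
		struct page *page = pfn_to_page(pfn);

		if (PageHuge(page) || PageTransCompound(page)) {
			struct folio *folio = page_folio(page);

			/* Skip the remaining pages of this compound page. */
			pfn += folio_nr_pages(folio) -
			       folio_page_idx(folio, page);
			continue;
		}
		/* ... per-base-page checks would go here ... */
		pfn++;
	}
}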
On 11 Aug 2023, at 12:08, David Hildenbrand wrote: > On 11.08.23 17:58, Peter Xu wrote: >> On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: >>> On 11.08.23 17:18, Peter Xu wrote: >>>> On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: >>>>> On 10.08.23 23:48, Matthew Wilcox wrote: >>>>>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: >>>>>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can >>>>>>> actually fit your new total mapcount field into page 1 so even avoid the >>>>>>> extra cacheline access. You can have a look: the trick is refcount for >>>>>>> tail page 1 is still seems to be free on 32 bits (if that was your worry >>>>>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. >>>>>> >>>>>> No, refcount must be 0 on all tail pages. We rely on this in many places >>>>>> in the MM. >>>>> >>>>> Very right. >>>> >>>> Obviously I could have missed this in the past.. can I ask for an example >>>> explaining why refcount will be referenced before knowing it's a head? >>> >>> I think the issue is, when coming from a PFN walker (or GUP-fast), you might >>> see "oh, this is a folio, let's lookup the head page". And you do that. >>> >>> Then, you try taking a reference on that head page. (see try_get_folio()). >>> >>> But as you didn't hold a reference on the folio yet, it can happily get >>> freed + repurposed in the meantime, so maybe it's not a head page anymore. >>> >>> So if the field would get reused for something else, grabbing a reference >>> would corrupt whatever is now stored in there. >> >> Not an issue before large folios, am I right? Because having a head page >> reused as tail cannot happen iiuc with current thps if only pmd-sized, >> because the head page is guaranteed to be pmd aligned physically. > > There are other users of compound pages, no? THP and hugetlb are just two examples I think. For example, I can spot __GFP_COMP in slab code. > > Must such compound pages would not be applicable to GUP, though, but to PFN walkers could end up trying to grab them. > For FS supporting large folios, their page cache pages can be any order <= PMD_ORDER. See page_cache_ra_order() in mm/readahead.c >> >> I don't really know, where a hugetlb 2M head can be reused by a 1G huge >> later right during the window of fast-gup walking. But obviously that's not >> common either if that could ever happen. >> >> Maybe Matthew was referring to something else (per "in many places")? > > There are some other cases where PFN walkers want to identify tail pages to skip over them. See the comment in has_unmovable_pages(). > > -- > Cheers, > > David / dhildenb -- Best Regards, Yan, Zi
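As a concrete example of how such non-PMD-order folios appear in the first place: a filesystem only has to opt in, and readahead may then allocate folios of any order up to PMD_ORDER via page_cache_ra_order(). A minimal sketch with a made-up filesystem name (XFS, AFS and EROFS do the equivalent in their inode setup paths):

#include <linux/fs.h>
#include <linux/pagemap.h>

static void myfs_setup_inode(struct inode *inode)
{
	/* Allow arbitrary-order folios in this file's page cache. */
	mapping_set_large_folios(inode->i_mapping);
}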
On Fri, Aug 11, 2023 at 12:11:55PM -0400, Zi Yan wrote: > On 11 Aug 2023, at 12:08, David Hildenbrand wrote: > > > On 11.08.23 17:58, Peter Xu wrote: > >> On Fri, Aug 11, 2023 at 05:32:37PM +0200, David Hildenbrand wrote: > >>> On 11.08.23 17:18, Peter Xu wrote: > >>>> On Fri, Aug 11, 2023 at 12:27:13AM +0200, David Hildenbrand wrote: > >>>>> On 10.08.23 23:48, Matthew Wilcox wrote: > >>>>>> On Thu, Aug 10, 2023 at 04:57:11PM -0400, Peter Xu wrote: > >>>>>>> AFAICS if that patch was all correct (while I'm not yet sure..), you can > >>>>>>> actually fit your new total mapcount field into page 1 so even avoid the > >>>>>>> extra cacheline access. You can have a look: the trick is refcount for > >>>>>>> tail page 1 is still seems to be free on 32 bits (if that was your worry > >>>>>>> before). Then it'll be very nice if to keep Hugh's counter all in tail 1. > >>>>>> > >>>>>> No, refcount must be 0 on all tail pages. We rely on this in many places > >>>>>> in the MM. > >>>>> > >>>>> Very right. > >>>> > >>>> Obviously I could have missed this in the past.. can I ask for an example > >>>> explaining why refcount will be referenced before knowing it's a head? > >>> > >>> I think the issue is, when coming from a PFN walker (or GUP-fast), you might > >>> see "oh, this is a folio, let's lookup the head page". And you do that. > >>> > >>> Then, you try taking a reference on that head page. (see try_get_folio()). > >>> > >>> But as you didn't hold a reference on the folio yet, it can happily get > >>> freed + repurposed in the meantime, so maybe it's not a head page anymore. > >>> > >>> So if the field would get reused for something else, grabbing a reference > >>> would corrupt whatever is now stored in there. > >> > >> Not an issue before large folios, am I right? Because having a head page > >> reused as tail cannot happen iiuc with current thps if only pmd-sized, > >> because the head page is guaranteed to be pmd aligned physically. > > > > There are other users of compound pages, no? THP and hugetlb are just two examples I think. For example, I can spot __GFP_COMP in slab code. > > > > Must such compound pages would not be applicable to GUP, though, but to PFN walkers could end up trying to grab them. > > > For FS supporting large folios, their page cache pages can be any order <= PMD_ORDER. > See page_cache_ra_order() in mm/readahead.c Ah yes.. > > >> > >> I don't really know, where a hugetlb 2M head can be reused by a 1G huge > >> later right during the window of fast-gup walking. But obviously that's not > >> common either if that could ever happen. > >> > >> Maybe Matthew was referring to something else (per "in many places")? > > > > There are some other cases where PFN walkers want to identify tail pages to skip over them. See the comment in has_unmovable_pages(). Indeed. Thanks!
diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 9a607059ea11..b0d3b1d3e8ea 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -116,14 +116,14 @@ pages: succeeds on tail pages. - map/unmap of a PMD entry for the whole THP increment/decrement - folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount - goes from -1 to 0 or 0 to -1. + folio->_entire_mapcount, increment/decrement folio->_total_mapcount + and also increment/decrement folio->_nr_pages_mapped by COMPOUND_MAPPED + when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement - page->_mapcount and also increment/decrement folio->_nr_pages_mapped - when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts - the number of pages mapped by PTE. + page->_mapcount, increment/decrement folio->_total_mapcount and also + increment/decrement folio->_nr_pages_mapped when page->_mapcount goes + from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE. split_huge_page internally has to distribute the refcounts in the head page to the tail pages before clearing all PG_head/tail bits from the page diff --git a/include/linux/mm.h b/include/linux/mm.h index 6a95dfed4957..30ac004fa0ef 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1187,41 +1187,24 @@ static inline int page_mapcount(struct page *page) return mapcount; } -int folio_total_mapcount(struct folio *folio); - /** - * folio_mapcount() - Calculate the number of mappings of this folio. + * folio_mapcount() - Return the total number of mappings of this folio. * @folio: The folio. * - * A large folio tracks both how many times the entire folio is mapped, - * and how many times each individual page in the folio is mapped. - * This function calculates the total number of times the folio is - * mapped. - * - * Return: The number of times this folio is mapped. + * Return: The total number of times this folio is mapped. */ static inline int folio_mapcount(struct folio *folio) { if (likely(!folio_test_large(folio))) return atomic_read(&folio->_mapcount) + 1; - return folio_total_mapcount(folio); + return atomic_read(&folio->_total_mapcount) + 1; } static inline int total_mapcount(struct page *page) { if (likely(!PageCompound(page))) return atomic_read(&page->_mapcount) + 1; - return folio_total_mapcount(page_folio(page)); -} - -static inline bool folio_large_is_mapped(struct folio *folio) -{ - /* - * Reading _entire_mapcount below could be omitted if hugetlb - * participated in incrementing nr_pages_mapped when compound mapped. 
- */ - return atomic_read(&folio->_nr_pages_mapped) > 0 || - atomic_read(&folio->_entire_mapcount) >= 0; + return atomic_read(&page_folio(page)->_total_mapcount) + 1; } /** @@ -1234,7 +1217,7 @@ static inline bool folio_mapped(struct folio *folio) { if (likely(!folio_test_large(folio))) return atomic_read(&folio->_mapcount) >= 0; - return folio_large_is_mapped(folio); + return atomic_read(&folio->_total_mapcount) >= 0; } /* @@ -1246,7 +1229,7 @@ static inline bool page_mapped(struct page *page) { if (likely(!PageCompound(page))) return atomic_read(&page->_mapcount) >= 0; - return folio_large_is_mapped(page_folio(page)); + return atomic_read(&page_folio(page)->_total_mapcount) >= 0; } static inline struct page *virt_to_head_page(const void *x) @@ -2148,7 +2131,7 @@ static inline size_t folio_size(struct folio *folio) * looking at the precise mapcount of the first subpage in the folio, and * assuming the other subpages are the same. This may not be true for large * folios. If you want exact mapcounts for exact calculations, look at - * page_mapcount() or folio_total_mapcount(). + * page_mapcount() or folio_mapcount(). * * Return: The estimated number of processes sharing a folio. */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 291c05cacd48..16e66b3332c6 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -267,7 +267,8 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page) * @_folio_dtor: Which destructor to use for this folio. * @_folio_order: Do not use directly, call folio_order(). * @_entire_mapcount: Do not use directly, call folio_entire_mapcount(). - * @_nr_pages_mapped: Do not use directly, call folio_mapcount(). + * @_total_mapcount: Do not use directly, call folio_mapcount(). + * @_nr_pages_mapped: Do not use outside of rmap code (and not for hugetlb). * @_pincount: Do not use directly, call folio_maybe_dma_pinned(). * @_folio_nr_pages: Do not use directly, call folio_nr_pages(). * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h. 
@@ -321,7 +322,7 @@ struct folio { unsigned char _folio_dtor; unsigned char _folio_order; atomic_t _entire_mapcount; - atomic_t _nr_pages_mapped; + atomic_t _total_mapcount; atomic_t _pincount; #ifdef CONFIG_64BIT unsigned int _folio_nr_pages; @@ -346,6 +347,7 @@ struct folio { unsigned long _head_2a; /* public: */ struct list_head _deferred_list; + atomic_t _nr_pages_mapped; /* private: the union with struct page is transitional */ }; struct page __page_2; diff --git a/include/linux/rmap.h b/include/linux/rmap.h index a3825ce81102..a7b1c005e0c9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -210,14 +210,17 @@ void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *, static inline void __page_dup_rmap(struct page *page, bool compound) { - if (compound) { - struct folio *folio = (struct folio *)page; + struct folio *folio = page_folio(page); + if (compound) { VM_BUG_ON_PAGE(compound && !PageHead(page), page); atomic_inc(&folio->_entire_mapcount); } else { atomic_inc(&page->_mapcount); } + + if (folio_test_large(folio)) + atomic_inc(&folio->_total_mapcount); } static inline void page_dup_file_rmap(struct page *page, bool compound) diff --git a/mm/debug.c b/mm/debug.c index ee533a5ceb79..97f6f6b32ae7 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -99,10 +99,10 @@ static void __dump_page(struct page *page) page, page_ref_count(head), mapcount, mapping, page_to_pgoff(page), page_to_pfn(page)); if (compound) { - pr_warn("head:%p order:%u entire_mapcount:%d nr_pages_mapped:%d pincount:%d\n", + pr_warn("head:%p order:%u entire_mapcount:%d total_mapcount:%d pincount:%d\n", head, compound_order(head), folio_entire_mapcount(folio), - folio_nr_pages_mapped(folio), + folio_mapcount(folio), atomic_read(&folio->_pincount)); } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 154c210892a1..43a2150e41e3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -583,6 +583,7 @@ void prep_transhuge_page(struct page *page) VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); INIT_LIST_HEAD(&folio->_deferred_list); + atomic_set(&folio->_nr_pages_mapped, 0); folio_set_compound_dtor(folio, TRANSHUGE_PAGE_DTOR); } @@ -2795,6 +2796,7 @@ void free_transhuge_page(struct page *page) } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } + VM_WARN_ON_ONCE(atomic_read(&folio->_nr_pages_mapped)); free_compound_page(page); } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f498e8025cc..6a614c559ccf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1479,7 +1479,7 @@ static void __destroy_compound_gigantic_folio(struct folio *folio, struct page *p; atomic_set(&folio->_entire_mapcount, 0); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, 0); atomic_set(&folio->_pincount, 0); for (i = 1; i < nr_pages; i++) { @@ -2027,7 +2027,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, /* we rely on prep_new_hugetlb_folio to set the destructor */ folio_set_order(folio, order); atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_pincount, 0); return true; diff --git a/mm/internal.h b/mm/internal.h index 02a6fd36f46e..e4646fed44a5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -53,10 +53,10 @@ struct folio_batch; void page_writeback_init(void); /* - * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages, + * If a 16GB folio were mapped by PTEs of all of its 4kB pages, * its nr_pages_mapped would be 0x400000: choose the COMPOUND_MAPPED bit - * 
above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Hugetlb currently - * leaves nr_pages_mapped at 0, but avoid surprise if it participates later. + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE). Not applicable to + * hugetlb. */ #define COMPOUND_MAPPED 0x800000 #define FOLIO_PAGES_MAPPED (COMPOUND_MAPPED - 1) @@ -67,15 +67,6 @@ void page_writeback_init(void); */ #define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ -/* - * How many individual pages have an elevated _mapcount. Excludes - * the folio's entire_mapcount. - */ -static inline int folio_nr_pages_mapped(struct folio *folio) -{ - return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED; -} - static inline void *folio_raw_mapping(struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; @@ -420,7 +411,7 @@ static inline void prep_compound_head(struct page *page, unsigned int order) folio_set_compound_dtor(folio, COMPOUND_PAGE_DTOR); folio_set_order(folio, order); atomic_set(&folio->_entire_mapcount, -1); - atomic_set(&folio->_nr_pages_mapped, 0); + atomic_set(&folio->_total_mapcount, -1); atomic_set(&folio->_pincount, 0); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 96b7c1a7d1f2..5bbd5782b543 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1009,8 +1009,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page) bad_page(page, "nonzero entire_mapcount"); goto out; } - if (unlikely(atomic_read(&folio->_nr_pages_mapped))) { - bad_page(page, "nonzero nr_pages_mapped"); + if (unlikely(atomic_read(&folio->_total_mapcount) + 1)) { + bad_page(page, "nonzero total_mapcount"); goto out; } if (unlikely(atomic_read(&folio->_pincount))) { diff --git a/mm/rmap.c b/mm/rmap.c index 1f04debdc87a..a7dcae48245b 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1070,29 +1070,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, return page_vma_mkclean_one(&pvmw); } -int folio_total_mapcount(struct folio *folio) -{ - int mapcount = folio_entire_mapcount(folio); - int nr_pages; - int i; - - /* In the common case, avoid the loop when no pages mapped by PTE */ - if (folio_nr_pages_mapped(folio) == 0) - return mapcount; - /* - * Add all the PTE mappings of those pages mapped by PTE. - * Limit the loop to folio_nr_pages_mapped()? - * Perhaps: given all the raciness, that may be a good or a bad idea. 
- */ - nr_pages = folio_nr_pages(folio); - for (i = 0; i < nr_pages; i++) - mapcount += atomic_read(&folio_page(folio, i)->_mapcount); - - /* But each of those _mapcounts was based on -1 */ - mapcount += nr_pages; - return mapcount; -} - /** * page_move_anon_rmap - move a page to our anon_vma * @page: the page to move to our anon_vma @@ -1236,6 +1213,9 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, } } + if (folio_test_large(folio)) + atomic_inc(&folio->_total_mapcount); + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); @@ -1289,6 +1269,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); } + if (folio_test_large(folio)) + /* increment count (starts at -1) */ + atomic_set(&folio->_total_mapcount, 0); + __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); __page_set_anon_rmap(folio, &folio->page, vma, address, 1); } @@ -1310,14 +1294,14 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, bool compound) { atomic_t *mapped = &folio->_nr_pages_mapped; - unsigned int nr_pmdmapped = 0, first; + unsigned int nr_pmdmapped = 0, i, first; int nr = 0; VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio); /* Is page being mapped by PTE? Is this its first map to be added? */ if (likely(!compound)) { - do { + for (i = 0; i < nr_pages; i++, page++) { first = atomic_inc_and_test(&page->_mapcount); if (first && folio_test_large(folio)) { first = atomic_inc_return_relaxed(mapped); @@ -1326,7 +1310,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, if (first) nr++; - } while (page++, --nr_pages > 0); + } } else if (folio_test_pmd_mappable(folio)) { /* That test is redundant: it's for safety or to optimize out */ @@ -1346,6 +1330,9 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page, } } + if (folio_test_large(folio)) + atomic_add(compound ? 1 : nr_pages, &folio->_total_mapcount); + if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ? NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped); @@ -1398,6 +1385,9 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(compound && !PageHead(page), page); + if (folio_test_large(folio)) + atomic_dec(&folio->_total_mapcount); + /* Hugetlb pages are not counted in NR_*MAPPED */ if (unlikely(folio_test_hugetlb(folio))) { /* hugetlb pages are always mapped with pmds */ @@ -2538,6 +2528,7 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, BUG_ON(!anon_vma); /* address might be in next vma when migration races vma_merge */ first = atomic_inc_and_test(&folio->_entire_mapcount); + atomic_inc(&folio->_total_mapcount); VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) @@ -2551,6 +2542,7 @@ void hugepage_add_new_anon_rmap(struct folio *folio, BUG_ON(address < vma->vm_start || address >= vma->vm_end); /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); + atomic_set(&folio->_total_mapcount, 0); folio_clear_hugetlb_restore_reserve(folio); __page_set_anon_rmap(folio, &folio->page, vma, address, 1); }
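For completeness, this is the kind of lockless caller discussed earlier in the thread; with the patch applied it boils down to a single atomic read of _total_mapcount even for a large folio (an illustrative helper, not part of the patch):

#include <linux/mm.h>

static bool folio_probably_shared(struct folio *folio)
{
	/*
	 * Callers often hold neither the PTL nor the folio lock, so this
	 * is inherently a snapshot; but it is now one atomic read instead
	 * of a loop over every subpage of the folio.
	 */
	return folio_mapcount(folio) > 1;
}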