[v10,2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()

Message ID 20250307174001.242794-3-ziy@nvidia.com (mailing list archive)
State New
Series Buddy allocator like (or non-uniform) folio split

Commit Message

Zi Yan March 7, 2025, 5:39 p.m. UTC
This is a preparation patch; both added functions are not used yet.

The added __split_unmapped_folio() can split a folio with its
mapping removed in two ways: 1) uniform split (the existing way), and
2) buddy allocator like (or non-uniform) split.

The added __split_folio_to_order() can split a folio into any lower order.
For uniform split, __split_unmapped_folio() calls it once to split the
given folio to the new order. For buddy allocator like (non-uniform)
split, __split_unmapped_folio() calls it (folio_order - new_order) times
and each time splits the folio containing the given page to one lower
order.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
---
 mm/huge_memory.c | 348 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 347 insertions(+), 1 deletion(-)
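
To make the two split modes described above concrete, here is a minimal
standalone sketch (plain C, not the kernel code; the orders, the target
page index and all names are illustrative):

/*
 * Standalone illustration (not kernel code) of the folios produced by a
 * buddy allocator like split from old_order down to new_order: each step
 * splits only the half that contains the target page, so the result is
 * one folio of each order from old_order - 1 down to new_order, plus one
 * extra new_order folio holding the target page.
 */
#include <stdio.h>

int main(void)
{
	int old_order = 9;	/* e.g. a 2MB THP made of 4KB pages */
	int new_order = 0;
	long target = 123;	/* offset of the target page within the folio */
	long start = 0;		/* first page of the folio being split */

	for (int order = old_order - 1; order >= new_order; order--) {
		long half = 1L << order;

		/* the half that does NOT contain the target is released as-is */
		if (target < start + half) {
			printf("order-%d folio at pages [%ld, %ld)\n",
			       order, start + half, start + 2 * half);
		} else {
			printf("order-%d folio at pages [%ld, %ld)\n",
			       order, start, start + half);
			start += half;	/* keep splitting the other half */
		}
	}
	/* the remaining new_order folio contains the target page */
	printf("order-%d folio (contains target) at pages [%ld, %ld)\n",
	       new_order, start, start + (1L << new_order));
	return 0;
}

For old_order = 9 and new_order = 0 this prints one folio of each order
from 8 down to 0 plus a second order-0 folio containing the target page,
matching the "(folio_order - new_order) calls, one lower order each time"
description; a uniform split would instead yield 2^(old_order - new_order)
folios of the same order. The kernel code additionally skips order-1 for
anonymous folios, which this sketch does not model.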

Comments

Zi Yan March 10, 2025, 4:14 p.m. UTC | #1
On 7 Mar 2025, at 12:39, Zi Yan wrote:

> This is a preparation patch, both added functions are not used yet.
>
> The added __split_unmapped_folio() is able to split a folio with its
> mapping removed in two manners: 1) uniform split (the existing way), and
> 2) buddy allocator like (or non-uniform) split.
>
> The added __split_folio_to_order() can split a folio into any lower order.
> For uniform split, __split_unmapped_folio() calls it once to split the
> given folio to the new order. For buddy allocator like (non-uniform)
> split, __split_unmapped_folio() calls it (folio_order - new_order) times
> and each time splits the folio containing the given page to one lower
> order.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Kairui Song <kasong@tencent.com>
> ---
>  mm/huge_memory.c | 348 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 347 insertions(+), 1 deletion(-)

Hi Andrew,

The patch below should fix the issues discovered by Hugh. Please fold
it into this patch. Thank you for all the help.


From 22ced0e84e756a1084a1eb32d1de596ca10e3b3c Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 10 Mar 2025 11:59:42 -0400
Subject: [PATCH] mm/huge_memory: unfreeze head folio after page cache entries
 are updated

Otherwise, others can grab the head folio and see stale page cache entries,
which can lead to data corruption.

Drop large beyond-EOF tail folios with the right number of refs to prevent
a memory leak.

Reported-by: Hugh Dickins <hughd@google.com>
Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a42150298de..f06508e4d242 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3573,17 +3573,18 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 			}

 			/*
-			 * Unfreeze refcount first. Additional reference from
-			 * page cache.
+			 * origin_folio should be kept frozen until page cache
+			 * entries are updated with all the other after-split
+			 * folios to prevent others seeing stale page cache
+			 * entries.
 			 */
-			folio_ref_unfreeze(release,
-				1 + ((!folio_test_anon(origin_folio) ||
-				     folio_test_swapcache(origin_folio)) ?
-					     folio_nr_pages(release) : 0));
-
 			if (release == origin_folio)
 				continue;

+			folio_ref_unfreeze(release, 1 +
+					((mapping || swap_cache) ?
+						folio_nr_pages(release) : 0));
+
 			lru_add_page_tail(origin_folio, &release->page,
 						lruvec, list);

@@ -3595,7 +3596,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 					folio_account_cleaned(release,
 						inode_to_wb(mapping->host));
 				__filemap_remove_folio(release, NULL);
-				folio_put(release);
+				folio_put_refs(release, folio_nr_pages(release));
 			} else if (mapping) {
 				__xa_store(&mapping->i_pages,
 						release->index, release, 0);
@@ -3607,6 +3608,15 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		}
 	}

+	/*
+	 * Unfreeze origin_folio only after all page cache entries, which used
+	 * to point to it, have been updated with new folios. Otherwise,
+	 * a parallel folio_try_get() can grab origin_folio and its caller can
+	 * see stale page cache entries.
+	 */
+	folio_ref_unfreeze(origin_folio, 1 +
+		((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
+
 	unlock_page_lruvec(lruvec);

 	if (swap_cache)
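
The ordering this fix enforces can be seen in a toy, single-threaded model
(plain C, no kernel APIs; try_get() merely stands in for folio_try_get(),
and the folio sizes and lookup index are made up): a folio may become
grabbable only after every page cache slot that used to point to it has
been rewritten to the correct after-split folio.

#include <stdio.h>
#include <stdbool.h>

struct toy_folio {
	const char *name;
	long first_index;	/* first page index this folio covers */
	long nr_pages;
	int refcount;		/* 0 models a frozen refcount */
};

/* stand-in for folio_try_get(): fails while the refcount is frozen */
static bool try_get(struct toy_folio *f)
{
	return f->refcount > 0;
}

int main(void)
{
	/* an order-9 folio just split into two order-8 halves */
	struct toy_folio head = { "head", 0, 256, 0 };
	struct toy_folio tail = { "tail", 256, 256, 0 };
	struct toy_folio *slot_for_300 = &head;	/* cache slot not rewritten yet */

	/* buggy ordering: unfreeze head before the slot is rewritten */
	head.refcount = 1;
	if (try_get(slot_for_300) &&
	    300 >= slot_for_300->first_index + slot_for_300->nr_pages)
		printf("buggy: lookup(300) got '%s', which no longer covers index 300\n",
		       slot_for_300->name);
	head.refcount = 0;	/* rewind the model */

	/* fixed ordering: rewrite the slot first, only then unfreeze */
	slot_for_300 = &tail;
	tail.refcount = 1;
	head.refcount = 1;
	if (try_get(slot_for_300))
		printf("fixed: lookup(300) got '%s' covering [%ld, %ld)\n",
		       slot_for_300->name, slot_for_300->first_index,
		       slot_for_300->first_index + slot_for_300->nr_pages);
	return 0;
}

In the kernel the lookup runs concurrently with the split, so the frozen
refcount is what makes a page cache lookup back off and retry instead of
returning origin_folio for an index it no longer covers; once every slot
has been moved over to the new folios, unfreezing origin_folio is safe.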
Matthew Wilcox March 10, 2025, 4:30 p.m. UTC | #2
On Fri, Mar 07, 2025 at 12:39:55PM -0500, Zi Yan wrote:
> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
> +		struct page *head = &folio->page;
> +		struct page *new_head = head + index;
> +
> +		/*
> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
> +		 * Don't pass it around before clear_compound_head().
> +		 */
> +		struct folio *new_folio = (struct folio *)new_head;
[...]
> +		/* ->mapping in first and second tail page is replaced by other uses */
> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
> +			       new_head);
> +		new_head->mapping = head->mapping;
> +		new_head->index = head->index + index;

Why are you using new_head->mapping and ->index instead of new_folio?
Zi Yan March 10, 2025, 4:39 p.m. UTC | #3
On 10 Mar 2025, at 12:30, Matthew Wilcox wrote:

> On Fri, Mar 07, 2025 at 12:39:55PM -0500, Zi Yan wrote:
>> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
>> +		struct page *head = &folio->page;
>> +		struct page *new_head = head + index;
>> +
>> +		/*
>> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
>> +		 * Don't pass it around before clear_compound_head().
>> +		 */
>> +		struct folio *new_folio = (struct folio *)new_head;
> [...]
>> +		/* ->mapping in first and second tail page is replaced by other uses */
>> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
>> +			       new_head);
>> +		new_head->mapping = head->mapping;
>> +		new_head->index = head->index + index;
>
> Why are you using new_head->mapping and ->index instead of new_folio?

Because of the “Careful” comment. But new_folio->* should be fine,
since it is the same as new_head. So I probably can replace all
new_head with new_folio except those VM_BUG_ON_PAGE checks?


Best Regards,
Yan, Zi
Zi Yan March 10, 2025, 4:42 p.m. UTC | #4
On 10 Mar 2025, at 12:39, Zi Yan wrote:

> On 10 Mar 2025, at 12:30, Matthew Wilcox wrote:
>
>> On Fri, Mar 07, 2025 at 12:39:55PM -0500, Zi Yan wrote:
>>> +	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
>>> +		struct page *head = &folio->page;
>>> +		struct page *new_head = head + index;
>>> +
>>> +		/*
>>> +		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
>>> +		 * Don't pass it around before clear_compound_head().
>>> +		 */
>>> +		struct folio *new_folio = (struct folio *)new_head;
>> [...]
>>> +		/* ->mapping in first and second tail page is replaced by other uses */
>>> +		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
>>> +			       new_head);
>>> +		new_head->mapping = head->mapping;
>>> +		new_head->index = head->index + index;
>>
>> Why are you using new_head->mapping and ->index instead of new_folio?
>
> Because of the “Careful” comment. But new_folio->* should be fine,
> since it is the same as new_head. So I probably can replace all
> new_head with new_folio except those VM_BUG_ON_PAGE checks?

Something like?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f06508e4d242..007c927536bb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3341,8 +3341,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		 * unreferenced sub-pages of an anonymous THP: we can simply drop
 		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
 		 */
-		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-		new_head->flags |= (head->flags &
+		new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		new_folio->flags |= (head->flags &
 				((1L << PG_referenced) |
 				 (1L << PG_swapbacked) |
 				 (1L << PG_swapcache) |
@@ -3364,8 +3364,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		/* ->mapping in first and second tail page is replaced by other uses */
 		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
 			       new_head);
-		new_head->mapping = head->mapping;
-		new_head->index = head->index + index;
+		new_folio->mapping = head->mapping;
+		new_folio->index = head->index + index;

 		/*
 		 * page->private should not be set in tail pages. Fix up and warn once



Best Regards,
Yan, Zi
Matthew Wilcox March 10, 2025, 5 p.m. UTC | #5
On Mon, Mar 10, 2025 at 12:42:06PM -0400, Zi Yan wrote:
> > Because of the “Careful” comment. But new_folio->* should be fine,
> > since it is the same as new_head. So I probably can replace all
> > new_head with new_folio except those VM_BUG_ON_PAGE checks?

Why not also the VM_BUG_ON_PAGE check?  I mean:

> @@ -3364,8 +3364,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  		/* ->mapping in first and second tail page is replaced by other uses */
>  		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
>  			       new_head);

		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_folio->mapping != TAIL_MAPPING, new_head);

(or we could just ditch the assert entirely; it's not all that useful)

> -		new_head->mapping = head->mapping;
> -		new_head->index = head->index + index;
> +		new_folio->mapping = head->mapping;
> +		new_folio->index = head->index + index;

	new_folio->mapping = folio->mapping;
	new_folio->index = folio->index + index;

(um, and that index + index looks weird; better name might be just 'i')
Zi Yan March 10, 2025, 5:05 p.m. UTC | #6
On 10 Mar 2025, at 13:00, Matthew Wilcox wrote:

> On Mon, Mar 10, 2025 at 12:42:06PM -0400, Zi Yan wrote:
>>> Because of the “Careful” comment. But new_folio->* should be fine,
>>> since it is the same as new_head. So I probably can replace all
>>> new_head with new_folio except those VM_BUG_ON_PAGE checks?
>
> Why not also the VM_BUG_ON_PAGE check?  I mean:
>
>> @@ -3364,8 +3364,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>  		/* ->mapping in first and second tail page is replaced by other uses */
>>  		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
>>  			       new_head);
>
> 		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_folio->mapping != TAIL_MAPPING, new_head);

We are checking new_folio but dumping new_head, so it can cause some confusion.
But it might not be that bad.
>
> (or we could just ditch the assert entirely; it's not all that useful)

I am open to that.

>
>> -		new_head->mapping = head->mapping;
>> -		new_head->index = head->index + index;
>> +		new_folio->mapping = head->mapping;
>> +		new_folio->index = head->index + index;
>
> 	new_folio->mapping = folio->mapping
> 	new_folio->index = folio->index +index;
>
> (um, and that index + index looks weird; better name might be just 'i')

OK. Let me make the changes you suggested and fold them into Hugh’s fix patch,
before Andrew picks that up.

Best Regards,
Yan, Zi
Zi Yan March 10, 2025, 5:32 p.m. UTC | #7
On 10 Mar 2025, at 12:14, Zi Yan wrote:

> On 7 Mar 2025, at 12:39, Zi Yan wrote:
>
>> This is a preparation patch, both added functions are not used yet.
>>
>> The added __split_unmapped_folio() is able to split a folio with its
>> mapping removed in two manners: 1) uniform split (the existing way), and
>> 2) buddy allocator like (or non-uniform) split.
>>
>> The added __split_folio_to_order() can split a folio into any lower order.
>> For uniform split, __split_unmapped_folio() calls it once to split the
>> given folio to the new order. For buddy allocator like (non-uniform)
>> split, __split_unmapped_folio() calls it (folio_order - new_order) times
>> and each time splits the folio containing the given page to one lower
>> order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: John Hubbard <jhubbard@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Miaohe Lin <linmiaohe@huawei.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Yang Shi <yang@os.amperecomputing.com>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Kairui Song <kasong@tencent.com>
>> ---
>>  mm/huge_memory.c | 348 ++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 347 insertions(+), 1 deletion(-)
>
> Hi Andrew,
>
> The patch below should fix the issues discovered by Hugh. Please fold
> it into this patch. Thank you for all the help.
>

Hi Andrew,

This is the updated version, including:
1. Hugh’s fix for unfreezing the head folio correctly,
2. Hugh’s fix for dropping the right number of refs on tail folios,
3. Matthew’s suggestion of using new_folio instead of new_head and
   i instead of index.


From 82b40c8d0fb3959d0d438929a3e4166f0785fe56 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 10 Mar 2025 11:59:42 -0400
Subject: [PATCH] mm/huge_memory: unfreeze head folio after page cache entries
 are updated

Otherwise, others can grab the head folio and see stale page cache entries,
which can lead to data corruption.

Drop large beyond-EOF tail folios with the right number of refs to prevent
a memory leak.

Also include Matthew's suggestions on __split_folio_to_order() [1].

[1] https://lore.kernel.org/all/Z88ar5YS99HsIRYo@casper.infradead.org/

Reported-by: Hugh Dickins <hughd@google.com>
Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 52 +++++++++++++++++++++++++++---------------------
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c10ee77189bd..220a6e833003 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3525,15 +3525,14 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 {
 	long new_nr_pages = 1 << new_order;
 	long nr_pages = 1 << old_order;
-	long index;
+	long i;

 	/*
 	 * Skip the first new_nr_pages, since the new folio from them have all
 	 * the flags from the original folio.
 	 */
-	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
-		struct page *head = &folio->page;
-		struct page *new_head = head + index;
+	for (i = new_nr_pages; i < nr_pages; i += new_nr_pages) {
+		struct page *new_head = &folio->page + i;

 		/*
 		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
@@ -3541,7 +3540,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		 */
 		struct folio *new_folio = (struct folio *)new_head;

-		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
+		VM_BUG_ON_PAGE(atomic_read(&new_folio->_mapcount) != -1, new_head);

 		/*
 		 * Clone page flags before unfreezing refcount.
@@ -3556,8 +3555,8 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		 * unreferenced sub-pages of an anonymous THP: we can simply drop
 		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
 		 */
-		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-		new_head->flags |= (head->flags &
+		new_folio->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		new_folio->flags |= (folio->flags &
 				((1L << PG_referenced) |
 				 (1L << PG_swapbacked) |
 				 (1L << PG_swapcache) |
@@ -3576,23 +3575,20 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 				 (1L << PG_dirty) |
 				 LRU_GEN_MASK | LRU_REFS_MASK));

-		/* ->mapping in first and second tail page is replaced by other uses */
-		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
-			       new_head);
-		new_head->mapping = head->mapping;
-		new_head->index = head->index + index;
+		new_folio->mapping = folio->mapping;
+		new_folio->index = folio->index + i;

 		/*
 		 * page->private should not be set in tail pages. Fix up and warn once
 		 * if private is unexpectedly set.
 		 */
-		if (unlikely(new_head->private)) {
+		if (unlikely(new_folio->private)) {
 			VM_WARN_ON_ONCE_PAGE(true, new_head);
-			new_head->private = 0;
+			new_folio->private = 0;
 		}

 		if (folio_test_swapcache(folio))
-			new_folio->swap.val = folio->swap.val + index;
+			new_folio->swap.val = folio->swap.val + i;

 		/* Page flags must be visible before we make the page non-compound. */
 		smp_wmb();
@@ -3788,17 +3784,18 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 			}

 			/*
-			 * Unfreeze refcount first. Additional reference from
-			 * page cache.
+			 * origin_folio should be kept frozen until page cache
+			 * entries are updated with all the other after-split
+			 * folios to prevent others seeing stale page cache
+			 * entries.
 			 */
-			folio_ref_unfreeze(release,
-				1 + ((!folio_test_anon(origin_folio) ||
-				     folio_test_swapcache(origin_folio)) ?
-					     folio_nr_pages(release) : 0));
-
 			if (release == origin_folio)
 				continue;

+			folio_ref_unfreeze(release, 1 +
+					((mapping || swap_cache) ?
+						folio_nr_pages(release) : 0));
+
 			lru_add_page_tail(origin_folio, &release->page,
 						lruvec, list);

@@ -3810,7 +3807,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 					folio_account_cleaned(release,
 						inode_to_wb(mapping->host));
 				__filemap_remove_folio(release, NULL);
-				folio_put(release);
+				folio_put_refs(release, folio_nr_pages(release));
 			} else if (mapping) {
 				__xa_store(&mapping->i_pages,
 						release->index, release, 0);
@@ -3822,6 +3819,15 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		}
 	}

+	/*
+	 * Unfreeze origin_folio only after all page cache entries, which used
+	 * to point to it, have been updated with new folios. Otherwise,
+	 * a parallel folio_try_get() can grab origin_folio and its caller can
+	 * see stale page cache entries.
+	 */
+	folio_ref_unfreeze(origin_folio, 1 +
+		((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
+
 	unlock_page_lruvec(lruvec);

 	if (swap_cache)

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3c5d01aecac8..c10ee77189bd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3265,7 +3265,6 @@  static void remap_page(struct folio *folio, unsigned long nr, int flags)
 static void lru_add_page_tail(struct folio *folio, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
 	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
@@ -3517,6 +3516,353 @@  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 					caller_pins;
 }
 
+/*
+ * It splits @folio into @new_order folios and copies the @folio metadata to
+ * all the resulting folios.
+ */
+static void __split_folio_to_order(struct folio *folio, int old_order,
+		int new_order)
+{
+	long new_nr_pages = 1 << new_order;
+	long nr_pages = 1 << old_order;
+	long index;
+
+	/*
+	 * Skip the first new_nr_pages, since the new folio from them have all
+	 * the flags from the original folio.
+	 */
+	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
+		struct page *head = &folio->page;
+		struct page *new_head = head + index;
+
+		/*
+		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
+		 * Don't pass it around before clear_compound_head().
+		 */
+		struct folio *new_folio = (struct folio *)new_head;
+
+		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
+
+		/*
+		 * Clone page flags before unfreezing refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow flags change,
+		 * for example lock_page() which set PG_waiters.
+		 *
+		 * Note that for mapped sub-pages of an anonymous THP,
+		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
+		 * the migration entry instead from where remap_page() will restore it.
+		 * We can still have PG_anon_exclusive set on effectively unmapped and
+		 * unreferenced sub-pages of an anonymous THP: we can simply drop
+		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
+		 */
+		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		new_head->flags |= (head->flags &
+				((1L << PG_referenced) |
+				 (1L << PG_swapbacked) |
+				 (1L << PG_swapcache) |
+				 (1L << PG_mlocked) |
+				 (1L << PG_uptodate) |
+				 (1L << PG_active) |
+				 (1L << PG_workingset) |
+				 (1L << PG_locked) |
+				 (1L << PG_unevictable) |
+#ifdef CONFIG_ARCH_USES_PG_ARCH_2
+				 (1L << PG_arch_2) |
+#endif
+#ifdef CONFIG_ARCH_USES_PG_ARCH_3
+				 (1L << PG_arch_3) |
+#endif
+				 (1L << PG_dirty) |
+				 LRU_GEN_MASK | LRU_REFS_MASK));
+
+		/* ->mapping in first and second tail page is replaced by other uses */
+		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
+			       new_head);
+		new_head->mapping = head->mapping;
+		new_head->index = head->index + index;
+
+		/*
+		 * page->private should not be set in tail pages. Fix up and warn once
+		 * if private is unexpectedly set.
+		 */
+		if (unlikely(new_head->private)) {
+			VM_WARN_ON_ONCE_PAGE(true, new_head);
+			new_head->private = 0;
+		}
+
+		if (folio_test_swapcache(folio))
+			new_folio->swap.val = folio->swap.val + index;
+
+		/* Page flags must be visible before we make the page non-compound. */
+		smp_wmb();
+
+		/*
+		 * Clear PageTail before unfreezing page refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow put_page()
+		 * which needs correct compound_head().
+		 */
+		clear_compound_head(new_head);
+		if (new_order) {
+			prep_compound_page(new_head, new_order);
+			folio_set_large_rmappable(new_folio);
+		}
+
+		if (folio_test_young(folio))
+			folio_set_young(new_folio);
+		if (folio_test_idle(folio))
+			folio_set_idle(new_folio);
+
+		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
+	}
+
+	if (new_order)
+		folio_set_order(folio, new_order);
+	else
+		ClearPageCompound(&folio->page);
+}
+
+/*
+ * It splits an unmapped @folio to lower order smaller folios in two ways.
+ * @folio: the to-be-split folio
+ * @new_order: the smallest order of the after split folios (since buddy
+ *             allocator like split generates folios with orders from @folio's
+ *             order - 1 to new_order).
+ * @split_at: in buddy allocator like split, the folio containing @split_at
+ *            will be split until its order becomes @new_order.
+ * @lock_at: the folio containing @lock_at is left locked for caller.
+ * @list: the after split folios will be added to @list if it is not NULL,
+ *        otherwise to LRU lists.
+ * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
+ * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
+ * @mapping: @folio->mapping
+ * @uniform_split: if the split is uniform or not (buddy allocator like split)
+ *
+ *
+ * 1. uniform split: the given @folio is split into multiple @new_order small folios,
+ *    where all small folios have the same order. This is done when
+ *    uniform_split is true.
+ * 2. buddy allocator like (non-uniform) split: the given @folio is split into
+ *    half and one of the half (containing the given page) is split into half
+ *    until the given @page's order becomes @new_order. This is done when
+ *    uniform_split is false.
+ *
+ * The high level flow for these two methods is:
+ * 1. uniform split: a single __split_folio_to_order() is called to split the
+ *    @folio into @new_order, then we traverse all the resulting folios one by
+ *    one in PFN ascending order and perform stats, unfreeze, adding to list,
+ *    and file mapping index operations.
+ * 2. non-uniform split: in general, folio_order - @new_order calls to
+ *    __split_folio_to_order() are made in a for loop to split the @folio
+ *    to one lower order at a time. The resulting small folios are processed
+ *    like what is done during the traversal in 1, except the one containing
+ *    @page, which is split in next for loop.
+ *
+ * After splitting, the caller's folio reference will be transferred to the
+ * folio containing @page. The other folios may be freed if they are not mapped.
+ *
+ * In terms of locking, after splitting,
+ * 1. uniform split leaves @page (or the folio contains it) locked;
+ * 2. buddy allocator like (non-uniform) split leaves @folio locked.
+ *
+ *
+ * For !uniform_split, when -ENOMEM is returned, the original folio might be
+ * split. The caller needs to check the input folio.
+ */
+static int __split_unmapped_folio(struct folio *folio, int new_order,
+		struct page *split_at, struct page *lock_at,
+		struct list_head *list, pgoff_t end,
+		struct xa_state *xas, struct address_space *mapping,
+		bool uniform_split)
+{
+	struct lruvec *lruvec;
+	struct address_space *swap_cache = NULL;
+	struct folio *origin_folio = folio;
+	struct folio *next_folio = folio_next(folio);
+	struct folio *new_folio;
+	struct folio *next;
+	int order = folio_order(folio);
+	int split_order;
+	int start_order = uniform_split ? new_order : order - 1;
+	int nr_dropped = 0;
+	int ret = 0;
+	bool stop_split = false;
+
+	if (folio_test_swapcache(folio)) {
+		VM_BUG_ON(mapping);
+
+		/* a swapcache folio can only be uniformly split to order-0 */
+		if (!uniform_split || new_order != 0)
+			return -EINVAL;
+
+		swap_cache = swap_address_space(folio->swap);
+		xa_lock(&swap_cache->i_pages);
+	}
+
+	if (folio_test_anon(folio))
+		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
+
+	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+	lruvec = folio_lruvec_lock(folio);
+
+	folio_clear_has_hwpoisoned(folio);
+
+	/*
+	 * split to new_order one order at a time. For uniform split,
+	 * folio is split to new_order directly.
+	 */
+	for (split_order = start_order;
+	     split_order >= new_order && !stop_split;
+	     split_order--) {
+		int old_order = folio_order(folio);
+		struct folio *release;
+		struct folio *end_folio = folio_next(folio);
+
+		/* order-1 anonymous folio is not supported */
+		if (folio_test_anon(folio) && split_order == 1)
+			continue;
+		if (uniform_split && split_order != new_order)
+			continue;
+
+		if (mapping) {
+			/*
+			 * uniform split has xas_split_alloc() called before
+			 * irq is disabled to allocate enough memory, whereas
+			 * non-uniform split can handle ENOMEM.
+			 */
+			if (uniform_split)
+				xas_split(xas, folio, old_order);
+			else {
+				xas_set_order(xas, folio->index, split_order);
+				xas_try_split(xas, folio, old_order);
+				if (xas_error(xas)) {
+					ret = xas_error(xas);
+					stop_split = true;
+					goto after_split;
+				}
+			}
+		}
+
+		/*
+		 * Reset any memcg data overlay in the tail pages.
+		 * folio_nr_pages() is unreliable until prep_compound_page()
+		 * was called again.
+		 */
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+		folio->_nr_pages = 0;
+#endif
+
+
+		/* complete memcg works before add pages to LRU */
+		split_page_memcg(&folio->page, old_order, split_order);
+		split_page_owner(&folio->page, old_order, split_order);
+		pgalloc_tag_split(folio, old_order, split_order);
+
+		__split_folio_to_order(folio, old_order, split_order);
+
+after_split:
+		/*
+		 * Iterate through after-split folios and perform related
+		 * operations. But in buddy allocator like split, the folio
+		 * containing the specified page is skipped until its order
+		 * is new_order, since the folio will be worked on in next
+		 * iteration.
+		 */
+		for (release = folio; release != end_folio; release = next) {
+			next = folio_next(release);
+			/*
+			 * for buddy allocator like split, the folio containing
+			 * page will be split next and should not be released,
+			 * until the folio's order is new_order or stop_split
+			 * is set to true by the above xas_split() failure.
+			 */
+			if (release == page_folio(split_at)) {
+				folio = release;
+				if (split_order != new_order && !stop_split)
+					continue;
+			}
+			if (folio_test_anon(release)) {
+				mod_mthp_stat(folio_order(release),
+						MTHP_STAT_NR_ANON, 1);
+			}
+
+			/*
+			 * Unfreeze refcount first. Additional reference from
+			 * page cache.
+			 */
+			folio_ref_unfreeze(release,
+				1 + ((!folio_test_anon(origin_folio) ||
+				     folio_test_swapcache(origin_folio)) ?
+					     folio_nr_pages(release) : 0));
+
+			if (release == origin_folio)
+				continue;
+
+			lru_add_page_tail(origin_folio, &release->page,
+						lruvec, list);
+
+			/* Some pages can be beyond EOF: drop them from cache */
+			if (release->index >= end) {
+				if (shmem_mapping(mapping))
+					nr_dropped += folio_nr_pages(release);
+				else if (folio_test_clear_dirty(release))
+					folio_account_cleaned(release,
+						inode_to_wb(mapping->host));
+				__filemap_remove_folio(release, NULL);
+				folio_put(release);
+			} else if (mapping) {
+				__xa_store(&mapping->i_pages,
+						release->index, release, 0);
+			} else if (swap_cache) {
+				__xa_store(&swap_cache->i_pages,
+						swap_cache_index(release->swap),
+						release, 0);
+			}
+		}
+	}
+
+	unlock_page_lruvec(lruvec);
+
+	if (swap_cache)
+		xa_unlock(&swap_cache->i_pages);
+	if (mapping)
+		xa_unlock(&mapping->i_pages);
+
+	/* Caller disabled irqs, so they are still disabled here */
+	local_irq_enable();
+
+	if (nr_dropped)
+		shmem_uncharge(mapping->host, nr_dropped);
+
+	remap_page(origin_folio, 1 << order,
+			folio_test_anon(origin_folio) ?
+				RMP_USE_SHARED_ZEROPAGE : 0);
+
+	/*
+	 * At this point, folio should contain the specified page.
+	 * For uniform split, it is left for caller to unlock.
+	 * For buddy allocator like split, the first after-split folio is left
+	 * for caller to unlock.
+	 */
+	for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) {
+		next = folio_next(new_folio);
+		if (new_folio == page_folio(lock_at))
+			continue;
+
+		folio_unlock(new_folio);
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		free_page_and_swap_cache(&new_folio->page);
+	}
+	return ret;
+}
+
 /*
  * This function splits a large folio into smaller folios of order @new_order.
  * @page can point to any page of the large folio to split. The split operation