Message ID | 20250205031417.1771278-1-ziy@nvidia.com (mailing list archive)
---|---
Series | Buddy allocator like (or non-uniform) folio split
On Tue, 4 Feb 2025 22:14:10 -0500 Zi Yan <ziy@nvidia.com> wrote:

> This patchset adds a new buddy allocator like (or non-uniform) large folio
> split to reduce the total number of after-split folios, the amount of memory
> needed for multi-index xarray split, and keep more large folios after a split.

It would be useful (vital, really) to provide some measurements which
help others understand the magnitude of these resource savings, please.
On 6 Feb 2025, at 3:01, Andrew Morton wrote:

> On Tue, 4 Feb 2025 22:14:10 -0500 Zi Yan <ziy@nvidia.com> wrote:
>
>> This patchset adds a new buddy allocator like (or non-uniform) large folio
>> split to reduce the total number of after-split folios, the amount of memory
>> needed for multi-index xarray split, and keep more large folios after a split.
>
> It would be useful (vital, really) to provide some measurements which
> help others understand the magnitude of these resource savings, please.

Hi Andrew,

Can you please drop this series for now? After your request above, I found
that I had misunderstood how xas_split_alloc() and xas_split() work in
xarray, so my current implementation allocates more xa_nodes than necessary
during a non-uniform split, although the excess ones are freed at the end.
That defeats the purpose of reducing the memory consumption of multi-index
xarray split, even though folio_split() has no functional issue AFAICT.
I am working on a better implementation that might require new xarray
operations and will post it as v7 later. I really appreciate that you
asked for more info above. :)

More details on the memory saving for multi-index xarray split during a
non-uniform split, compared to the existing uniform split (I will add this
to the commit log in the next version):

The existing uniform split requires 2^(order % XA_CHUNK_SHIFT) xa_node
allocations when the folio needs to be split to order-0, whereas a
non-uniform split requires at most 1 xa_node allocation. For example, to
split an order-9 folio, 8 xa_nodes are needed for the uniform split, since
the folio occupies 8 multi-index slots in the xarray. For the non-uniform
split, only the slot containing the given struct page needs an xa_node
after the split, saving 7 xa_nodes.

Hi Matthew,

Do you mind checking my statement above on the xarray memory saving? And
correct me if I missed anything. Thanks.

Best Regards,
Yan, Zi
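[For readers following along, Zi Yan's xa_node arithmetic can be sketched as
below. This assumes XA_CHUNK_SHIFT = 6, the usual value giving 64-slot xarray
nodes; the helper names are illustrative, not kernel functions.]

```python
# Back-of-the-envelope arithmetic for the xa_node counts claimed above.
# Assumes XA_CHUNK_SHIFT = 6 (64-slot xarray nodes); illustrative only.

XA_CHUNK_SHIFT = 6

def uniform_split_nodes(order):
    """xa_nodes the existing uniform split allocates to reach order-0:
    one per multi-index slot the folio occupies, i.e.
    2^(order % XA_CHUNK_SHIFT)."""
    return 2 ** (order % XA_CHUNK_SHIFT)

def non_uniform_split_nodes(order):
    """The buddy-allocator-like split only needs a node for the single
    slot containing the target struct page: at most one."""
    return 1

order = 9
print(uniform_split_nodes(order))      # 8: the folio spans 8 multi-index slots
print(non_uniform_split_nodes(order))  # 1
print(uniform_split_nodes(order) - non_uniform_split_nodes(order))  # 7 saved
```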
On Fri, Feb 07, 2025 at 09:11:39AM -0500, Zi Yan wrote:
> Existing uniform split requires 2^(order % XA_CHUNK_SHIFT) xa_node allocations
> during split, when the folio needs to be split to order-0. But non-uniform split
> only requires at most 1 xa_node allocation. For example, to split an order-9
> folio, 8 xa_nodes are needed for uniform split, since the folio takes 8
> multi-index slots in the xarray. But for non-uniform split, only the slot
> containing the given struct page needs a xa_node after the split. There will be
> a 7 xa_node saving.
>
> Hi Matthew,
>
> Do you mind checking my statement above on xarray memory saving? And correct me
> if I miss anything. Thanks.

We currently have a bug where we can't split order-12 (or above) to
order-0 (or anything in the range 0-5), as we'd need to allocate two
layers of nodes and the preallocation can't do that.

As part of your series, I'd like to remove that limitation, so we'd need
to allocate log_64(n - m) [ok, more complex than that, but ykwim]. So
it's not quite "only allocate one node", but it's allocate O(log(current
number of nodes needed to be allocated)).

Makes sense?
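[One plausible way to count the layers Matthew refers to: an order-k entry
lives at xarray level k // XA_CHUNK_SHIFT, so splitting an order-n entry
down to order-m adds roughly one xa_node per level in between -- O(log) of
the span, not one node per slot. The sketch below assumes XA_CHUNK_SHIFT = 6
and is an illustration of the argument, not kernel code.]

```python
# Layers of new xa_nodes needed to split an order-n entry to order-m,
# assuming XA_CHUNK_SHIFT = 6 (64-slot nodes). Illustrative only.

XA_CHUNK_SHIFT = 6

def node_layers(n, m):
    """An order-k entry is stored at level k // XA_CHUNK_SHIFT, so the
    split must grow one layer of nodes per level between the two."""
    assert n >= m >= 0
    return n // XA_CHUNK_SHIFT - m // XA_CHUNK_SHIFT

# The currently broken case: order-12 to anything in 0..5 needs two
# layers, which a single preallocation cannot provide.
print(node_layers(12, 0))  # 2
print(node_layers(12, 5))  # 2
print(node_layers(12, 6))  # 1 -- splitting only to order-6 stays one layer
print(node_layers(9, 0))   # 1
```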
On 7 Feb 2025, at 9:25, Matthew Wilcox wrote:

> On Fri, Feb 07, 2025 at 09:11:39AM -0500, Zi Yan wrote:
>> Existing uniform split requires 2^(order % XA_CHUNK_SHIFT) xa_node allocations
>> during split, when the folio needs to be split to order-0. But non-uniform split
>> only requires at most 1 xa_node allocation. For example, to split an order-9
>> folio, 8 xa_nodes are needed for uniform split, since the folio takes 8
>> multi-index slots in the xarray. But for non-uniform split, only the slot
>> containing the given struct page needs a xa_node after the split. There will be
>> a 7 xa_node saving.
>>
>> Hi Matthew,
>>
>> Do you mind checking my statement above on xarray memory saving? And correct me
>> if I miss anything. Thanks.
>
> We currently have a bug where we can't split order-12 (or above) to
> order-0 (or anything in the range 0-5) as we'd need to allocate two
> layers of nodes, and the preallocation can't do that.
>
> As part of your series, I'd like to remove that limitation, so we'd need
> to allocate log_64(n - m) [ok, more complex than that, but ykwim]. So
> it's not quite "only allocate one node", but it's allocate O(log(current
> number of nodes needed to be allocated)).
>
> Makes sense?

Yes.

To remove that order-12 limitation, do shmem_split_large_entry() and
__filemap_add_folio() need some change as well? Both call xas_split_alloc(),
but I do not know whether they will ever see a split from order-12 to
order-(0 to 5).

Best Regards,
Yan, Zi
On Fri, Feb 07, 2025 at 09:35:27AM -0500, Zi Yan wrote:
> On 7 Feb 2025, at 9:25, Matthew Wilcox wrote:
> > As part of your series, I'd like to remove that limitation, so we'd need
> > to allocate log_64(n - m) [ok, more complex than that, but ykwim]. So
> > it's not quite "only allocate one node", but it's allocate O(log(current
> > number of nodes needed to be allocated)).
> >
> > Makes sense?
>
> Yes.
>
> To remove that order-12 limitation, do shmem_split_large_entry() and
> __filemap_add_folio() need some change as well? Both call xas_split_alloc().
> But I do not know if they will see splitting order-12 to order-(0 to 5).

__filemap_add_folio() doesn't need to fracture like it currently does; it
can do the same minimum split. The situation is that we've got a shadow
entry which covers 2^n slots, and now we want to add a folio which only
covers 2^m slots with m < n. Leaving n-m shadow entries in the tree with
orders ranging from m to n-1 makes more sense than the eager split.

shmem is the same, except that it's storing swap entries instead of
shadow entries.
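[The minimum split Matthew describes is the buddy-allocator pattern: carve a
2^m-slot hole out of an order-n entry by halving only along the path, leaving
n - m shadow (or swap) entries of orders m through n-1 instead of 2^(n-m)
order-m pieces. A small sketch of that counting, not kernel code:]

```python
# Orders of the entries left behind by a minimum ("buddy") split of an
# order-n shadow entry around an order-m folio. Illustrative only.

def minimum_split_orders(n, m):
    """Carving an order-m hole out of an order-n entry leaves one sibling
    entry at each order m, m+1, ..., n-1: n - m entries in total."""
    assert n > m >= 0
    return list(range(m, n))

# Minimum split of an order-9 entry down to an order-0 folio:
print(minimum_split_orders(9, 0))  # [0, 1, 2, 3, 4, 5, 6, 7, 8] -- 9 entries
# versus an eager split to uniform order-0 pieces:
print(2 ** (9 - 0))                # 512 entries
```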