Message ID | 20250205031417.1771278-1-ziy@nvidia.com (mailing list archive)
---|---
Series | Buddy allocator like (or non-uniform) folio split
On Tue, 4 Feb 2025 22:14:10 -0500 Zi Yan <ziy@nvidia.com> wrote:

> This patchset adds a new buddy allocator like (or non-uniform) large folio
> split to reduce the total number of after-split folios, the amount of memory
> needed for multi-index xarray split, and keep more large folios after a split.

It would be useful (vital, really) to provide some measurements which
help others understand the magnitude of these resource savings, please.
On 6 Feb 2025, at 3:01, Andrew Morton wrote:

> On Tue, 4 Feb 2025 22:14:10 -0500 Zi Yan <ziy@nvidia.com> wrote:
>
>> This patchset adds a new buddy allocator like (or non-uniform) large folio
>> split to reduce the total number of after-split folios, the amount of memory
>> needed for multi-index xarray split, and keep more large folios after a split.
>
> It would be useful (vital, really) to provide some measurements which
> help others understand the magnitude of these resource savings, please.

Hi Andrew,

Can you please drop this series for now? After your request above, I found
that I had misunderstood how xas_split_alloc() and xas_split() work in
xarray, so my current implementation allocates more xa_nodes than necessary
during a non-uniform split, although the excess ones are freed at the end.
That defeats the purpose of reducing the memory consumption of multi-index
xarray split, even though folio_split() has no functional issue AFAICT.
I am working on a better implementation that might require new xarray
operations and will post it as v7 later. I really appreciate that you
asked for more info above. :)

More details on the memory saving for multi-index xarray split during a
non-uniform split, compared to the existing uniform split (I will add this
to the commit log in the next version):

The existing uniform split requires 2^(order % XA_CHUNK_SHIFT) xa_node
allocations when the folio needs to be split to order-0, whereas a
non-uniform split requires at most 1 xa_node allocation. For example, to
split an order-9 folio, 8 xa_nodes are needed for the uniform split, since
the folio occupies 8 multi-index slots in the xarray. For the non-uniform
split, only the slot containing the given struct page needs an xa_node
after the split, saving 7 xa_nodes.

Hi Matthew,

Do you mind checking my statement above on the xarray memory saving? And
correct me if I missed anything. Thanks.

Best Regards,
Yan, Zi
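[For readers following along, Zi Yan's xa_node arithmetic can be sketched as
below. This assumes XA_CHUNK_SHIFT = 6, the usual value giving 64-slot xarray
nodes; the helper names are illustrative, not kernel functions.]

```python
# Back-of-the-envelope arithmetic for the xa_node counts claimed above.
# Assumes XA_CHUNK_SHIFT = 6 (64-slot xarray nodes); illustrative only.

XA_CHUNK_SHIFT = 6

def uniform_split_nodes(order):
    """xa_nodes the existing uniform split allocates to reach order-0:
    one per multi-index slot the folio occupies, i.e.
    2^(order % XA_CHUNK_SHIFT)."""
    return 2 ** (order % XA_CHUNK_SHIFT)

def non_uniform_split_nodes(order):
    """The buddy-allocator-like split only needs a node for the single
    slot containing the target struct page: at most one."""
    return 1

order = 9
print(uniform_split_nodes(order))      # 8: the folio spans 8 multi-index slots
print(non_uniform_split_nodes(order))  # 1
print(uniform_split_nodes(order) - non_uniform_split_nodes(order))  # 7 saved
```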
On Fri, Feb 07, 2025 at 09:11:39AM -0500, Zi Yan wrote:
> Existing uniform split requires 2^(order % XA_CHUNK_SHIFT) xa_node allocations
> during split, when the folio needs to be split to order-0. But non-uniform split
> only requires at most 1 xa_node allocation. For example, to split an order-9
> folio, 8 xa_nodes are needed for uniform split, since the folio takes 8
> multi-index slots in the xarray. But for non-uniform split, only the slot
> containing the given struct page needs a xa_node after the split. There will be
> a 7 xa_node saving.
>
> Hi Matthew,
>
> Do you mind checking my statement above on xarray memory saving? And correct me
> if I miss anything. Thanks.

We currently have a bug where we can't split order-12 (or above) to
order-0 (or anything in the range 0-5), as we'd need to allocate two
layers of nodes and the preallocation can't do that.

As part of your series, I'd like to remove that limitation, so we'd need
to allocate log_64(n - m) [ok, more complex than that, but ykwim]. So
it's not quite "only allocate one node", but it's allocate O(log(current
number of nodes needed to be allocated)).

Makes sense?
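[One plausible way to count the layers Matthew refers to: an order-k entry
lives at xarray level k // XA_CHUNK_SHIFT, so splitting an order-n entry
down to order-m adds roughly one xa_node per level in between -- O(log) of
the span, not one node per slot. The sketch below assumes XA_CHUNK_SHIFT = 6
and is an illustration of the argument, not kernel code.]

```python
# Layers of new xa_nodes needed to split an order-n entry to order-m,
# assuming XA_CHUNK_SHIFT = 6 (64-slot nodes). Illustrative only.

XA_CHUNK_SHIFT = 6

def node_layers(n, m):
    """An order-k entry is stored at level k // XA_CHUNK_SHIFT, so the
    split must grow one layer of nodes per level between the two."""
    assert n >= m >= 0
    return n // XA_CHUNK_SHIFT - m // XA_CHUNK_SHIFT

# The currently broken case: order-12 to anything in 0..5 needs two
# layers, which a single preallocation cannot provide.
print(node_layers(12, 0))  # 2
print(node_layers(12, 5))  # 2
print(node_layers(12, 6))  # 1 -- splitting only to order-6 stays one layer
print(node_layers(9, 0))   # 1
```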
On 7 Feb 2025, at 9:25, Matthew Wilcox wrote:

> On Fri, Feb 07, 2025 at 09:11:39AM -0500, Zi Yan wrote:
>> Existing uniform split requires 2^(order % XA_CHUNK_SHIFT) xa_node allocations
>> during split, when the folio needs to be split to order-0. But non-uniform split
>> only requires at most 1 xa_node allocation. For example, to split an order-9
>> folio, 8 xa_nodes are needed for uniform split, since the folio takes 8
>> multi-index slots in the xarray. But for non-uniform split, only the slot
>> containing the given struct page needs a xa_node after the split. There will be
>> a 7 xa_node saving.
>>
>> Hi Matthew,
>>
>> Do you mind checking my statement above on xarray memory saving? And correct me
>> if I miss anything. Thanks.
>
> We currently have a bug where we can't split order-12 (or above) to
> order-0 (or anything in the range 0-5) as we'd need to allocate two
> layers of nodes, and the preallocation can't do that.
>
> As part of your series, I'd like to remove that limitation, so we'd need
> to allocate log_64(n - m) [ok, more complex than that, but ykwim]. So
> it's not quite "only allocate one node", but it's allocate O(log(current
> number of nodes needed to be allocated)).
>
> Makes sense?

Yes.

To remove that order-12 limitation, do shmem_split_large_entry() and
__filemap_add_folio() need some change as well? Both call xas_split_alloc(),
but I do not know whether they will ever see a split from order-12 to
order-(0 to 5).

Best Regards,
Yan, Zi
On Fri, Feb 07, 2025 at 09:35:27AM -0500, Zi Yan wrote:
> On 7 Feb 2025, at 9:25, Matthew Wilcox wrote:
> > As part of your series, I'd like to remove that limitation, so we'd need
> > to allocate log_64(n - m) [ok, more complex than that, but ykwim]. So
> > it's not quite "only allocate one node", but it's allocate O(log(current
> > number of nodes needed to be allocated)).
> >
> > Makes sense?
>
> Yes.
>
> To remove that order-12 limitation, do shmem_split_large_entry() and
> __filemap_add_folio() need some change as well? Both call xas_split_alloc().
> But I do not know if they will see splitting order-12 to order-(0 to 5).

__filemap_add_folio() doesn't need to fracture like it currently does; it
can do the same minimum split. The situation is that we've got a shadow
entry which covers 2^n slots, and now we want to add a folio which only
covers 2^m slots with m < n. Leaving n-m shadow entries in the tree with
orders ranging from m to n-1 makes more sense than the eager split.

shmem is the same, except that it's storing swap entries instead of
shadow entries.
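[The minimum split Matthew describes is the buddy-allocator pattern: carve a
2^m-slot hole out of an order-n entry by halving only along the path, leaving
n - m shadow (or swap) entries of orders m through n-1 instead of 2^(n-m)
order-m pieces. A small sketch of that counting, not kernel code:]

```python
# Orders of the entries left behind by a minimum ("buddy") split of an
# order-n shadow entry around an order-m folio. Illustrative only.

def minimum_split_orders(n, m):
    """Carving an order-m hole out of an order-n entry leaves one sibling
    entry at each order m, m+1, ..., n-1: n - m entries in total."""
    assert n > m >= 0
    return list(range(m, n))

# Minimum split of an order-9 entry down to an order-0 folio:
print(minimum_split_orders(9, 0))  # [0, 1, 2, 3, 4, 5, 6, 7, 8] -- 9 entries
# versus an eager split to uniform order-0 pieces:
print(2 ** (9 - 0))                # 512 entries
```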