Message ID | cover.1717673614.git.baolin.wang@linux.alibaba.com (mailing list archive)
---|---
Series | support large folio swap-out and swap-in for shmem
Hi Baolin, On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote: > In the following patches, shmem will support the swap out of large folios, > which means the shmem mappings may contain large order swap entries, so an > 'orders' array is added for find_get_entries() and find_lock_entries() to > obtain the order size of shmem swap entries, which will help in the release > of shmem large folio swap entries. > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > --- > mm/filemap.c | 27 +++++++++++++++++++++++++-- > mm/internal.h | 4 ++-- > mm/shmem.c | 17 +++++++++-------- > mm/truncate.c | 8 ++++---- > 4 files changed, 40 insertions(+), 16 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 37061aafd191..47fcd9ee6012 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2036,14 +2036,24 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max, > * Return: The number of entries which were found. > */ > unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, > - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices) > + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, > + int *orders) > { > XA_STATE(xas, &mapping->i_pages, *start); > struct folio *folio; > + int order; > > rcu_read_lock(); > while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) { > indices[fbatch->nr] = xas.xa_index; > + if (orders) { > + if (!xa_is_value(folio)) > + order = folio_order(folio); > + else > + order = xa_get_order(xas.xa, xas.xa_index); > + > + orders[fbatch->nr] = order; > + } > if (!folio_batch_add(fbatch, folio)) > break; > } > @@ -2056,6 +2066,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, > folio = fbatch->folios[idx]; > if (!xa_is_value(folio)) > nr = folio_nr_pages(folio); > + else if (orders) > + nr = 1 << orders[idx]; > *start = indices[idx] + nr; > } > return folio_batch_count(fbatch); > @@ -2082,10 +2094,12 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, > * Return: The number of entries which were found. 
> */ > unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, > - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices) > + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, > + int *orders) > { > XA_STATE(xas, &mapping->i_pages, *start); > struct folio *folio; > + int order; > > rcu_read_lock(); > while ((folio = find_get_entry(&xas, end, XA_PRESENT))) { > @@ -2099,9 +2113,16 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, > if (folio->mapping != mapping || > folio_test_writeback(folio)) > goto unlock; > + if (orders) > + order = folio_order(folio); > VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index), > folio); > + } else if (orders) { > + order = xa_get_order(xas.xa, xas.xa_index); > } > + > + if (orders) > + orders[fbatch->nr] = order; > indices[fbatch->nr] = xas.xa_index; > if (!folio_batch_add(fbatch, folio)) > break; > @@ -2120,6 +2141,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, > folio = fbatch->folios[idx]; > if (!xa_is_value(folio)) > nr = folio_nr_pages(folio); > + else if (orders) > + nr = 1 << orders[idx]; > *start = indices[idx] + nr; > } > return folio_batch_count(fbatch); > diff --git a/mm/internal.h b/mm/internal.h > index 3419c329b3bc..0b5adb6c33cc 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -339,9 +339,9 @@ static inline void force_page_cache_readahead(struct address_space *mapping, > } > > unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, > - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices); > + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders); > unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, > - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices); > + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders); > void filemap_free_folio(struct address_space *mapping, struct folio *folio); > int truncate_inode_folio(struct address_space *mapping, struct folio *folio); > bool truncate_inode_partial_folio(struct folio *folio, loff_t start, > diff --git a/mm/shmem.c b/mm/shmem.c > index 0ac71580decb..28ba603d87b8 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -840,14 +840,14 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap) > * Remove swap entry from page cache, free the swap and its page cache. > */ > static int shmem_free_swap(struct address_space *mapping, > - pgoff_t index, void *radswap) > + pgoff_t index, void *radswap, int order) > { > void *old; Matthew Wilcox suggested [1] returning the number of pages freed in shmem_free_swap(). [1] https://lore.kernel.org/all/ZQRf2pGWurrE0uO+@casper.infradead.org/ Which I submitted here: https://lore.kernel.org/all/20231028211518.3424020-5-da.gomez@samsung.com/ Do you agree with the suggestion? If so, could we update my patch to use free_swap_and_cache_nr() or include that here? 
> > old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0); > if (old != radswap) > return -ENOENT; > - free_swap_and_cache(radix_to_swp_entry(radswap)); > + free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order); > return 0; > } > > @@ -981,6 +981,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, > pgoff_t end = (lend + 1) >> PAGE_SHIFT; > struct folio_batch fbatch; > pgoff_t indices[PAGEVEC_SIZE]; > + int orders[PAGEVEC_SIZE]; > struct folio *folio; > bool same_folio; > long nr_swaps_freed = 0; > @@ -996,15 +997,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, > folio_batch_init(&fbatch); > index = start; > while (index < end && find_lock_entries(mapping, &index, end - 1, > - &fbatch, indices)) { > + &fbatch, indices, orders)) { > for (i = 0; i < folio_batch_count(&fbatch); i++) { > folio = fbatch.folios[i]; > > if (xa_is_value(folio)) { > if (unfalloc) > continue; > - nr_swaps_freed += !shmem_free_swap(mapping, > - indices[i], folio); > + if (!shmem_free_swap(mapping, indices[i], folio, orders[i])) > + nr_swaps_freed += 1 << orders[i]; > continue; > } > > @@ -1058,7 +1059,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, > cond_resched(); > > if (!find_get_entries(mapping, &index, end - 1, &fbatch, > - indices)) { > + indices, orders)) { > /* If all gone or hole-punch or unfalloc, we're done */ > if (index == start || end != -1) > break; > @@ -1072,12 +1073,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, > if (xa_is_value(folio)) { > if (unfalloc) > continue; > - if (shmem_free_swap(mapping, indices[i], folio)) { > + if (shmem_free_swap(mapping, indices[i], folio, orders[i])) { > /* Swap was replaced by page: retry */ > index = indices[i]; > break; > } > - nr_swaps_freed++; > + nr_swaps_freed += 1 << orders[i]; > continue; > } > > diff --git a/mm/truncate.c b/mm/truncate.c > index 5ce62a939e55..3a4bc9dba451 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -352,7 +352,7 @@ void truncate_inode_pages_range(struct address_space *mapping, > folio_batch_init(&fbatch); > index = start; > while (index < end && find_lock_entries(mapping, &index, end - 1, > - &fbatch, indices)) { > + &fbatch, indices, NULL)) { > truncate_folio_batch_exceptionals(mapping, &fbatch, indices); > for (i = 0; i < folio_batch_count(&fbatch); i++) > truncate_cleanup_folio(fbatch.folios[i]); > @@ -392,7 +392,7 @@ void truncate_inode_pages_range(struct address_space *mapping, > while (index < end) { > cond_resched(); > if (!find_get_entries(mapping, &index, end - 1, &fbatch, > - indices)) { > + indices, NULL)) { > /* If all gone from start onwards, we're done */ > if (index == start) > break; > @@ -496,7 +496,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping, > int i; > > folio_batch_init(&fbatch); > - while (find_lock_entries(mapping, &index, end, &fbatch, indices)) { > + while (find_lock_entries(mapping, &index, end, &fbatch, indices, NULL)) { > for (i = 0; i < folio_batch_count(&fbatch); i++) { > struct folio *folio = fbatch.folios[i]; > > @@ -622,7 +622,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, > > folio_batch_init(&fbatch); > index = start; > - while (find_get_entries(mapping, &index, end, &fbatch, indices)) { > + while (find_get_entries(mapping, &index, end, &fbatch, indices, NULL)) { > for (i = 0; i < folio_batch_count(&fbatch); i++) { > struct folio *folio = fbatch.folios[i]; > > -- > 2.39.3 > Daniel
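As a concrete (and untested) sketch of the direction Daniel suggests above — an assumption about the shape of such a change, not his submitted patch — shmem_free_swap() could look the order up itself and return the number of pages freed, so callers would not need the orders[] array for swap entries:

/*
 * Sketch only: return the number of pages freed (0 if the entry was
 * already replaced) instead of 0/-ENOENT, so shmem_undo_range() can add
 * the result straight into nr_swaps_freed.  Assumes the
 * free_swap_and_cache_nr() helper referenced above.
 */
static long shmem_free_swap(struct address_space *mapping,
			    pgoff_t index, void *radswap)
{
	int order = xa_get_order(&mapping->i_pages, index);
	void *old;

	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
	if (old != radswap)
		return 0;
	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
	return 1 << order;
}

With that shape a caller could simply do nr_swaps_freed += shmem_free_swap(mapping, indices[i], folio); the second caller, which currently treats a non-zero return as "swap was replaced by a page, retry", would need its test inverted.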
On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote:
> In the following patches, shmem will support the swap out of large folios,
> which means the shmem mappings may contain large order swap entries, so an
> 'orders' array is added for find_get_entries() and find_lock_entries() to
> obtain the order size of shmem swap entries, which will help in the release
> of shmem large folio swap entries.

I am not a fan.  I was hoping that 'order' would be encoded in the swap
entry, not passed as a separate parameter.

As I understand it, we currently have a free bit, or swp_to_radix_entry()
would not work.  We can use that as detailed here to encode the order in
a single bit.

https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
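To make the single-bit idea concrete, here is a small standalone sketch of one way a naturally aligned order could be packed into a swap-style value. The marker bit position and the helper names are assumptions for illustration only; this does not claim to be the exact scheme on the wiki page or an existing kernel API.

#include <assert.h>
#include <stdio.h>

/*
 * Illustrative userspace demo: an order-N entry is naturally aligned, so
 * bits 0..N-1 of its offset are known to be zero.  One spare bit marks
 * "this value carries an order"; the order itself is recovered from the
 * lowest set bit folded into those zero bits.
 */
#define LARGE_BIT	(1UL << 62)	/* hypothetical spare bit */

static unsigned long encode(unsigned long offset, unsigned int order)
{
	if (!order)
		return offset;
	/* offset is aligned to (1 << order), so bit (order - 1) is free */
	return offset | (1UL << (order - 1)) | LARGE_BIT;
}

static unsigned long decode(unsigned long val, unsigned int *order)
{
	unsigned long low;

	if (!(val & LARGE_BIT)) {
		*order = 0;
		return val;
	}
	val &= ~LARGE_BIT;
	low = val & -val;			/* lowest set bit */
	*order = __builtin_ctzl(low) + 1;
	return val & ~low;
}

int main(void)
{
	unsigned int order;
	unsigned long off = decode(encode(0x1200, 9), &order);

	assert(off == 0x1200 && order == 9);
	printf("offset %#lx, order %u\n", off, order);
	return 0;
}

Whether this matches the exact layout Matthew has in mind is not something the sketch claims; it only shows that one spare bit plus natural alignment is enough to round-trip the order without a separate parameter.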
On 2024/6/10 23:23, Daniel Gomez wrote: > Hi Baolin, > > On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote: >> In the following patches, shmem will support the swap out of large folios, >> which means the shmem mappings may contain large order swap entries, so an >> 'orders' array is added for find_get_entries() and find_lock_entries() to >> obtain the order size of shmem swap entries, which will help in the release >> of shmem large folio swap entries. >> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >> --- >> mm/filemap.c | 27 +++++++++++++++++++++++++-- >> mm/internal.h | 4 ++-- >> mm/shmem.c | 17 +++++++++-------- >> mm/truncate.c | 8 ++++---- >> 4 files changed, 40 insertions(+), 16 deletions(-) >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index 37061aafd191..47fcd9ee6012 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -2036,14 +2036,24 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max, >> * Return: The number of entries which were found. >> */ >> unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, >> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices) >> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, >> + int *orders) >> { >> XA_STATE(xas, &mapping->i_pages, *start); >> struct folio *folio; >> + int order; >> >> rcu_read_lock(); >> while ((folio = find_get_entry(&xas, end, XA_PRESENT)) != NULL) { >> indices[fbatch->nr] = xas.xa_index; >> + if (orders) { >> + if (!xa_is_value(folio)) >> + order = folio_order(folio); >> + else >> + order = xa_get_order(xas.xa, xas.xa_index); >> + >> + orders[fbatch->nr] = order; >> + } >> if (!folio_batch_add(fbatch, folio)) >> break; >> } >> @@ -2056,6 +2066,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, >> folio = fbatch->folios[idx]; >> if (!xa_is_value(folio)) >> nr = folio_nr_pages(folio); >> + else if (orders) >> + nr = 1 << orders[idx]; >> *start = indices[idx] + nr; >> } >> return folio_batch_count(fbatch); >> @@ -2082,10 +2094,12 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, >> * Return: The number of entries which were found. 
>> */ >> unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, >> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices) >> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, >> + int *orders) >> { >> XA_STATE(xas, &mapping->i_pages, *start); >> struct folio *folio; >> + int order; >> >> rcu_read_lock(); >> while ((folio = find_get_entry(&xas, end, XA_PRESENT))) { >> @@ -2099,9 +2113,16 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, >> if (folio->mapping != mapping || >> folio_test_writeback(folio)) >> goto unlock; >> + if (orders) >> + order = folio_order(folio); >> VM_BUG_ON_FOLIO(!folio_contains(folio, xas.xa_index), >> folio); >> + } else if (orders) { >> + order = xa_get_order(xas.xa, xas.xa_index); >> } >> + >> + if (orders) >> + orders[fbatch->nr] = order; >> indices[fbatch->nr] = xas.xa_index; >> if (!folio_batch_add(fbatch, folio)) >> break; >> @@ -2120,6 +2141,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, >> folio = fbatch->folios[idx]; >> if (!xa_is_value(folio)) >> nr = folio_nr_pages(folio); >> + else if (orders) >> + nr = 1 << orders[idx]; >> *start = indices[idx] + nr; >> } >> return folio_batch_count(fbatch); >> diff --git a/mm/internal.h b/mm/internal.h >> index 3419c329b3bc..0b5adb6c33cc 100644 >> --- a/mm/internal.h >> +++ b/mm/internal.h >> @@ -339,9 +339,9 @@ static inline void force_page_cache_readahead(struct address_space *mapping, >> } >> >> unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, >> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices); >> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders); >> unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, >> - pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices); >> + pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices, int *orders); >> void filemap_free_folio(struct address_space *mapping, struct folio *folio); >> int truncate_inode_folio(struct address_space *mapping, struct folio *folio); >> bool truncate_inode_partial_folio(struct folio *folio, loff_t start, >> diff --git a/mm/shmem.c b/mm/shmem.c >> index 0ac71580decb..28ba603d87b8 100644 >> --- a/mm/shmem.c >> +++ b/mm/shmem.c >> @@ -840,14 +840,14 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap) >> * Remove swap entry from page cache, free the swap and its page cache. >> */ >> static int shmem_free_swap(struct address_space *mapping, >> - pgoff_t index, void *radswap) >> + pgoff_t index, void *radswap, int order) >> { >> void *old; > > Matthew Wilcox suggested [1] returning the number of pages freed in shmem_free_swap(). > > [1] https://lore.kernel.org/all/ZQRf2pGWurrE0uO+@casper.infradead.org/ > > Which I submitted here: > https://lore.kernel.org/all/20231028211518.3424020-5-da.gomez@samsung.com/ > > Do you agree with the suggestion? If so, could we update my patch to use > free_swap_and_cache_nr() or include that here? Yes, this looks good to me. But we still need some modification for find_lock_entries() and find_get_entries() to update the '*start' correctly. I will include your changes into this patch in next version. Thanks.
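A rough illustration of the '*start' adjustment mentioned above, sketched without the extra orders[] array by re-deriving the order of a trailing value entry from the tree. This is only a sketch of the point being made (the entry can of course change once the RCU read lock is dropped), not the final code:

/* tail of find_get_entries(), sketch only */
if (folio_batch_count(fbatch)) {
	unsigned long nr;
	int idx = folio_batch_count(fbatch) - 1;

	folio = fbatch->folios[idx];
	if (!xa_is_value(folio))
		nr = folio_nr_pages(folio);
	else
		nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
	*start = indices[idx] + nr;
}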
On 2024/6/11 00:59, Matthew Wilcox wrote:
> On Thu, Jun 06, 2024 at 07:58:55PM +0800, Baolin Wang wrote:
>> In the following patches, shmem will support the swap out of large folios,
>> which means the shmem mappings may contain large order swap entries, so an
>> 'orders' array is added for find_get_entries() and find_lock_entries() to
>> obtain the order size of shmem swap entries, which will help in the release
>> of shmem large folio swap entries.
>
> I am not a fan.

With Daniel's suggestion, I think I can drop the 'order' parameter if you
don't like it.

> I was hoping that 'order' would be encoded in the swap
> entry, not passed as a separate parameter.
>
> As I understand it, we currently have a free bit, or
> swp_to_radix_entry() would not work. We can use that as detailed
> here to encode the order in a single bit.
>
> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder

OK. This seems to deserve a separate patch set. I will look at your
suggestion in detail. Thanks.
On Thu, 6 Jun 2024, Baolin Wang wrote:

> Shmem will support large folio allocation [1] [2] to get a better performance,
> however, the memory reclaim still splits the precious large folios when trying
> to swap-out shmem, which may lead to the memory fragmentation issue and can not
> take advantage of the large folio for shmem.
>
> Moreover, the swap code already supports for swapping out large folio without
> split, and large folio swap-in[3] series is queued into mm-unstable branch.
> Hence this patch set also supports the large folio swap-out and swap-in for
> shmem.
>
> [1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
> [2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
> [3] https://lore.kernel.org/all/20240508224040.190469-6-21cnbao@gmail.com/T/
>
> Changes from RFC:
>  - Rebased to the latest mm-unstable.
>  - Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
>    branch.
>
> Baolin Wang (7):
>   mm: vmscan: add validation before spliting shmem large folio
>   mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
>     flag setting
>   mm: shmem: support large folio allocation for shmem_replace_folio()
>   mm: shmem: extend shmem_partial_swap_usage() to support large folio
>     swap
>   mm: add new 'orders' parameter for find_get_entries() and
>     find_lock_entries()
>   mm: shmem: use swap_free_nr() to free shmem swap entries
>   mm: shmem: support large folio swap out
>
>  drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
>  include/linux/swap.h | 4 +-
>  include/linux/writeback.h | 1 +
>  mm/filemap.c | 27 ++++++-
>  mm/internal.h | 4 +-
>  mm/shmem.c | 58 ++++++++------
>  mm/swapfile.c | 98 ++++++++++++-----------
>  mm/truncate.c | 8 +-
>  mm/vmscan.c | 22 ++++-
>  9 files changed, 140 insertions(+), 83 deletions(-)

I wanted to have some tests running, while looking through these
and your shmem mTHP patches; but I wasted too much time on that by
applying these on top and hitting crash, OOMs and dreadful thrashing -
testing did not get very far at all.

Perhaps all easily fixed, but I don't have more time to spend on it,
and think this series cannot expect to go into 6.11: I'll have another
try with it next cycle.

I really must turn my attention to your shmem mTHP series: no doubt
I'll have minor adjustments to ask there - but several other people
are also waiting for me to respond (or given up on me completely).

The little crash fix needed in this series appears to be:

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2053,7 +2053,8 @@ static int shmem_swapin_folio(struct ino
 			goto failed;
 	}

-	error = shmem_add_to_page_cache(folio, mapping, index,
+	error = shmem_add_to_page_cache(folio, mapping,
+					round_down(index, nr_pages),
 					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;

Then the OOMs and dreadful thrashing are due to refcount confusion:
I did not even glance at these patches to work out what's wanted,
but a printk in __remove_mapping() showed that folio->_refcount was
1024 where 513 was expected, so reclaim was freeing none of them.

Hugh
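For readers following the numbers, a minimal sketch of the arithmetic behind "513 was expected", under the assumption that __remove_mapping() freezes the reference count at 1 + folio_nr_pages(); the helper name below is made up purely for illustration:

/*
 * Sketch of the assumed accounting (not quoted from the thread): an
 * order-9 shmem folio isolated by reclaim should hold 512 page-cache
 * references plus the single isolation reference, i.e. 513.
 */
static inline unsigned int expected_isolated_refcount(unsigned int nr_pages)
{
	return nr_pages	/* one reference per page-cache slot */
	       + 1;	/* the reclaim isolation reference   */
}

If that assumption holds, the observed 1024 is roughly one extra, never-dropped reference per page picked up somewhere on the large-folio swap path.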
Hi Hugh,

On 2024/6/12 13:46, Hugh Dickins wrote:
> On Thu, 6 Jun 2024, Baolin Wang wrote:
>
>> Shmem will support large folio allocation [1] [2] to get a better performance,
>> however, the memory reclaim still splits the precious large folios when trying
>> to swap-out shmem, which may lead to the memory fragmentation issue and can not
>> take advantage of the large folio for shmem.
>>
>> Moreover, the swap code already supports for swapping out large folio without
>> split, and large folio swap-in[3] series is queued into mm-unstable branch.
>> Hence this patch set also supports the large folio swap-out and swap-in for
>> shmem.
>>
>> [1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
>> [2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/
>> [3] https://lore.kernel.org/all/20240508224040.190469-6-21cnbao@gmail.com/T/
>>
>> Changes from RFC:
>>  - Rebased to the latest mm-unstable.
>>  - Drop the counter name fixing patch, which was queued into mm-hotfixes-stable
>>    branch.
>>
>> Baolin Wang (7):
>>   mm: vmscan: add validation before spliting shmem large folio
>>   mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM
>>     flag setting
>>   mm: shmem: support large folio allocation for shmem_replace_folio()
>>   mm: shmem: extend shmem_partial_swap_usage() to support large folio
>>     swap
>>   mm: add new 'orders' parameter for find_get_entries() and
>>     find_lock_entries()
>>   mm: shmem: use swap_free_nr() to free shmem swap entries
>>   mm: shmem: support large folio swap out
>>
>>  drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 1 +
>>  include/linux/swap.h | 4 +-
>>  include/linux/writeback.h | 1 +
>>  mm/filemap.c | 27 ++++++-
>>  mm/internal.h | 4 +-
>>  mm/shmem.c | 58 ++++++++------
>>  mm/swapfile.c | 98 ++++++++++++-----------
>>  mm/truncate.c | 8 +-
>>  mm/vmscan.c | 22 ++++-
>>  9 files changed, 140 insertions(+), 83 deletions(-)
>
> I wanted to have some tests running, while looking through these
> and your shmem mTHP patches; but I wasted too much time on that by
> applying these on top and hitting crash, OOMs and dreadful thrashing -
> testing did not get very far at all.

Thanks for testing. I am sorry I did not find these issues in my own testing.

> Perhaps all easily fixed, but I don't have more time to spend on it,
> and think this series cannot expect to go into 6.11: I'll have another
> try with it next cycle.
>
> I really must turn my attention to your shmem mTHP series: no doubt
> I'll have minor adjustments to ask there - but several other people
> are also waiting for me to respond (or given up on me completely).

Sure. Thanks.

> The little crash fix needed in this series appears to be:
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2053,7 +2053,8 @@ static int shmem_swapin_folio(struct ino
>  			goto failed;
>  	}
>
> -	error = shmem_add_to_page_cache(folio, mapping, index,
> +	error = shmem_add_to_page_cache(folio, mapping,
> +					round_down(index, nr_pages),
>  					swp_to_radix_entry(swap), gfp);
>  	if (error)
>  		goto failed;

Good catch. I missed this.

> Then the OOMs and dreadful thrashing are due to refcount confusion:
> I did not even glance at these patches to work out what's wanted,
> but a printk in __remove_mapping() showed that folio->_refcount was
> 1024 where 513 was expected, so reclaim was freeing none of them.

I will look into this issue and continue to do more testing before sending
out a new version. Thanks.
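To close out the round_down() fix discussed above, a tiny standalone illustration (with made-up values) of why a large folio has to be inserted at its naturally aligned base index rather than at the raw faulting index:

#include <stdio.h>

int main(void)
{
	unsigned long index = 515;		/* faulting page index */
	unsigned long nr_pages = 512;		/* order-9 folio       */
	unsigned long base = index & ~(nr_pages - 1);	/* round_down() */

	/* the folio owns page-cache slots [base, base + nr_pages) */
	printf("fault at %lu -> insert at %lu, covers [%lu, %lu)\n",
	       index, base, base, base + nr_pages);
	return 0;
}

Inserting at the raw index would shift every page of the folio by (index - base) slots in the mapping, which is consistent with the crash Hugh reports.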