Message ID | CALf+9Yc3WjbG89uRWiDC_qYshJ5z_9WCrbEJe42Vbv+gJjs26g@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Series | mm: Optimize TLB flushes during page reclaim |
Sorry, the previous patch was unreadable due to damaged whitespace. Here is the same patch with fixed indentation. Signed-off-by: Vinay Banakar <vny@google.com> --- mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- 1 file changed, 74 insertions(+), 33 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index bd489c1af..1bd510622 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, struct folio_batch free_folios; LIST_HEAD(ret_folios); LIST_HEAD(demote_folios); + LIST_HEAD(pageout_list); unsigned int nr_reclaimed = 0; unsigned int pgactivate = 0; bool do_demote_pass; @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (!sc->may_writepage) goto keep_locked; - /* - * Folio is dirty. Flush the TLB if a writable entry - * potentially exists to avoid CPU writes after I/O - * starts and then write it out here. - */ - try_to_unmap_flush_dirty(); - switch (pageout(folio, mapping, &plug)) { - case PAGE_KEEP: - goto keep_locked; - case PAGE_ACTIVATE: - goto activate_locked; - case PAGE_SUCCESS: - stat->nr_pageout += nr_pages; - - if (folio_test_writeback(folio)) - goto keep; - if (folio_test_dirty(folio)) - goto keep; - - /* - * A synchronous write - probably a ramdisk. Go - * ahead and try to reclaim the folio. - */ - if (!folio_trylock(folio)) - goto keep; - if (folio_test_dirty(folio) || - folio_test_writeback(folio)) - goto keep_locked; - mapping = folio_mapping(folio); - fallthrough; - case PAGE_CLEAN: - ; /* try to free the folio below */ - } + /* Add to pageout list for defered bio submissions */ + list_add(&folio->lru, &pageout_list); + continue; } /* @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, } /* 'folio_list' is always empty here */ + if (!list_empty(&pageout_list)) { + /* + * Batch TLB flushes by flushing once before processing all dirty pages. + * Since we operate on one PMD at a time, this batches TLB flushes at + * PMD granularity rather than per-page, reducing IPIs. + */ + struct address_space *mapping; + try_to_unmap_flush_dirty(); + + while (!list_empty(&pageout_list)) { + struct folio *folio = lru_to_folio(&pageout_list); + list_del(&folio->lru); + + /* Recheck if page got reactivated */ + if (folio_test_active(folio) || + (folio_mapped(folio) && folio_test_young(folio))) + goto skip_pageout_locked; + + mapping = folio_mapping(folio); + pageout_t pageout_res = pageout(folio, mapping, &plug); + switch (pageout_res) { + case PAGE_KEEP: + goto skip_pageout_locked; + case PAGE_ACTIVATE: + goto skip_pageout_locked; + case PAGE_SUCCESS: + stat->nr_pageout += folio_nr_pages(folio); + + if (folio_test_writeback(folio) || + folio_test_dirty(folio)) + goto skip_pageout; + + /* + * A synchronous write - probably a ramdisk. Go + * ahead and try to reclaim the folio. 
+ */ + if (!folio_trylock(folio)) + goto skip_pageout; + if (folio_test_dirty(folio) || + folio_test_writeback(folio)) + goto skip_pageout_locked; + + // Try to free the page + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + + case PAGE_CLEAN: + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + } + +skip_pageout_locked: + folio_unlock(folio); +skip_pageout: + list_add(&folio->lru, &ret_folios); + } + } + /* Migrate folios selected for demotion */ nr_reclaimed += demote_folio_list(&demote_folios, pgdat); /* Folios that could not be demoted are still in @demote_folios */ On Mon, Jan 20, 2025 at 4:47 PM Vinay Banakar <vny@google.com> wrote: > > The current implementation in shrink_folio_list() performs full TLB > flushes and issues IPIs for each individual page being reclaimed. This > causes unnecessary overhead during memory reclaim, whether triggered > by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where > applications are actively moving cold pages to swap while maintaining > high performance requirements for hot pages. > > The current code: > 1. Clears PTE and unmaps each page individually > 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or > issues individual TLB shootdowns (invlpg+invlpcid) for single-core usage > 3. Submits each page individually to BIO > > This approach results in: > - Excessive full TLB flushes across all cores > - Unnecessary IPI storms when processing multiple pages > - Suboptimal I/O submission patterns > > I initially tried using selective TLB shootdowns (invlpg) instead of > full TLB flushes per each page to avoid interference with other > threads. However, this approach still required sending IPIs to all > cores for each page, which did not significantly improve application > throughput. > > This patch instead optimizes the process by batching operations, > issuing one IPI per PMD instead of per page. This reduces interrupts > by a factor of 512 and enables batching page submissions to BIO. The > new approach: > 1. Collect dirty pages that need to be written back > 2. Issue a single TLB flush for all dirty pages in the batch > 3. Process the collected pages for writebacks (submit to BIO) > > Testing shows significant reduction in application throughput impact > during page-out operations. Applications maintain better performance > during memory reclaim, when triggered by explicit > madvise(MADV_PAGEOUT) calls. > > I'd appreciate your feedback on this approach, especially on the > correctness of batched BIO submissions. Looking forward to your > comments. 
On Mon, Jan 20, 2025 at 04:47:29PM -0600, Vinay Banakar wrote:
> The current implementation in shrink_folio_list() performs full TLB
> flushes and issues IPIs for each individual page being reclaimed. This
> causes unnecessary overhead during memory reclaim, whether triggered
> by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where
> applications are actively moving cold pages to swap while maintaining
> high performance requirements for hot pages.
>
> The current code:
> 1. Clears PTE and unmaps each page individually
> 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or
> issues individual TLB shootdowns (invlpg+invlpcid) for single-core usage
> 3. Submits each page individually to BIO
>
> This approach results in:
> - Excessive full TLB flushes across all cores
> - Unnecessary IPI storms when processing multiple pages
> - Suboptimal I/O submission patterns
>
> I initially tried using selective TLB shootdowns (invlpg) instead of
> full TLB flushes per each page to avoid interference with other
> threads. However, this approach still required sending IPIs to all
> cores for each page, which did not significantly improve application
> throughput.
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writebacks (submit to BIO)

The *interesting* IPIs will be reduced by 1/512 at most. Can we see the
improvement number?

	Byungchul

> Testing shows significant reduction in application throughput impact
> during page-out operations. Applications maintain better performance
> during memory reclaim, when triggered by explicit
> madvise(MADV_PAGEOUT) calls.
>
> I'd appreciate your feedback on this approach, especially on the
> correctness of batched BIO submissions. Looking forward to your
> comments.
On Mon, Jan 20, 2025 at 7:44 PM Byungchul Park <byungchul@sk.com> wrote:
> The *interesting* IPIs will be reduced by 1/512 at most. Can we see the
> improvement number?
Yes, we reduce IPIs by a factor of 512 by sending one IPI (for TLB
flush) per PMD rather than per page. Since shrink_folio_list()
operates on one PMD at a time, I believe we can safely batch these
operations here.
Here's a concrete example:
When swapping out 20 GiB (5.2M pages):
- Current: Each page triggers an IPI to all cores
- With 6 cores: 31.4M total interrupts (6 cores × 5.2M pages)
- With patch: One IPI per PMD (512 pages)
- Only 10.2K IPIs required (5.2M/512)
- With 6 cores: 61.4K total interrupts
- Results in ~99% reduction in total interrupts
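For reference, the same arithmetic as a small standalone program
(illustrative only; it assumes 4 KiB base pages, 512 PTEs per PMD, and
that every TLB flush is a broadcast IPI serviced by all six cores):

#include <stdio.h>

int main(void)
{
	/* Assumptions: 4 KiB base pages, 512 PTEs per PMD, 6 cores,
	 * one broadcast IPI per TLB flush hitting every core. */
	const unsigned long long pages = (20ULL << 30) / 4096;	/* ~5.24M */
	const unsigned long long cores = 6;
	const unsigned long long per_pmd = 512;
	const unsigned long long batches = (pages + per_pmd - 1) / per_pmd;

	printf("per-page flushes: %llu IPIs, %llu total interrupts\n",
	       pages, pages * cores);
	printf("per-PMD flushes:  %llu IPIs, %llu total interrupts\n",
	       batches, batches * cores);
	return 0;
}

which reproduces the ~31.4M vs ~61.4K totals above.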
Application performance impact varies by workload, but here's a
representative test case:
- Thread 1: Continuously accesses a 2 GiB private anonymous map (64B
chunks at random offsets)
- Thread 2: Pinned to different core, uses MADV_PAGEOUT on 20 GiB
private anonymous map to swap it out to SSD
- The threads only access their respective maps.
Results:
- Without patch: Thread 1 sees ~53% throughput reduction during
swap. If there are multiple worker threads (like thread 1), the
cumulative throughput degradation will be much higher
- With patch: Thread 1 maintains normal throughput
I expect a similar application performance impact when memory reclaim
is triggered by kswapd.
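A rough userspace sketch of the reclaim side of this test (Thread 2);
the hot-path worker thread, core pinning, and throughput measurement
are omitted, and the 20 GiB size is simply the value used above:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* available since Linux 5.4 */
#endif

int main(void)
{
	const size_t len = 20UL << 30;	/* 20 GiB private anonymous map */
	char *cold = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (cold == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty every page so reclaim must write it back to swap. */
	memset(cold, 0xaa, len);

	/*
	 * Ask the kernel to reclaim the whole range; on this path
	 * shrink_folio_list() is fed one PMD's worth of pages (up to
	 * 512) at a time, which is where the batched flush helps.
	 */
	if (madvise(cold, len, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	munmap(cold, len);
	return 0;
}

Thread 1 would be a separate thread, pinned to another core, doing
random 64-byte accesses into its own 2 GiB map while this runs.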
On 21-Jan-25 5:35 AM, Vinay Banakar wrote: > Sorry, the previous patch was unreadable due to damaged whitespace. > Here is the same patch with fixed indentation. > > Signed-off-by: Vinay Banakar <vny@google.com> > --- > mm/vmscan.c | 107 > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- > 1 file changed, 74 insertions(+), 33 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index bd489c1af..1bd510622 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct > list_head *folio_list, > struct folio_batch free_folios; > LIST_HEAD(ret_folios); > LIST_HEAD(demote_folios); > + LIST_HEAD(pageout_list); > unsigned int nr_reclaimed = 0; > unsigned int pgactivate = 0; > bool do_demote_pass; > @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct > list_head *folio_list, > if (!sc->may_writepage) > goto keep_locked; > > - /* > - * Folio is dirty. Flush the TLB if a writable entry > - * potentially exists to avoid CPU writes after I/O > - * starts and then write it out here. > - */ > - try_to_unmap_flush_dirty(); > - switch (pageout(folio, mapping, &plug)) { > - case PAGE_KEEP: > - goto keep_locked; > - case PAGE_ACTIVATE: > - goto activate_locked; > - case PAGE_SUCCESS: > - stat->nr_pageout += nr_pages; > - > - if (folio_test_writeback(folio)) > - goto keep; > - if (folio_test_dirty(folio)) > - goto keep; > - > - /* > - * A synchronous write - probably a ramdisk. Go > - * ahead and try to reclaim the folio. > - */ > - if (!folio_trylock(folio)) > - goto keep; > - if (folio_test_dirty(folio) || > - folio_test_writeback(folio)) > - goto keep_locked; > - mapping = folio_mapping(folio); > - fallthrough; > - case PAGE_CLEAN: > - ; /* try to free the folio below */ > - } > + /* Add to pageout list for defered bio submissions */ > + list_add(&folio->lru, &pageout_list); > + continue; The dirty pages are collected in a list here... > } > > /* > @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct > list_head *folio_list, > } > /* 'folio_list' is always empty here */ > > + if (!list_empty(&pageout_list)) { > + /* > + * Batch TLB flushes by flushing once before processing > all dirty pages. > + * Since we operate on one PMD at a time, this batches > TLB flushes at > + * PMD granularity rather than per-page, reducing IPIs. > + */ > + struct address_space *mapping; > + try_to_unmap_flush_dirty(); and one flush request is issued for the entire list. Where is the PMD level (512) batching done? Is that implicit elsewhere in the flow? > + > + while (!list_empty(&pageout_list)) { > + struct folio *folio = lru_to_folio(&pageout_list); > + list_del(&folio->lru); > + > + /* Recheck if page got reactivated */ > + if (folio_test_active(folio) || > + (folio_mapped(folio) && folio_test_young(folio))) > + goto skip_pageout_locked; > + > + mapping = folio_mapping(folio); > + pageout_t pageout_res = pageout(folio, mapping, &plug); > + switch (pageout_res) { > + case PAGE_KEEP: > + goto skip_pageout_locked; > + case PAGE_ACTIVATE: > + goto skip_pageout_locked; > + case PAGE_SUCCESS: > + stat->nr_pageout += folio_nr_pages(folio); > + > + if (folio_test_writeback(folio) || > + folio_test_dirty(folio)) > + goto skip_pageout; > + > + /* > + * A synchronous write - probably a ramdisk. Go > + * ahead and try to reclaim the folio. 
> + */ > + if (!folio_trylock(folio)) > + goto skip_pageout; > + if (folio_test_dirty(folio) || > + folio_test_writeback(folio)) > + goto skip_pageout_locked; > + > + // Try to free the page > + if (!mapping || > + !__remove_mapping(mapping, folio, true, > + sc->target_mem_cgroup)) > + goto skip_pageout_locked; > + > + nr_reclaimed += folio_nr_pages(folio); > + folio_unlock(folio); > + continue; > + > + case PAGE_CLEAN: > + if (!mapping || > + !__remove_mapping(mapping, folio, true, > + sc->target_mem_cgroup)) > + goto skip_pageout_locked; > + > + nr_reclaimed += folio_nr_pages(folio); > + folio_unlock(folio); > + continue; > + } > + > +skip_pageout_locked: > + folio_unlock(folio); > +skip_pageout: > + list_add(&folio->lru, &ret_folios); > + } > + } > + > /* Migrate folios selected for demotion */ > nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > /* Folios that could not be demoted are still in @demote_folios */ Regards, Bharata.
On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> and one flush request is issued for the entire list. Where is the PMD
> level (512) batching done? Is that implicit elsewhere in the flow?

Yes, shrink_folio_list() operates on one PMD at a time, so
pageout_list will contain at most 512 pages during each call. Please
refer to madvise_cold_or_pageout_pte_range() in mm/madvise.c for more
details.

Thanks!
Vinay
On 22-Jan-25 4:39 PM, Vinay Banakar wrote:
> On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
>> and one flush request is issued for the entire list. Where is the PMD
>> level (512) batching done? Is that implicit elsewhere in the flow?
>
> Yes, shrink_folio_list() operates on one PMD at a time, so
> pageout_list will contain at most 512 pages during each call. Please
> refer to madvise_cold_or_pageout_pte_range() in mm/madvise.c for more
> details.

While that may be true for MADV_PAGEOUT path, does the same assumption
hold good for other paths from which shrink_folio_list() gets called?

Regards,
Bharata.
On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> While that may be true for MADV_PAGEOUT path, does the same assumption
> hold good for other paths from which shrink_folio_list() gets called?

shrink_folio_list() is called by three other functions, each with
different batching behavior:
- reclaim_clean_pages_from_list(): Doesn't do PMD batching but only
processes clean pages, so it won't take the path affected by this
patch. This is called from the contiguous memory allocator
(cma_alloc#alloc_contig_range)
- shrink_inactive_list(): Reclaims inactive pages at SWAP_CLUSTER_MAX
(default 32) at a time. With this patch, we will reduce IPIs for TLB
flushes by a factor of 32 in kswapd
- evict_folios(): In the MGLRU case, the number of pages
shrink_folio_list() processes can vary between 64 (MIN_LRU_BATCH) and
4096 (MAX_LRU_BATCH). The reduction in IPIs will vary accordingly

Thanks!
Vinay
On Wed, 22 Jan 2025 07:28:56 -0600 Vinay Banakar <vny@google.com> wrote:
> On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> > While that may be true for MADV_PAGEOUT path, does the same assumption
> > hold good for other paths from which shrink_folio_list() gets called?
>
> shrink_folio_list() is called by three other functions, each with
> different batching behavior:
> - reclaim_clean_pages_from_list(): Doesn't do PMD batching but only
> processes clean pages, so it won't take the path affected by this
> patch. This is called from the contiguous memory allocator
> (cma_alloc#alloc_contig_range)
> - shrink_inactive_list(): Reclaims inactive pages at SWAP_CLUSTER_MAX
> (default 32) at a time. With this patch, we will reduce IPIs for TLB
> flushes by a factor of 32 in kswapd
> - evict_folios(): In the MGLRU case, the number of pages
> shrink_folio_list() processes can vary between 64 (MIN_LRU_BATCH) and
> 4096 (MAX_LRU_BATCH). The reduction in IPIs will vary accordingly

damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list()
similar to madvise.c, but it isn't aware of such batching behavior. Have
you checked that path?

Thanks,
SJ

>
> Thanks!
> Vinay
On Mon, 2025-01-20 at 16:47 -0600, Vinay Banakar wrote:
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writebacks (submit to BIO)
>

I see how moving the arch_tlbbatch_flush to
between unmapping the pages, and issuing the
IO could reduce TLB flushing operations
significantly.

However, how does that lead to PMD level
operations?

I don't see any code in your patch that gathers
things at the PMD level. The code simply operates
on whatever folio size is on the list that is
being passed to shrink_folio_list()

Is the PMD level thing some MGLRU specific feature?

> > I'd appreciate your feedback on this approach, especially on the
> > correctness of batched BIO submissions. Looking forward to your
> > comments.
>

Maybe the filesystem people have more comments on
this aspect, but from a VM perspective I suspect
that doing that batch flush in one spot, and then
iterating over the pages should be fine.

My main quibble is with the changelog, and the
comment that refers to "PMD level" operations,
when the code does not appear to be doing any
PMD level coalescing.
On Wed, Jan 22, 2025 at 2:05 PM SeongJae Park <sj@kernel.org> wrote:
> damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list()
> similar to madvise.c, but it isn't aware of such batching behavior. Have
> you checked that path?

Thanks for catching this path. In damon_pa_pageout(),
shrink_folio_list() processes all pages from a single NUMA node that
were collected (filtered) from a single DAMON region (r->ar.start to
r->ar.end). This means it could be processing anywhere from 1 page up
to ULONG_MAX pages from a single node at once. With the patch, we'll
send a single IPI for TLB flush for the entire region, reducing IPIs
by a factor equal to the number of pages being reclaimed by DAMON at
once (decided by damon_reclaim_quota).

My only concern here would be the overhead of maintaining the
temporary pageout_list for batching. However, during BIO submission,
the patch checks if the folio was reactivated, so submitting to BIO in
bulk should be safe.

Another option would be to modify shrink_folio_list() to force batch
flushes for up to N pages (512) at a time, rather than relying on
callers to do the batching via folio_list.

Thanks!
Vinay
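One possible shape of the "force batch flushes for up to N pages"
option above, sketched only to make the idea concrete (PAGEOUT_BATCH_MAX,
nr_deferred and pageout_deferred_batch() are made-up names, not existing
kernel symbols):

/*
 * Hypothetical sketch, not upstream code: cap the deferred dirty-folio
 * batch inside shrink_folio_list() itself, so a single TLB flush never
 * covers more than 512 folios no matter how large the caller's list is.
 */
#define PAGEOUT_BATCH_MAX	512

		/* in the dirty-folio arm of shrink_folio_list()'s main loop */
		list_add(&folio->lru, &pageout_list);
		if (++nr_deferred >= PAGEOUT_BATCH_MAX) {
			/* one IPI for everything deferred so far ... */
			try_to_unmap_flush_dirty();
			/*
			 * ... then submit the deferred folios to BIO and move
			 * survivors to ret_folios, as in the patch's post-loop
			 * processing (made-up helper name).
			 */
			pageout_deferred_batch(&pageout_list, &ret_folios);
			nr_deferred = 0;
		}
		continue;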
On Thu, 23 Jan 2025 11:11:13 -0600 Vinay Banakar <vny@google.com> wrote:
> On Wed, Jan 22, 2025 at 2:05 PM SeongJae Park <sj@kernel.org> wrote:
> > damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list()
> > similar to madvise.c, but it isn't aware of such batching behavior. Have
> > you checked that path?
>
> Thanks for catching this path. In damon_pa_pageout(),
> shrink_folio_list() processes all pages from a single NUMA node that
> were collected (filtered) from a single DAMON region (r->ar.start to
> r->ar.end). This means it could be processing anywhere from 1 page up
> to ULONG_MAX pages from a single node at once.

Thank you Vinay. That's the same as my understanding, except that it is
not limited to a single NUMA node. A region can have any start and end
physical addresses, so it could cover memory of different NUMA nodes.

> With the patch, we'll
> send a single IPI for TLB flush for the entire region, reducing IPIs
> by a factor equal to the number of pages being reclaimed by DAMON at
> once (decided by damon_reclaim_quota).

I guess the fact that the pages could belong to different NUMA nodes
doesn't make a difference here?

> My only concern here would be the overhead of maintaining the
> temporary pageout_list for batching. However, during BIO submission,
> the patch checks if the folio was reactivated, so submitting to BIO in
> bulk should be safe.
>
> Another option would be to modify shrink_folio_list() to force batch
> flushes for up to N pages (512) at a time, rather than relying on
> callers to do the batching via folio_list.

Both sound good to me :)

Thanks,
SJ

>
> Thanks!
> Vinay
On Thu, Jan 23, 2025 at 11:11:13AM -0600, Vinay Banakar wrote:
> Another option would be to modify shrink_folio_list() to force batch
> flushes for up to N pages (512) at a time, rather than relying on
> callers to do the batching via folio_list.

If you really want to improve performance, consider converting
shrink_folio_list() to take a folio_batch. I did this for page freeing:

https://lore.kernel.org/all/20240227174254.710559-1-willy@infradead.org/

and we did in fact see a regression (due to shrinking the batch size
from 32 to 15). We then increased the batch size from 15 to 31 and saw
a 12.5% performance improvement on the page_fault2 benchmark over the
original batch size of 32. Commit 9cecde80aae0 and

https://lore.kernel.org/oe-lkp/202403151058.7048f6a8-oliver.sang@intel.com/

have some more details. I think you should be able to see a noticeable
improvement by going from a list with 512 entries on it to a batch of
31 entries.
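For readers who haven't used the interface, a minimal sketch of the
folio_batch pattern suggested above (shrink_folio_batch() is a
hypothetical name; struct folio_batch holds at most PAGEVEC_SIZE == 31
folios):

#include <linux/pagevec.h>	/* struct folio_batch, PAGEVEC_SIZE == 31 */

/*
 * Hypothetical interface sketch only; shrink_folio_batch() does not
 * exist upstream. It shows the fill/iterate pattern that a folio_batch
 * based conversion of shrink_folio_list() implies.
 */
static unsigned int shrink_folio_batch(struct folio_batch *fbatch,
				       struct pglist_data *pgdat,
				       struct scan_control *sc)
{
	unsigned int nr_reclaimed = 0;
	unsigned int i;

	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		/* placeholder for the real per-folio reclaim work */
		nr_reclaimed += folio_nr_pages(folio);
	}
	return nr_reclaimed;
}

/*
 * Caller side (sketch): after folio_batch_init(&fbatch), keep adding
 * folios; folio_batch_add() returns the slots left, so 0 means the
 * batch of 31 is full and ready to be processed.
 */
	if (!folio_batch_add(&fbatch, folio)) {
		nr_reclaimed += shrink_folio_batch(&fbatch, pgdat, sc);
		folio_batch_reinit(&fbatch);
	}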
On Wed, Jan 22, 2025 at 10:20 PM Rik van Riel <riel@surriel.com> wrote:
> I see how moving the arch_tlbbatch_flush to
> between unmapping the pages, and issuing the
> IO could reduce TLB flushing operations
> significantly.
>
> However, how does that lead to PMD level
> operations?
>
> I don't see any code in your patch that gathers
> things at the PMD level. The code simply operates
> on whatever folio size is on the list that is
> being passed to shrink_folio_list()
>
> Is the PMD level thing some MGLRU specific feature?
> My main quibble is with the changelog, and the
> comment that refers to "PMD level" operations,
> when the code does not appear to be doing any
> PMD level coalescing.

I initially assumed that all paths leading to shrink_folio_list()
submitted 512 pages at a time. Turns out this is only true for the
madvise path.
The other paths have different batch sizes:
- shrink_inactive_list(): Uses SWAP_CLUSTER_MAX (32 pages)
- evict_folios(): Varies between 64 and 4096 pages (MGLRU)
- damon_pa_pageout(): Variable size based on DAMON region
- reclaim_clean_pages_from_list(): Clean pages only, unaffected by
this patch

We have two options:
1. Keep the current logic where TLB flush batching varies by caller
2. Enforce consistent 512-page batching in shrink_folio_list() and
also convert to folio_batch as suggested by Matthew

> > I'd appreciate your feedback on this approach, especially on the
> > correctness of batched BIO submissions. Looking forward to your
> > comments.
>
> Maybe the filesystem people have more comments on
> this aspect, but from a VM perspective I suspect
> that doing that batch flush in one spot, and then
> iterating over the pages should be fine.

Thank you for commenting on this!

Vinay
On Thu, 2025-01-23 at 13:16 -0600, Vinay Banakar wrote:
>
> I initially assumed that all paths leading to shrink_folio_list()
> submitted 512 pages at a time. Turns out this is only true for the
> madvise path.
> The other paths have different batch sizes:
> - shrink_inactive_list(): Uses SWAP_CLUSTER_MAX (32 pages)
> - evict_folios(): Varies between 64 and 4096 pages (MGLRU)
> - damon_pa_pageout(): Variable size based on DAMON region
> - reclaim_clean_pages_from_list(): Clean pages only, unaffected by
> this patch
>
> We have two options:
> 1. Keep the current logic where TLB flush batching varies by caller
> 2. Enforce consistent 512-page batching in shrink_folio_list() and
> also convert to folio_batch as suggested by Matthew

My preference would be to fix one thing at a time, and simply remove
the comments that suggest everything is done at PMD level.

I'll ACK the patch if the misleading comments and changelog text are
removed or fixed up :)
diff --git a/mm/vmscan.c b/mm/vmscan.c index bd489c1af..1bd510622 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, struct folio_batch free_folios; LIST_HEAD(ret_folios); LIST_HEAD(demote_folios); + LIST_HEAD(pageout_list); unsigned int nr_reclaimed = 0; unsigned int pgactivate = 0;
The current implementation in shrink_folio_list() performs full TLB flushes and issues IPIs for each individual page being reclaimed. This causes unnecessary overhead during memory reclaim, whether triggered by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where applications are actively moving cold pages to swap while maintaining high performance requirements for hot pages. The current code: 1. Clears PTE and unmaps each page individually 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or issues individual TLB shootdowns (invlpg+invlpcid) for single-core usage 3. Submits each page individually to BIO This approach results in: - Excessive full TLB flushes across all cores - Unnecessary IPI storms when processing multiple pages - Suboptimal I/O submission patterns I initially tried using selective TLB shootdowns (invlpg) instead of full TLB flushes per each page to avoid interference with other threads. However, this approach still required sending IPIs to all cores for each page, which did not significantly improve application throughput. This patch instead optimizes the process by batching operations, issuing one IPI per PMD instead of per page. This reduces interrupts by a factor of 512 and enables batching page submissions to BIO. The new approach: 1. Collect dirty pages that need to be written back 2. Issue a single TLB flush for all dirty pages in the batch 3. Process the collected pages for writebacks (submit to BIO) Testing shows significant reduction in application throughput impact during page-out operations. Applications maintain better performance during memory reclaim, when triggered by explicit madvise(MADV_PAGEOUT) calls. I'd appreciate your feedback on this approach, especially on the correctness of batched BIO submissions. Looking forward to your comments. Signed-off-by: Vinay Banakar <vny@google.com> --- mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- 1 file changed, 74 insertions(+), 33 deletions(-) bool do_demote_pass; @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (!sc->may_writepage) goto keep_locked; - /* - * Folio is dirty. Flush the TLB if a writable entry - * potentially exists to avoid CPU writes after I/O - * starts and then write it out here. - */ - try_to_unmap_flush_dirty(); - switch (pageout(folio, mapping, &plug)) { - case PAGE_KEEP: - goto keep_locked; - case PAGE_ACTIVATE: - goto activate_locked; - case PAGE_SUCCESS: - stat->nr_pageout += nr_pages; - - if (folio_test_writeback(folio)) - goto keep; - if (folio_test_dirty(folio)) - goto keep; - - /* - * A synchronous write - probably a ramdisk. Go - * ahead and try to reclaim the folio. - */ - if (!folio_trylock(folio)) - goto keep; - if (folio_test_dirty(folio) || - folio_test_writeback(folio)) - goto keep_locked; - mapping = folio_mapping(folio); - fallthrough; - case PAGE_CLEAN: - ; /* try to free the folio below */ - } + /* Add to pageout list for defered bio submissions */ + list_add(&folio->lru, &pageout_list); + continue; } /* @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, } /* 'folio_list' is always empty here */ + if (!list_empty(&pageout_list)) { + /* + * Batch TLB flushes by flushing once before processing all dirty pages. + * Since we operate on one PMD at a time, this batches TLB flushes at + * PMD granularity rather than per-page, reducing IPIs. 
+ */ + struct address_space *mapping; + try_to_unmap_flush_dirty(); + + while (!list_empty(&pageout_list)) { + struct folio *folio = lru_to_folio(&pageout_list); + list_del(&folio->lru); + + /* Recheck if page got reactivated */ + if (folio_test_active(folio) || + (folio_mapped(folio) && folio_test_young(folio))) + goto skip_pageout_locked; + + mapping = folio_mapping(folio); + pageout_t pageout_res = pageout(folio, mapping, &plug); + switch (pageout_res) { + case PAGE_KEEP: + goto skip_pageout_locked; + case PAGE_ACTIVATE: + goto skip_pageout_locked; + case PAGE_SUCCESS: + stat->nr_pageout += folio_nr_pages(folio); + + if (folio_test_writeback(folio) || + folio_test_dirty(folio)) + goto skip_pageout; + + /* + * A synchronous write - probably a ramdisk. Go + * ahead and try to reclaim the folio. + */ + if (!folio_trylock(folio)) + goto skip_pageout; + if (folio_test_dirty(folio) || + folio_test_writeback(folio)) + goto skip_pageout_locked; + + // Try to free the page + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + + case PAGE_CLEAN: + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + } + +skip_pageout_locked: + folio_unlock(folio); +skip_pageout: + list_add(&folio->lru, &ret_folios); + } + } + /* Migrate folios selected for demotion */ nr_reclaimed += demote_folio_list(&demote_folios, pgdat); /* Folios that could not be demoted are still in @demote_folios */