Message ID | CALf+9Yc3WjbG89uRWiDC_qYshJ5z_9WCrbEJe42Vbv+gJjs26g@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Series | mm: Optimize TLB flushes during page reclaim |
Sorry, the previous patch was unreadable due to damaged whitespace. Here is the same patch with fixed indentation. Signed-off-by: Vinay Banakar <vny@google.com> --- mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- 1 file changed, 74 insertions(+), 33 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index bd489c1af..1bd510622 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, struct folio_batch free_folios; LIST_HEAD(ret_folios); LIST_HEAD(demote_folios); + LIST_HEAD(pageout_list); unsigned int nr_reclaimed = 0; unsigned int pgactivate = 0; bool do_demote_pass; @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (!sc->may_writepage) goto keep_locked; - /* - * Folio is dirty. Flush the TLB if a writable entry - * potentially exists to avoid CPU writes after I/O - * starts and then write it out here. - */ - try_to_unmap_flush_dirty(); - switch (pageout(folio, mapping, &plug)) { - case PAGE_KEEP: - goto keep_locked; - case PAGE_ACTIVATE: - goto activate_locked; - case PAGE_SUCCESS: - stat->nr_pageout += nr_pages; - - if (folio_test_writeback(folio)) - goto keep; - if (folio_test_dirty(folio)) - goto keep; - - /* - * A synchronous write - probably a ramdisk. Go - * ahead and try to reclaim the folio. - */ - if (!folio_trylock(folio)) - goto keep; - if (folio_test_dirty(folio) || - folio_test_writeback(folio)) - goto keep_locked; - mapping = folio_mapping(folio); - fallthrough; - case PAGE_CLEAN: - ; /* try to free the folio below */ - } + /* Add to pageout list for defered bio submissions */ + list_add(&folio->lru, &pageout_list); + continue; } /* @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, } /* 'folio_list' is always empty here */ + if (!list_empty(&pageout_list)) { + /* + * Batch TLB flushes by flushing once before processing all dirty pages. + * Since we operate on one PMD at a time, this batches TLB flushes at + * PMD granularity rather than per-page, reducing IPIs. + */ + struct address_space *mapping; + try_to_unmap_flush_dirty(); + + while (!list_empty(&pageout_list)) { + struct folio *folio = lru_to_folio(&pageout_list); + list_del(&folio->lru); + + /* Recheck if page got reactivated */ + if (folio_test_active(folio) || + (folio_mapped(folio) && folio_test_young(folio))) + goto skip_pageout_locked; + + mapping = folio_mapping(folio); + pageout_t pageout_res = pageout(folio, mapping, &plug); + switch (pageout_res) { + case PAGE_KEEP: + goto skip_pageout_locked; + case PAGE_ACTIVATE: + goto skip_pageout_locked; + case PAGE_SUCCESS: + stat->nr_pageout += folio_nr_pages(folio); + + if (folio_test_writeback(folio) || + folio_test_dirty(folio)) + goto skip_pageout; + + /* + * A synchronous write - probably a ramdisk. Go + * ahead and try to reclaim the folio. 
+ */ + if (!folio_trylock(folio)) + goto skip_pageout; + if (folio_test_dirty(folio) || + folio_test_writeback(folio)) + goto skip_pageout_locked; + + // Try to free the page + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + + case PAGE_CLEAN: + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + } + +skip_pageout_locked: + folio_unlock(folio); +skip_pageout: + list_add(&folio->lru, &ret_folios); + } + } + /* Migrate folios selected for demotion */ nr_reclaimed += demote_folio_list(&demote_folios, pgdat); /* Folios that could not be demoted are still in @demote_folios */ On Mon, Jan 20, 2025 at 4:47 PM Vinay Banakar <vny@google.com> wrote: > > The current implementation in shrink_folio_list() performs full TLB > flushes and issues IPIs for each individual page being reclaimed. This > causes unnecessary overhead during memory reclaim, whether triggered > by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where > applications are actively moving cold pages to swap while maintaining > high performance requirements for hot pages. > > The current code: > 1. Clears PTE and unmaps each page individually > 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or > issues individual TLB shootdowns (invlpg+invlpcid) for single-core usage > 3. Submits each page individually to BIO > > This approach results in: > - Excessive full TLB flushes across all cores > - Unnecessary IPI storms when processing multiple pages > - Suboptimal I/O submission patterns > > I initially tried using selective TLB shootdowns (invlpg) instead of > full TLB flushes per each page to avoid interference with other > threads. However, this approach still required sending IPIs to all > cores for each page, which did not significantly improve application > throughput. > > This patch instead optimizes the process by batching operations, > issuing one IPI per PMD instead of per page. This reduces interrupts > by a factor of 512 and enables batching page submissions to BIO. The > new approach: > 1. Collect dirty pages that need to be written back > 2. Issue a single TLB flush for all dirty pages in the batch > 3. Process the collected pages for writebacks (submit to BIO) > > Testing shows significant reduction in application throughput impact > during page-out operations. Applications maintain better performance > during memory reclaim, when triggered by explicit > madvise(MADV_PAGEOUT) calls. > > I'd appreciate your feedback on this approach, especially on the > correctness of batched BIO submissions. Looking forward to your > comments. 
On Mon, Jan 20, 2025 at 04:47:29PM -0600, Vinay Banakar wrote:
> The current implementation in shrink_folio_list() performs full TLB
> flushes and issues IPIs for each individual page being reclaimed. This
> causes unnecessary overhead during memory reclaim, whether triggered
> by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where
> applications are actively moving cold pages to swap while maintaining
> high performance requirements for hot pages.
>
> The current code:
> 1. Clears PTE and unmaps each page individually
> 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or
> issues individual TLB shootdowns (invlpg+invlpcid) for single-core usage
> 3. Submits each page individually to BIO
>
> This approach results in:
> - Excessive full TLB flushes across all cores
> - Unnecessary IPI storms when processing multiple pages
> - Suboptimal I/O submission patterns
>
> I initially tried using selective TLB shootdowns (invlpg) instead of
> full TLB flushes per each page to avoid interference with other
> threads. However, this approach still required sending IPIs to all
> cores for each page, which did not significantly improve application
> throughput.
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writebacks (submit to BIO)

The *interesting* IPIs will be reduced by 1/512 at most. Can we see the
improvement number?

	Byungchul

> Testing shows significant reduction in application throughput impact
> during page-out operations. Applications maintain better performance
> during memory reclaim, when triggered by explicit
> madvise(MADV_PAGEOUT) calls.
>
> I'd appreciate your feedback on this approach, especially on the
> correctness of batched BIO submissions. Looking forward to your
> comments.
On Mon, Jan 20, 2025 at 7:44 PM Byungchul Park <byungchul@sk.com> wrote:
> The *interesting* IPIs will be reduced by 1/512 at most. Can we see the
> improvement number?
Yes, we reduce IPIs by a factor of 512 by sending one IPI (for TLB
flush) per PMD rather than per page. Since shrink_folio_list()
operates on one PMD at a time, I believe we can safely batch these
operations here.
Here's a concrete example:
When swapping out 20 GiB (5.2M pages):
- Current: Each page triggers an IPI to all cores
- With 6 cores: 31.4M total interrupts (6 cores × 5.2M pages)
- With patch: One IPI per PMD (512 pages)
- Only 10.2K IPIs required (5.2M/512)
- With 6 cores: 61.4K total interrupts
- Results in ~99% reduction in total interrupts
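For reference, the same arithmetic as a small standalone program
(illustrative only; it assumes 4 KiB base pages, 512 PTEs per PMD, and
that every TLB flush is a broadcast IPI serviced by all six cores):

#include <stdio.h>

int main(void)
{
	/* Assumptions: 4 KiB base pages, 512 PTEs per PMD, 6 cores,
	 * one broadcast IPI per TLB flush hitting every core. */
	const unsigned long long pages = (20ULL << 30) / 4096;	/* ~5.24M */
	const unsigned long long cores = 6;
	const unsigned long long per_pmd = 512;
	const unsigned long long batches = (pages + per_pmd - 1) / per_pmd;

	printf("per-page flushes: %llu IPIs, %llu total interrupts\n",
	       pages, pages * cores);
	printf("per-PMD flushes:  %llu IPIs, %llu total interrupts\n",
	       batches, batches * cores);
	return 0;
}

which reproduces the ~31.4M vs ~61.4K totals above.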
Application performance impact varies by workload, but here's a
representative test case:
- Thread 1: Continuously accesses a 2 GiB private anonymous map (64B
chunks at random offsets)
- Thread 2: Pinned to different core, uses MADV_PAGEOUT on 20 GiB
private anonymous map to swap it out to SSD
- The threads only access their respective maps.
Results:
- Without patch: Thread 1 sees ~53% throughput reduction during
swap. If there are multiple worker threads (like thread 1), the
cumulative throughput degradation will be much higher
- With patch: Thread 1 maintains normal throughput
I expect a similar application performance impact when memory reclaim
is triggered by kswapd.
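A rough userspace sketch of the reclaim side of this test (Thread 2);
the hot-path worker thread, core pinning, and throughput measurement
are omitted, and the 20 GiB size is simply the value used above:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* available since Linux 5.4 */
#endif

int main(void)
{
	const size_t len = 20UL << 30;	/* 20 GiB private anonymous map */
	char *cold = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (cold == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty every page so reclaim must write it back to swap. */
	memset(cold, 0xaa, len);

	/*
	 * Ask the kernel to reclaim the whole range; on this path
	 * shrink_folio_list() is fed one PMD's worth of pages (up to
	 * 512) at a time, which is where the batched flush helps.
	 */
	if (madvise(cold, len, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	munmap(cold, len);
	return 0;
}

Thread 1 would be a separate thread, pinned to another core, doing
random 64-byte accesses into its own 2 GiB map while this runs.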
On 21-Jan-25 5:35 AM, Vinay Banakar wrote: > Sorry, the previous patch was unreadable due to damaged whitespace. > Here is the same patch with fixed indentation. > > Signed-off-by: Vinay Banakar <vny@google.com> > --- > mm/vmscan.c | 107 > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- > 1 file changed, 74 insertions(+), 33 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index bd489c1af..1bd510622 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct > list_head *folio_list, > struct folio_batch free_folios; > LIST_HEAD(ret_folios); > LIST_HEAD(demote_folios); > + LIST_HEAD(pageout_list); > unsigned int nr_reclaimed = 0; > unsigned int pgactivate = 0; > bool do_demote_pass; > @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct > list_head *folio_list, > if (!sc->may_writepage) > goto keep_locked; > > - /* > - * Folio is dirty. Flush the TLB if a writable entry > - * potentially exists to avoid CPU writes after I/O > - * starts and then write it out here. > - */ > - try_to_unmap_flush_dirty(); > - switch (pageout(folio, mapping, &plug)) { > - case PAGE_KEEP: > - goto keep_locked; > - case PAGE_ACTIVATE: > - goto activate_locked; > - case PAGE_SUCCESS: > - stat->nr_pageout += nr_pages; > - > - if (folio_test_writeback(folio)) > - goto keep; > - if (folio_test_dirty(folio)) > - goto keep; > - > - /* > - * A synchronous write - probably a ramdisk. Go > - * ahead and try to reclaim the folio. > - */ > - if (!folio_trylock(folio)) > - goto keep; > - if (folio_test_dirty(folio) || > - folio_test_writeback(folio)) > - goto keep_locked; > - mapping = folio_mapping(folio); > - fallthrough; > - case PAGE_CLEAN: > - ; /* try to free the folio below */ > - } > + /* Add to pageout list for defered bio submissions */ > + list_add(&folio->lru, &pageout_list); > + continue; The dirty pages are collected in a list here... > } > > /* > @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct > list_head *folio_list, > } > /* 'folio_list' is always empty here */ > > + if (!list_empty(&pageout_list)) { > + /* > + * Batch TLB flushes by flushing once before processing > all dirty pages. > + * Since we operate on one PMD at a time, this batches > TLB flushes at > + * PMD granularity rather than per-page, reducing IPIs. > + */ > + struct address_space *mapping; > + try_to_unmap_flush_dirty(); and one flush request is issued for the entire list. Where is the PMD level (512) batching done? Is that implicit elsewhere in the flow? > + > + while (!list_empty(&pageout_list)) { > + struct folio *folio = lru_to_folio(&pageout_list); > + list_del(&folio->lru); > + > + /* Recheck if page got reactivated */ > + if (folio_test_active(folio) || > + (folio_mapped(folio) && folio_test_young(folio))) > + goto skip_pageout_locked; > + > + mapping = folio_mapping(folio); > + pageout_t pageout_res = pageout(folio, mapping, &plug); > + switch (pageout_res) { > + case PAGE_KEEP: > + goto skip_pageout_locked; > + case PAGE_ACTIVATE: > + goto skip_pageout_locked; > + case PAGE_SUCCESS: > + stat->nr_pageout += folio_nr_pages(folio); > + > + if (folio_test_writeback(folio) || > + folio_test_dirty(folio)) > + goto skip_pageout; > + > + /* > + * A synchronous write - probably a ramdisk. Go > + * ahead and try to reclaim the folio. 
> + */ > + if (!folio_trylock(folio)) > + goto skip_pageout; > + if (folio_test_dirty(folio) || > + folio_test_writeback(folio)) > + goto skip_pageout_locked; > + > + // Try to free the page > + if (!mapping || > + !__remove_mapping(mapping, folio, true, > + sc->target_mem_cgroup)) > + goto skip_pageout_locked; > + > + nr_reclaimed += folio_nr_pages(folio); > + folio_unlock(folio); > + continue; > + > + case PAGE_CLEAN: > + if (!mapping || > + !__remove_mapping(mapping, folio, true, > + sc->target_mem_cgroup)) > + goto skip_pageout_locked; > + > + nr_reclaimed += folio_nr_pages(folio); > + folio_unlock(folio); > + continue; > + } > + > +skip_pageout_locked: > + folio_unlock(folio); > +skip_pageout: > + list_add(&folio->lru, &ret_folios); > + } > + } > + > /* Migrate folios selected for demotion */ > nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > /* Folios that could not be demoted are still in @demote_folios */ Regards, Bharata.
On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> and one flush request is issued for the entire list. Where is the PMD
> level (512) batching done? Is that implicit elsewhere in the flow?

Yes, shrink_folio_list() operates on one PMD at a time, so
pageout_list will contain at most 512 pages during each call. Please
refer to madvise_cold_or_pageout_pte_range() in mm/madvise.c for more
details.

Thanks!
Vinay
On 22-Jan-25 4:39 PM, Vinay Banakar wrote:
> On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
>> and one flush request is issued for the entire list. Where is the PMD
>> level (512) batching done? Is that implicit elsewhere in the flow?
>
> Yes, shrink_folio_list() operates on one PMD at a time, so
> pageout_list will contain at most 512 pages during each call. Please
> refer to madvise_cold_or_pageout_pte_range() in mm/madvise.c for more
> details.

While that may be true for MADV_PAGEOUT path, does the same assumption
hold good for other paths from which shrink_folio_list() gets called?

Regards,
Bharata.
On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> While that may be true for MADV_PAGEOUT path, does the same assumption
> hold good for other paths from which shrink_folio_list() gets called?

shrink_folio_list() is called by three other functions, each with
different batching behavior:
- reclaim_clean_pages_from_list(): Doesn't do PMD batching but only
processes clean pages, so it won't take the path affected by this
patch. This is called from the contiguous memory allocator
(cma_alloc#alloc_contig_range)
- shrink_inactive_list(): Reclaims inactive pages at SWAP_CLUSTER_MAX
(default 32) at a time. With this patch, we will reduce IPIs for TLB
flushes by a factor of 32 in kswapd
- evict_folios(): In the MGLRU case, the number of pages
shrink_folio_list() processes can vary between 64 (MIN_LRU_BATCH) and
4096 (MAX_LRU_BATCH). The reduction in IPIs will vary accordingly

Thanks!
Vinay
On Wed, 22 Jan 2025 07:28:56 -0600 Vinay Banakar <vny@google.com> wrote:
> On Wed, Jan 22, 2025 at 2:59 AM Bharata B Rao <bharata@amd.com> wrote:
> > While that may be true for MADV_PAGEOUT path, does the same assumption
> > hold good for other paths from which shrink_folio_list() gets called?
>
> shrink_folio_list() is called by three other functions, each with
> different batching behavior:
> - reclaim_clean_pages_from_list(): Doesn't do PMD batching but only
> processes clean pages, so it won't take the path affected by this
> patch. This is called from the contiguous memory allocator
> (cma_alloc#alloc_contig_range)
> - shrink_inactive_list(): Reclaims inactive pages at SWAP_CLUSTER_MAX
> (default 32) at a time. With this patch, we will reduce IPIs for TLB
> flushes by a factor of 32 in kswapd
> - evict_folios(): In the MGLRU case, the number of pages
> shrink_folio_list() processes can vary between 64 (MIN_LRU_BATCH) and
> 4096 (MAX_LRU_BATCH). The reduction in IPIs will vary accordingly

damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list()
similar to madvise.c, but it isn't aware of such batching behavior. Have
you checked that path?

Thanks,
SJ

>
> Thanks!
> Vinay
On Mon, 2025-01-20 at 16:47 -0600, Vinay Banakar wrote:
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writebacks (submit to BIO)
>

I see how moving the arch_tlbbatch_flush to
between unmapping the pages, and issuing the
IO could reduce TLB flushing operations
significantly.

However, how does that lead to PMD level
operations?

I don't see any code in your patch that gathers
things at the PMD level. The code simply operates
on whatever folio size is on the list that is
being passed to shrink_folio_list()

Is the PMD level thing some MGLRU specific feature?

> > I'd appreciate your feedback on this approach, especially on the
> > correctness of batched BIO submissions. Looking forward to your
> > comments.
>

Maybe the filesystem people have more comments on
this aspect, but from a VM perspective I suspect
that doing that batch flush in one spot, and then
iterating over the pages should be fine.

My main quibble is with the changelog, and the
comment that refers to "PMD level" operations,
when the code does not appear to be doing any
PMD level coalescing.
On Wed, Jan 22, 2025 at 2:05 PM SeongJae Park <sj@kernel.org> wrote:
> damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list()
> similar to madvise.c, but it isn't aware of such batching behavior. Have
> you checked that path?

Thanks for catching this path. In damon_pa_pageout(),
shrink_folio_list() processes all pages from a single NUMA node that
were collected (filtered) from a single DAMON region (r->ar.start to
r->ar.end). This means it could be processing anywhere from 1 page up
to ULONG_MAX pages from a single node at once. With the patch, we'll
send a single IPI for TLB flush for the entire region, reducing IPIs
by a factor equal to the number of pages being reclaimed by DAMON at
once (decided by damon_reclaim_quota).

My only concern here would be the overhead of maintaining the
temporary pageout_list for batching. However, during BIO submission,
the patch checks if the folio was reactivated, so submitting to BIO in
bulk should be safe.

Another option would be to modify shrink_folio_list() to force batch
flushes for up to N pages (512) at a time, rather than relying on
callers to do the batching via folio_list.

Thanks!
Vinay
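One possible shape of the "force batch flushes for up to N pages"
option above, sketched only to make the idea concrete (PAGEOUT_BATCH_MAX,
nr_deferred and pageout_deferred_batch() are made-up names, not existing
kernel symbols):

/*
 * Hypothetical sketch, not upstream code: cap the deferred dirty-folio
 * batch inside shrink_folio_list() itself, so a single TLB flush never
 * covers more than 512 folios no matter how large the caller's list is.
 */
#define PAGEOUT_BATCH_MAX	512

		/* in the dirty-folio arm of shrink_folio_list()'s main loop */
		list_add(&folio->lru, &pageout_list);
		if (++nr_deferred >= PAGEOUT_BATCH_MAX) {
			/* one IPI for everything deferred so far ... */
			try_to_unmap_flush_dirty();
			/*
			 * ... then submit the deferred folios to BIO and move
			 * survivors to ret_folios, as in the patch's post-loop
			 * processing (made-up helper name).
			 */
			pageout_deferred_batch(&pageout_list, &ret_folios);
			nr_deferred = 0;
		}
		continue;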
On Thu, 23 Jan 2025 11:11:13 -0600 Vinay Banakar <vny@google.com> wrote:
> On Wed, Jan 22, 2025 at 2:05 PM SeongJae Park <sj@kernel.org> wrote:
> > damon_pa_pageout() from mm/damon/paddr.c also calls shrink_folio_list()
> > similar to madvise.c, but it isn't aware of such batching behavior. Have
> > you checked that path?
>
> Thanks for catching this path. In damon_pa_pageout(),
> shrink_folio_list() processes all pages from a single NUMA node that
> were collected (filtered) from a single DAMON region (r->ar.start to
> r->ar.end). This means it could be processing anywhere from 1 page up
> to ULONG_MAX pages from a single node at once.

Thank you Vinay. That's the same as my understanding, except that it is
not limited to a single NUMA node. A region can have any start and end
physical addresses, so it could cover memory of different NUMA nodes.

> With the patch, we'll
> send a single IPI for TLB flush for the entire region, reducing IPIs
> by a factor equal to the number of pages being reclaimed by DAMON at
> once (decided by damon_reclaim_quota).

I guess the fact that the pages could belong to different NUMA nodes
doesn't make a difference here?

> My only concern here would be the overhead of maintaining the
> temporary pageout_list for batching. However, during BIO submission,
> the patch checks if the folio was reactivated, so submitting to BIO in
> bulk should be safe.
>
> Another option would be to modify shrink_folio_list() to force batch
> flushes for up to N pages (512) at a time, rather than relying on
> callers to do the batching via folio_list.

Both sound good to me :)

Thanks,
SJ

>
> Thanks!
> Vinay
On Thu, Jan 23, 2025 at 11:11:13AM -0600, Vinay Banakar wrote:
> Another option would be to modify shrink_folio_list() to force batch
> flushes for up to N pages (512) at a time, rather than relying on
> callers to do the batching via folio_list.

If you really want to improve performance, consider converting
shrink_folio_list() to take a folio_batch. I did this for page freeing:

https://lore.kernel.org/all/20240227174254.710559-1-willy@infradead.org/

and we did in fact see a regression (due to shrinking the batch size
from 32 to 15). We then increased the batch size from 15 to 31 and saw
a 12.5% performance improvement on the page_fault2 benchmark over the
original batch size of 32. Commit 9cecde80aae0 and

https://lore.kernel.org/oe-lkp/202403151058.7048f6a8-oliver.sang@intel.com/

have some more details. I think you should be able to see a noticeable
improvement by going from a list with 512 entries on it to a batch of
31 entries.
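For readers who haven't used the interface, a minimal sketch of the
folio_batch pattern suggested above (shrink_folio_batch() is a
hypothetical name; struct folio_batch holds at most PAGEVEC_SIZE == 31
folios):

#include <linux/pagevec.h>	/* struct folio_batch, PAGEVEC_SIZE == 31 */

/*
 * Hypothetical interface sketch only; shrink_folio_batch() does not
 * exist upstream. It shows the fill/iterate pattern that a folio_batch
 * based conversion of shrink_folio_list() implies.
 */
static unsigned int shrink_folio_batch(struct folio_batch *fbatch,
				       struct pglist_data *pgdat,
				       struct scan_control *sc)
{
	unsigned int nr_reclaimed = 0;
	unsigned int i;

	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		/* placeholder for the real per-folio reclaim work */
		nr_reclaimed += folio_nr_pages(folio);
	}
	return nr_reclaimed;
}

/*
 * Caller side (sketch): after folio_batch_init(&fbatch), keep adding
 * folios; folio_batch_add() returns the slots left, so 0 means the
 * batch of 31 is full and ready to be processed.
 */
	if (!folio_batch_add(&fbatch, folio)) {
		nr_reclaimed += shrink_folio_batch(&fbatch, pgdat, sc);
		folio_batch_reinit(&fbatch);
	}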
On Wed, Jan 22, 2025 at 10:20 PM Rik van Riel <riel@surriel.com> wrote:
> I see how moving the arch_tlbbatch_flush to
> between unmapping the pages, and issuing the
> IO could reduce TLB flushing operations
> significantly.
>
> However, how does that lead to PMD level
> operations?
>
> I don't see any code in your patch that gathers
> things at the PMD level. The code simply operates
> on whatever folio size is on the list that is
> being passed to shrink_folio_list()
>
> Is the PMD level thing some MGLRU specific feature?
> My main quibble is with the changelog, and the
> comment that refers to "PMD level" operations,
> when the code does not appear to be doing any
> PMD level coalescing.

I initially assumed that all paths leading to shrink_folio_list()
submitted 512 pages at a time. Turns out this is only true for the
madvise path.
The other paths have different batch sizes:
- shrink_inactive_list(): Uses SWAP_CLUSTER_MAX (32 pages)
- evict_folios(): Varies between 64 and 4096 pages (MGLRU)
- damon_pa_pageout(): Variable size based on DAMON region
- reclaim_clean_pages_from_list(): Clean pages only, unaffected by
this patch

We have two options:
1. Keep the current logic where TLB flush batching varies by caller
2. Enforce consistent 512-page batching in shrink_folio_list() and
also convert to folio_batch as suggested by Matthew

> > I'd appreciate your feedback on this approach, especially on the
> > correctness of batched BIO submissions. Looking forward to your
> > comments.
>
> Maybe the filesystem people have more comments on
> this aspect, but from a VM perspective I suspect
> that doing that batch flush in one spot, and then
> iterating over the pages should be fine.

Thank you for commenting on this!

Vinay
On Thu, 2025-01-23 at 13:16 -0600, Vinay Banakar wrote:
>
> I initially assumed that all paths leading to shrink_folio_list()
> submitted 512 pages at a time. Turns out this is only true for the
> madvise path.
> The other paths have different batch sizes:
> - shrink_inactive_list(): Uses SWAP_CLUSTER_MAX (32 pages)
> - evict_folios(): Varies between 64 and 4096 pages (MGLRU)
> - damon_pa_pageout(): Variable size based on DAMON region
> - reclaim_clean_pages_from_list(): Clean pages only, unaffected by
> this patch
>
> We have two options:
> 1. Keep the current logic where TLB flush batching varies by caller
> 2. Enforce consistent 512-page batching in shrink_folio_list() and
> also convert to folio_batch as suggested by Matthew

My preference would be to fix one thing at a time, and simply remove
the comments that suggest everything is done at PMD level.

I'll ACK the patch if the misleading comments and changelog text are
removed or fixed up :)
diff --git a/mm/vmscan.c b/mm/vmscan.c index bd489c1af..1bd510622 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, struct folio_batch free_folios; LIST_HEAD(ret_folios); LIST_HEAD(demote_folios); + LIST_HEAD(pageout_list); unsigned int nr_reclaimed = 0; unsigned int pgactivate = 0;
The current implementation in shrink_folio_list() performs full TLB flushes and issues IPIs for each individual page being reclaimed. This causes unnecessary overhead during memory reclaim, whether triggered by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where applications are actively moving cold pages to swap while maintaining high performance requirements for hot pages. The current code: 1. Clears PTE and unmaps each page individually 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or issues individual TLB shootdowns (invlpg+invlpcid) for single-core usage 3. Submits each page individually to BIO This approach results in: - Excessive full TLB flushes across all cores - Unnecessary IPI storms when processing multiple pages - Suboptimal I/O submission patterns I initially tried using selective TLB shootdowns (invlpg) instead of full TLB flushes per each page to avoid interference with other threads. However, this approach still required sending IPIs to all cores for each page, which did not significantly improve application throughput. This patch instead optimizes the process by batching operations, issuing one IPI per PMD instead of per page. This reduces interrupts by a factor of 512 and enables batching page submissions to BIO. The new approach: 1. Collect dirty pages that need to be written back 2. Issue a single TLB flush for all dirty pages in the batch 3. Process the collected pages for writebacks (submit to BIO) Testing shows significant reduction in application throughput impact during page-out operations. Applications maintain better performance during memory reclaim, when triggered by explicit madvise(MADV_PAGEOUT) calls. I'd appreciate your feedback on this approach, especially on the correctness of batched BIO submissions. Looking forward to your comments. Signed-off-by: Vinay Banakar <vny@google.com> --- mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------- 1 file changed, 74 insertions(+), 33 deletions(-) bool do_demote_pass; @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (!sc->may_writepage) goto keep_locked; - /* - * Folio is dirty. Flush the TLB if a writable entry - * potentially exists to avoid CPU writes after I/O - * starts and then write it out here. - */ - try_to_unmap_flush_dirty(); - switch (pageout(folio, mapping, &plug)) { - case PAGE_KEEP: - goto keep_locked; - case PAGE_ACTIVATE: - goto activate_locked; - case PAGE_SUCCESS: - stat->nr_pageout += nr_pages; - - if (folio_test_writeback(folio)) - goto keep; - if (folio_test_dirty(folio)) - goto keep; - - /* - * A synchronous write - probably a ramdisk. Go - * ahead and try to reclaim the folio. - */ - if (!folio_trylock(folio)) - goto keep; - if (folio_test_dirty(folio) || - folio_test_writeback(folio)) - goto keep_locked; - mapping = folio_mapping(folio); - fallthrough; - case PAGE_CLEAN: - ; /* try to free the folio below */ - } + /* Add to pageout list for defered bio submissions */ + list_add(&folio->lru, &pageout_list); + continue; } /* @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, } /* 'folio_list' is always empty here */ + if (!list_empty(&pageout_list)) { + /* + * Batch TLB flushes by flushing once before processing all dirty pages. + * Since we operate on one PMD at a time, this batches TLB flushes at + * PMD granularity rather than per-page, reducing IPIs. 
+ */ + struct address_space *mapping; + try_to_unmap_flush_dirty(); + + while (!list_empty(&pageout_list)) { + struct folio *folio = lru_to_folio(&pageout_list); + list_del(&folio->lru); + + /* Recheck if page got reactivated */ + if (folio_test_active(folio) || + (folio_mapped(folio) && folio_test_young(folio))) + goto skip_pageout_locked; + + mapping = folio_mapping(folio); + pageout_t pageout_res = pageout(folio, mapping, &plug); + switch (pageout_res) { + case PAGE_KEEP: + goto skip_pageout_locked; + case PAGE_ACTIVATE: + goto skip_pageout_locked; + case PAGE_SUCCESS: + stat->nr_pageout += folio_nr_pages(folio); + + if (folio_test_writeback(folio) || + folio_test_dirty(folio)) + goto skip_pageout; + + /* + * A synchronous write - probably a ramdisk. Go + * ahead and try to reclaim the folio. + */ + if (!folio_trylock(folio)) + goto skip_pageout; + if (folio_test_dirty(folio) || + folio_test_writeback(folio)) + goto skip_pageout_locked; + + // Try to free the page + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + + case PAGE_CLEAN: + if (!mapping || + !__remove_mapping(mapping, folio, true, + sc->target_mem_cgroup)) + goto skip_pageout_locked; + + nr_reclaimed += folio_nr_pages(folio); + folio_unlock(folio); + continue; + } + +skip_pageout_locked: + folio_unlock(folio); +skip_pageout: + list_add(&folio->lru, &ret_folios); + } + } + /* Migrate folios selected for demotion */ nr_reclaimed += demote_folio_list(&demote_folios, pgdat); /* Folios that could not be demoted are still in @demote_folios */