diff mbox series

[-V2] mm: move idle swap cache pages to the tail of LRU after COW

Message ID 20210527084953.573788-1-ying.huang@intel.com (mailing list archive)
State New, archived
Headers show
Series [-V2] mm: move idle swap cache pages to the tail of LRU after COW | expand

Commit Message

Huang, Ying May 27, 2021, 8:49 a.m. UTC
With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after
COW, the idle swap cache (neither the page nor the corresponding swap
entry is mapped by any process) will be left at the original position
in the LRU list.  While it may be in the active list or the head of
the inactive list, so that vmscan may take more overhead or time to
reclaim these actually unused pages.

To help the page reclaiming, in this patch, after COW, the idle swap
cache will be tried to be moved to the tail of the inactive LRU list.
To avoid to introduce much overhead to the hot COW code path, all
locks are acquired with try locking.

To test the patch, we used pmbench memory accessing benchmark with
working-set larger than available memory on a 2-socket Intel server
with a NVMe SSD as swap device.  Test results shows that the pmbench
score increases up to 21.8% with the decreased size of swap cache and
swapin throughput.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>

V2:

- Move trylock_page() to try_to_free_idle_swapcache() per Rik and
  Linus' comments.
- Fix PageLRU() checking.
- Fix THP processing.
- Rename the function.
---
 include/linux/memcontrol.h | 10 ++++++++++
 include/linux/swap.h       |  3 +++
 mm/memcontrol.c            | 12 ++++++++++++
 mm/memory.c                |  2 ++
 mm/swapfile.c              | 39 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 66 insertions(+)

Comments

Johannes Weiner May 27, 2021, 4:52 p.m. UTC | #1
On Thu, May 27, 2021 at 04:49:53PM +0800, Huang Ying wrote:
> With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after
> COW, the idle swap cache (neither the page nor the corresponding swap
> entry is mapped by any process) will be left at the original position
> in the LRU list.  While it may be in the active list or the head of
> the inactive list, so that vmscan may take more overhead or time to
> reclaim these actually unused pages.
> 
> To help the page reclaiming, in this patch, after COW, the idle swap
> cache will be tried to be moved to the tail of the inactive LRU list.
> To avoid to introduce much overhead to the hot COW code path, all
> locks are acquired with try locking.
> 
> To test the patch, we used pmbench memory accessing benchmark with
> working-set larger than available memory on a 2-socket Intel server
> with a NVMe SSD as swap device.  Test results shows that the pmbench
> score increases up to 21.8% with the decreased size of swap cache and
> swapin throughput.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> 
> V2:
> 
> - Move trylock_page() to try_to_free_idle_swapcache() per Rik and
>   Linus' comments.
> - Fix PageLRU() checking.
> - Fix THP processing.
> - Rename the function.
> ---
>  include/linux/memcontrol.h | 10 ++++++++++
>  include/linux/swap.h       |  3 +++
>  mm/memcontrol.c            | 12 ++++++++++++
>  mm/memory.c                |  2 ++
>  mm/swapfile.c              | 39 ++++++++++++++++++++++++++++++++++++++
>  5 files changed, 66 insertions(+)

Sorry the discussion fizzled out on the last patch.

Let me try to recap this series: on your first submission you directly
freed the old page if we copied. Linus was worried about overhead in
the COW path that wouldn't pay off in a real workload. Before getting
numbers, it was then suggested to move the pages to the tail of the
LRU and leaving them to reclaim - which was also met with skepticism.

V2 presented the LRU moving version with pmbench numbers that indeed
show it pays off. However, much simpler direct freeing produces even
better numbers in the same benchmark. We don't have numbers showing if
the LRU shuffling would significantly fare better in other workloads.

Purely looking at the code: whether we defer or free, we need to lock
the page, take the LRU spinlock for this page, and touch the LRU
linkage. If we free, we add the swapcache deletion and the page
allocator, but it's most likely the percpu-cached fastpath. If we
defer, reclaim needs to re-establish information about the page that
we already had in the COW context, do another LRU operation, do the
swapcache deletion and go through the allocator, but on cold caches.

Personally, I'm a bit skeptical the extra code complexity and reclaim
overhead in paging workloads will definitely pay off in intermittently
paging loads (non-paging wouldn't have swap pages). As far as code
goes, the point of 09854ba94c6a (+17, -42) was simplification, and
this adds more lines back in another place. In particular it adds
another lifetime variant to swap pages which are already somewhat
unwieldy. OTOH, freeing is a two-liner reusing the swap unmap code:

	if (page_copied)
		free_swap_cache(old_page);

Linus, what do you think?
Linus Torvalds May 28, 2021, 1:27 a.m. UTC | #2
On Thu, May 27, 2021 at 6:53 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> OTOH, freeing is a two-liner reusing the swap unmap code:
>
>         if (page_copied)
>                 free_swap_cache(old_page);
>
> Linus, what do you think?

I'm ok with that version, the important thing was

 (a) avoiding the unconditional page lock we used to have (well, it
first did "trylock", but if tht failed it would then get a page ref,
and do the unconditional lock_page())

 (b) avoid the re-use based on "mapcount" that had problems with
non-mapped page references (ie GUP)

and

 (c) that I wanted to see some numbers rather than just blindly
re-introduce free_swap_cache()

But doing the above two-liner in wp_page_copy() doesn't have the
(a)/(b) issues, and if we have numbers that it helps, then that takes
care of (c) too.

Of course, I don't think it's just that two-liner, because you'd
actually have to export (or move )that "free_swap_cache()" function
that is now private to swapfile.c.

But no, I'm not adverse to the above at all, I just had the above reservations.

I was worried about non-swap behavior (which the old code had with
that whole unconditional page locking whether needed or not), but
free_swap_cache() should be basically free for the non-swap behavior
since it doesn't even do the trylock until after it has checked that
it is now an unmapped swap cache page.

              Linus
Huang, Ying May 30, 2021, 11:44 p.m. UTC | #3
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Thu, May 27, 2021 at 04:49:53PM +0800, Huang Ying wrote:
>> With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after
>> COW, the idle swap cache (neither the page nor the corresponding swap
>> entry is mapped by any process) will be left at the original position
>> in the LRU list.  While it may be in the active list or the head of
>> the inactive list, so that vmscan may take more overhead or time to
>> reclaim these actually unused pages.
>> 
>> To help the page reclaiming, in this patch, after COW, the idle swap
>> cache will be tried to be moved to the tail of the inactive LRU list.
>> To avoid to introduce much overhead to the hot COW code path, all
>> locks are acquired with try locking.
>> 
>> To test the patch, we used pmbench memory accessing benchmark with
>> working-set larger than available memory on a 2-socket Intel server
>> with a NVMe SSD as swap device.  Test results shows that the pmbench
>> score increases up to 21.8% with the decreased size of swap cache and
>> swapin throughput.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Suggested-by: Matthew Wilcox <willy@infradead.org>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Tim Chen <tim.c.chen@intel.com>
>> 
>> V2:
>> 
>> - Move trylock_page() to try_to_free_idle_swapcache() per Rik and
>>   Linus' comments.
>> - Fix PageLRU() checking.
>> - Fix THP processing.
>> - Rename the function.
>> ---
>>  include/linux/memcontrol.h | 10 ++++++++++
>>  include/linux/swap.h       |  3 +++
>>  mm/memcontrol.c            | 12 ++++++++++++
>>  mm/memory.c                |  2 ++
>>  mm/swapfile.c              | 39 ++++++++++++++++++++++++++++++++++++++
>>  5 files changed, 66 insertions(+)
>
> Sorry the discussion fizzled out on the last patch.
>
> Let me try to recap this series: on your first submission you directly
> freed the old page if we copied. Linus was worried about overhead in
> the COW path that wouldn't pay off in a real workload. Before getting
> numbers, it was then suggested to move the pages to the tail of the
> LRU and leaving them to reclaim - which was also met with skepticism.
>
> V2 presented the LRU moving version with pmbench numbers that indeed
> show it pays off. However, much simpler direct freeing produces even
> better numbers in the same benchmark. We don't have numbers showing if
> the LRU shuffling would significantly fare better in other workloads.
>
> Purely looking at the code: whether we defer or free, we need to lock
> the page, take the LRU spinlock for this page, and touch the LRU
> linkage. If we free, we add the swapcache deletion and the page
> allocator, but it's most likely the percpu-cached fastpath. If we
> defer, reclaim needs to re-establish information about the page that
> we already had in the COW context, do another LRU operation, do the
> swapcache deletion and go through the allocator, but on cold caches.
>
> Personally, I'm a bit skeptical the extra code complexity and reclaim
> overhead in paging workloads will definitely pay off in intermittently
> paging loads (non-paging wouldn't have swap pages). As far as code
> goes, the point of 09854ba94c6a (+17, -42) was simplification, and
> this adds more lines back in another place. In particular it adds
> another lifetime variant to swap pages which are already somewhat
> unwieldy. OTOH, freeing is a two-liner reusing the swap unmap code:
>
> 	if (page_copied)
> 		free_swap_cache(old_page);

Yes.  This looks better than my previous version, which duplicated the
code of free_swap_cache().  Thanks for pointing this out.

Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3cc18c2176e7..d6c6ff69b586 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -760,6 +760,7 @@  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct lruvec *lock_page_lruvec(struct page *page);
 struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *trylock_page_lruvec_irq(struct page *page);
 struct lruvec *lock_page_lruvec_irqsave(struct page *page,
 						unsigned long *flags);
 
@@ -1250,6 +1251,15 @@  static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
 	return &pgdat->__lruvec;
 }
 
+static inline struct lruvec *trylock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	if (spin_trylock_irq(&pgdat->__lruvec.lru_lock))
+		return &pgdat->__lruvec;
+	return NULL;
+}
+
 static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
 		unsigned long *flagsp)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 032485ee7597..dbef99233736 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -509,6 +509,7 @@  extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
 extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
+extern void deactivate_idle_swapcache(struct page *page);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
@@ -677,6 +678,8 @@  static inline int try_to_free_swap(struct page *page)
 	return 0;
 }
 
+static inline void deactivate_idle_swapcache(struct page *page) {}
+
 static inline swp_entry_t get_swap_page(struct page *page)
 {
 	swp_entry_t entry;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb864f87b01d..a192476fc55c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1213,6 +1213,18 @@  struct lruvec *lock_page_lruvec_irq(struct page *page)
 	return lruvec;
 }
 
+struct lruvec *trylock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_page_lruvec(page);
+	if (spin_trylock_irq(&lruvec->lru_lock)) {
+		lruvec_memcg_debug(lruvec, page);
+		return lruvec;
+	}
+	return NULL;
+}
+
 struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
 {
 	struct lruvec *lruvec;
diff --git a/mm/memory.c b/mm/memory.c
index 2b7ffcbca175..8a4b1ccd6879 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3104,6 +3104,8 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 				munlock_vma_page(old_page);
 			unlock_page(old_page);
 		}
+		if (page_copied && PageSwapCache(old_page) && !page_mapped(old_page))
+			deactivate_idle_swapcache(old_page);
 		put_page(old_page);
 	}
 	return page_copied ? VM_FAULT_WRITE : 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cbb4c0795284..0b390b14c5ae 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -40,6 +40,7 @@ 
 #include <linux/swap_slots.h>
 #include <linux/sort.h>
 #include <linux/completion.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
@@ -1748,6 +1749,44 @@  int try_to_free_swap(struct page *page)
 	return 1;
 }
 
+static inline void locked_deactivate_idle_swapcache(struct page *page)
+{
+	struct lruvec *lruvec;
+
+	page = compound_head(page);
+	if (!PageSwapCache(page))
+		return;
+	if (PageWriteback(page))
+		return;
+	if (page_mapped(page))
+		return;
+	if (!PageLRU(page))
+		return;
+	if (page_swapped(page))
+		return;
+	lruvec = trylock_page_lruvec_irq(page);
+	if (!lruvec)
+		return;
+
+	if (TestClearPageLRU(page)) {
+		del_page_from_lru_list(page, lruvec);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		add_page_to_lru_list_tail(page, lruvec);
+		SetPageLRU(page);
+	}
+
+	unlock_page_lruvec_irq(lruvec);
+}
+
+void deactivate_idle_swapcache(struct page *page)
+{
+	if (trylock_page(page)) {
+		locked_deactivate_idle_swapcache(page);
+		unlock_page(page);
+	}
+}
+
 /*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.