Message ID | 20210519013313.1274454-1-ying.huang@intel.com (mailing list archive)
---|---
State | New, archived
Series | mm: move idle swap cache pages to the tail of LRU after COW
On Wed, 2021-05-19 at 09:33 +0800, Huang Ying wrote:

> To test the patch, we used pmbench memory accessing benchmark with
> working-set larger than available memory on a 2-socket Intel server
> with a NVMe SSD as swap device. Test results shows that the pmbench
> score increases up to 21.8% with the decreased size of swap cache and
> swapin throughput.

Nice!

> +++ b/mm/memory.c
> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault
> *vmf)
>                          munlock_vma_page(old_page);
>                          unlock_page(old_page);
>                  }
> +                if (page_copied && PageSwapCache(old_page) &&
> +                    !page_mapped(old_page) && trylock_page(old_page)) {
> +                        try_to_free_idle_swapcache(old_page);
> +                        unlock_page(old_page);
> +                }

That's quite the if condition!

Would it make sense to move some of the tests, as well
as the trylock and unlock into try_to_free_idle_swapcache()
itself?

Especially considering that page_mapped is already tested
in that function, too...

>                  put_page(old_page);
>          }
On Tue, May 18, 2021 at 3:33 PM Huang Ying <ying.huang@intel.com> wrote:
>
> With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after
> COW, the idle swap cache (neither the page nor the corresponding swap
> entry is mapped by any process) will be left at the original position
> in the LRU list. While it may be in the active list or the head of
> the inactive list, so that vmscan may take more overhead or time to
> reclaim these actually unused pages.

This looks sensible to me (and numbers talk!), but as Rik says, it
would probably be a good idea to move the trylock_page()/unlock_page()
into try_to_free_idle_swapcache(), and that would make the calling
side a whole lot cleaner and easier to read.

Linus
On Tue, May 18, 2021 at 5:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This looks sensible to me (and numbers talk!), but as Rik says, it
> would probably be a good idea to move the trylock_page()/unlock_page()
> into try_to_free_idle_swapcache(), and that would make the calling
> side a whole lot cleaner and easier to read.

To keep the error handling simple, and keep that "if that didn't work,
just return" logic in you had, doing it as two functions like:

  static inline void locked_try_to_free_idle_swapcache(struct page *page)
  { .. your current try_to_free_idle_swapcache() .. }

  void try_to_free_idle_swapcache(struct page *page)
  {
        if (trylock_page(page)) {
                locked_try_to_free_idle_swapcache(page);
                unlock_page(page);
        }
  }

would keep that readability and simplicity.

And then the wp_page_copy() code ends up being

  if (page_copied && PageSwapCache(old_page) && !page_mapped(old_page))
        try_to_free_idle_swapcache(old_page);

which looks pretty sensible to me: if we copied the page, and the old
page is a no longer mapped swap cache page, let's try to free it.

That's still a hell of a long conditional, partly because of those
long names. But at least it's conceptually fairly straightforward and
easy to understand what's going on.

No?

Linus
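Fleshed out against the helper added by this patch, the suggested split would look roughly like the sketch below. The locked_ name comes from the mail above, and the body is lifted from the diff at the bottom of this page; this is an illustration of the proposal, not code from any posted version of the series.

static void locked_try_to_free_idle_swapcache(struct page *page)
{
        struct lruvec *lruvec;
        swp_entry_t entry;

        /* All checks run with the page lock held by the caller. */
        if (!PageSwapCache(page) || PageWriteback(page) ||
            !PageLRU(page) || page_mapped(page))
                return;
        entry.val = page_private(page);
        if (__swap_count(entry))
                return;

        lruvec = trylock_page_lruvec_irq(page);
        if (!lruvec)
                return;

        /* Demote the idle swap cache page to the inactive tail. */
        del_page_from_lru_list(page, lruvec);
        ClearPageActive(page);
        ClearPageReferenced(page);
        add_page_to_lru_list_tail(page, lruvec);
        unlock_page_lruvec_irq(lruvec);
}

void try_to_free_idle_swapcache(struct page *page)
{
        /* Opportunistic: if the page lock is contended, just give up. */
        if (trylock_page(page)) {
                locked_try_to_free_idle_swapcache(page);
                unlock_page(page);
        }
}

With that, the wp_page_copy() call site shrinks to the two-line test quoted in the mail above, and all locking details stay hidden in the helper.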
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, May 18, 2021 at 5:17 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> This looks sensible to me (and numbers talk!), but as Rik says, it
>> would probably be a good idea to move the trylock_page()/unlock_page()
>> into try_to_free_idle_swapcache(), and that would make the calling
>> side a whole lot cleaner and easier to read.
>
> To keep the error handling simple, and keep that "if that didn't work,
> just return" logic in you had, doing it as two functions like:
>
>   static inline void locked_try_to_free_idle_swapcache(struct page *page)
>   { .. your current try_to_free_idle_swapcache() .. }
>
>   void try_to_free_idle_swapcache(struct page *page)
>   {
>         if (trylock_page(page)) {
>                 locked_try_to_free_idle_swapcache(page);
>                 unlock_page(page);
>         }
>   }
>
> would keep that readability and simplicity.
>
> And then the wp_page_copy() code ends up being
>
>   if (page_copied && PageSwapCache(old_page) && !page_mapped(old_page))
>         try_to_free_idle_swapcache(old_page);
>
> which looks pretty sensible to me: if we copied the page, and the old
> page is a no longer mapped swap cache page, let's try to free it.
>
> That's still a hell of a long conditional, partly because of those
> long names. But at least it's conceptually fairly straightforward and
> easy to understand what's going on.

Thanks!  That looks much better.  I will do that in the next version.

Best Regards,
Huang, Ying
Rik van Riel <riel@surriel.com> writes:

> On Wed, 2021-05-19 at 09:33 +0800, Huang Ying wrote:
>
>> To test the patch, we used pmbench memory accessing benchmark with
>> working-set larger than available memory on a 2-socket Intel server
>> with a NVMe SSD as swap device. Test results shows that the pmbench
>> score increases up to 21.8% with the decreased size of swap cache and
>> swapin throughput.
>
> Nice!
>
>> +++ b/mm/memory.c
>> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault
>> *vmf)
>>                          munlock_vma_page(old_page);
>>                          unlock_page(old_page);
>>                  }
>> +                if (page_copied && PageSwapCache(old_page) &&
>> +                    !page_mapped(old_page) && trylock_page(old_page)) {
>> +                        try_to_free_idle_swapcache(old_page);
>> +                        unlock_page(old_page);
>> +                }
>
> That's quite the if condition!
>
> Would it make sense to move some of the tests, as well
> as the trylock and unlock into try_to_free_idle_swapcache()
> itself?

Sure.  Will put trylock/unlock into try_to_free_idle_swapcache() as
suggested by Linus.

> Especially considering that page_mapped is already tested
> in that function, too...

The two page_mapped() tests are intended.  The first one is a quick
check with the page unlocked, the second one is to confirm with the
page locked.  Because if the page is unlocked, the swap count may be
transited to map count or vice versa.

Best Regards,
Huang, Ying
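The ordering matters here: the unlocked test is only a cheap filter, and the page lock is what makes the second test trustworthy. Below is a minimal sketch of that check/trylock/recheck shape, using a hypothetical helper name; it illustrates the pattern described above rather than reproducing the patch.

/* Hypothetical helper, for illustration only. */
static void cow_demote_idle_swapcache(struct page *page)
{
        /* Unlocked pre-check: cheap, but the answer may go stale. */
        if (!PageSwapCache(page) || page_mapped(page))
                return;

        /* Never wait for the page lock in the hot COW path. */
        if (!trylock_page(page))
                return;

        /*
         * Recheck under the page lock.  As Ying notes above, a swap
         * reference can only be converted to a map reference (or back)
         * while the page is locked, so this second test is stable.
         */
        if (PageSwapCache(page) && !page_mapped(page)) {
                /* safe to demote the page to the inactive LRU tail here */
        }

        unlock_page(page);
}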
On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index b83f734c4e1d..2b6847f4c03e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>                          munlock_vma_page(old_page);
>                          unlock_page(old_page);
>                  }
> +                if (page_copied && PageSwapCache(old_page) &&
> +                    !page_mapped(old_page) && trylock_page(old_page)) {
> +                        try_to_free_idle_swapcache(old_page);
> +                        unlock_page(old_page);

If there are no more swap or pte references, can we just attempt to
free the page right away, like we do during regular unmap?

        if (page_copied)
                free_swap_cache(old_page);
        put_page(old_page);
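For context, free_swap_cache() in mm/swap_state.c already bundles the swap-cache and mapping checks with a trylock; at the time it looks roughly like the sketch below (paraphrased here, not quoted from this thread), which is why the single call in the snippet above is enough.

static inline void free_swap_cache(struct page *page)
{
        /*
         * Only bother if the page is still in the swap cache and no
         * longer mapped, and only if the page lock is uncontended.
         */
        if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) {
                try_to_free_swap(page);         /* drop the swap cache entry */
                unlock_page(page);
        }
}

Unlike the LRU demotion done by this patch, try_to_free_swap() actually releases the swap slot and removes the page from the swap cache, so the following put_page() can typically free the page right away.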
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b83f734c4e1d..2b6847f4c03e 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>>                          munlock_vma_page(old_page);
>>                          unlock_page(old_page);
>>                  }
>> +                if (page_copied && PageSwapCache(old_page) &&
>> +                    !page_mapped(old_page) && trylock_page(old_page)) {
>> +                        try_to_free_idle_swapcache(old_page);
>> +                        unlock_page(old_page);
>
> If there are no more swap or pte references, can we just attempt to
> free the page right away, like we do during regular unmap?
>
>         if (page_copied)
>                 free_swap_cache(old_page);
>         put_page(old_page);

A previous version of the patch does roughly this.

https://lore.kernel.org/lkml/20210113024241.179113-1-ying.huang@intel.com/

But Linus has concerns with the overhead introduced in the hot COW path.

Another possibility is to move the idle swap cache page to the tail of
the file LRU list.  But the question is how to identify the page.

Best Regards,
Huang, Ying
On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
>
> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b83f734c4e1d..2b6847f4c03e 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >>                          munlock_vma_page(old_page);
> >>                          unlock_page(old_page);
> >>                  }
> >> +                if (page_copied && PageSwapCache(old_page) &&
> >> +                    !page_mapped(old_page) && trylock_page(old_page)) {
> >> +                        try_to_free_idle_swapcache(old_page);
> >> +                        unlock_page(old_page);
> >
> > If there are no more swap or pte references, can we just attempt to
> > free the page right away, like we do during regular unmap?
> >
> >         if (page_copied)
> >                 free_swap_cache(old_page);
> >         put_page(old_page);
>
> A previous version of the patch does roughly this.
>
> https://lore.kernel.org/lkml/20210113024241.179113-1-ying.huang@intel.com/
>
> But Linus has concerns with the overhead introduced in the hot COW path.

Sorry, I had missed that thread.

It sounds like there were the same concerns about the LRU shuffling
overhead in the COW page. Now we have numbers for that, but not the
free_swap_cache version. Would you be able to run the numbers for that
as well? It would be interesting to see how much the additional code
complexity buys us.

> Another possibility is to move the idle swap cache page to the tail of
> the file LRU list.  But the question is how to identify the page.

The LRU type is identified by PG_swapbacked, and we do clear that for
anon pages to implement MADV_FREE. It may work here too. But I'm
honestly a bit skeptical about the ROI on this...
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>>
>> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
>> >> diff --git a/mm/memory.c b/mm/memory.c
>> >> index b83f734c4e1d..2b6847f4c03e 100644
>> >> --- a/mm/memory.c
>> >> +++ b/mm/memory.c
>> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>> >>                          munlock_vma_page(old_page);
>> >>                          unlock_page(old_page);
>> >>                  }
>> >> +                if (page_copied && PageSwapCache(old_page) &&
>> >> +                    !page_mapped(old_page) && trylock_page(old_page)) {
>> >> +                        try_to_free_idle_swapcache(old_page);
>> >> +                        unlock_page(old_page);
>> >
>> > If there are no more swap or pte references, can we just attempt to
>> > free the page right away, like we do during regular unmap?
>> >
>> >         if (page_copied)
>> >                 free_swap_cache(old_page);
>> >         put_page(old_page);
>>
>> A previous version of the patch does roughly this.
>>
>> https://lore.kernel.org/lkml/20210113024241.179113-1-ying.huang@intel.com/
>>
>> But Linus has concerns with the overhead introduced in the hot COW path.
>
> Sorry, I had missed that thread.
>
> It sounds like there were the same concerns about the LRU shuffling
> overhead in the COW page. Now we have numbers for that, but not the
> free_swap_cache version. Would you be able to run the numbers for that
> as well? It would be interesting to see how much the additional code
> complexity buys us.

The number for which workload?  The workload that is used to evaluate
this patch?

>> Another possibility is to move the idle swap cache page to the tail of
>> the file LRU list.  But the question is how to identify the page.
>
> The LRU type is identified by PG_swapbacked, and we do clear that for
> anon pages to implement MADV_FREE. It may work here too. But I'm
> honestly a bit skeptical about the ROI on this...

The definition of PageSwapCache() is

static __always_inline int PageSwapCache(struct page *page)
{
#ifdef CONFIG_THP_SWAP
        page = compound_head(page);
#endif
        return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
}

So we cannot clear PG_swapbacked directly.

Best Regards,
Huang, Ying
On Thu, May 20, 2021 at 09:59:15AM +0800, Huang, Ying wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
>
> > On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
> >> Johannes Weiner <hannes@cmpxchg.org> writes:
> >>
> >> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
> >> >> diff --git a/mm/memory.c b/mm/memory.c
> >> >> index b83f734c4e1d..2b6847f4c03e 100644
> >> >> --- a/mm/memory.c
> >> >> +++ b/mm/memory.c
> >> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >> >>                          munlock_vma_page(old_page);
> >> >>                          unlock_page(old_page);
> >> >>                  }
> >> >> +                if (page_copied && PageSwapCache(old_page) &&
> >> >> +                    !page_mapped(old_page) && trylock_page(old_page)) {
> >> >> +                        try_to_free_idle_swapcache(old_page);
> >> >> +                        unlock_page(old_page);
> >> >
> >> > If there are no more swap or pte references, can we just attempt to
> >> > free the page right away, like we do during regular unmap?
> >> >
> >> >         if (page_copied)
> >> >                 free_swap_cache(old_page);
> >> >         put_page(old_page);
> >>
> >> A previous version of the patch does roughly this.
> >>
> >> https://lore.kernel.org/lkml/20210113024241.179113-1-ying.huang@intel.com/
> >>
> >> But Linus has concerns with the overhead introduced in the hot COW path.
> >
> > Sorry, I had missed that thread.
> >
> > It sounds like there were the same concerns about the LRU shuffling
> > overhead in the COW page. Now we have numbers for that, but not the
> > free_swap_cache version. Would you be able to run the numbers for that
> > as well? It would be interesting to see how much the additional code
> > complexity buys us.
>
> The number for which workload?  The workload that is used to evaluate
> this patch?

Yeah, the pmbench one from the changelog.
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Thu, May 20, 2021 at 09:59:15AM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>>
>> > On Thu, May 20, 2021 at 09:22:45AM +0800, Huang, Ying wrote:
>> >> Johannes Weiner <hannes@cmpxchg.org> writes:
>> >>
>> >> > On Wed, May 19, 2021 at 09:33:13AM +0800, Huang Ying wrote:
>> >> >> diff --git a/mm/memory.c b/mm/memory.c
>> >> >> index b83f734c4e1d..2b6847f4c03e 100644
>> >> >> --- a/mm/memory.c
>> >> >> +++ b/mm/memory.c
>> >> >> @@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>> >> >>                          munlock_vma_page(old_page);
>> >> >>                          unlock_page(old_page);
>> >> >>                  }
>> >> >> +                if (page_copied && PageSwapCache(old_page) &&
>> >> >> +                    !page_mapped(old_page) && trylock_page(old_page)) {
>> >> >> +                        try_to_free_idle_swapcache(old_page);
>> >> >> +                        unlock_page(old_page);
>> >> >
>> >> > If there are no more swap or pte references, can we just attempt to
>> >> > free the page right away, like we do during regular unmap?
>> >> >
>> >> >         if (page_copied)
>> >> >                 free_swap_cache(old_page);
>> >> >         put_page(old_page);
>> >>
>> >> A previous version of the patch does roughly this.
>> >>
>> >> https://lore.kernel.org/lkml/20210113024241.179113-1-ying.huang@intel.com/
>> >>
>> >> But Linus has concerns with the overhead introduced in the hot COW path.
>> >
>> > Sorry, I had missed that thread.
>> >
>> > It sounds like there were the same concerns about the LRU shuffling
>> > overhead in the COW page. Now we have numbers for that, but not the
>> > free_swap_cache version. Would you be able to run the numbers for that
>> > as well? It would be interesting to see how much the additional code
>> > complexity buys us.
>>
>> The number for which workload?  The workload that is used to evaluate
>> this patch?
>
> Yeah, the pmbench one from the changelog.

Sure.  I have rebased the original patch that frees the idle swap cache
directly and done the test.  The results show that the pmbench score of
freeing directly is a little better than that of moving to the tail of
LRU.  The pmbench score increases about 3.6%.  I think this is expected,
because we need to free the page finally even if we move the idle swap
cache to the tail of LRU.

Best Regards,
Huang, Ying
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0ce97eff79e2..68956db13772 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -761,6 +761,7 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct lruvec *lock_page_lruvec(struct page *page);
 struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *trylock_page_lruvec_irq(struct page *page);
 struct lruvec *lock_page_lruvec_irqsave(struct page *page,
 						unsigned long *flags);
 
@@ -1251,6 +1252,15 @@ static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
 	return &pgdat->__lruvec;
 }
 
+static inline struct lruvec *trylock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	if (spin_trylock_irq(&pgdat->__lruvec.lru_lock))
+		return &pgdat->__lruvec;
+	return NULL;
+}
+
 static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
 		unsigned long *flagsp)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46d51d058d05..d344b0fa7925 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -504,6 +504,7 @@ extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
 extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
+extern void try_to_free_idle_swapcache(struct page *page);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
@@ -668,6 +669,8 @@ static inline int try_to_free_swap(struct page *page)
 	return 0;
 }
 
+static inline void try_to_free_idle_swapcache(struct page *page) {}
+
 static inline swp_entry_t get_swap_page(struct page *page)
 {
 	swp_entry_t entry;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index db29b96f7311..e3e813bfebe2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1213,6 +1213,18 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
 	return lruvec;
 }
 
+struct lruvec *trylock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_page_lruvec(page);
+	if (spin_trylock_irq(&lruvec->lru_lock)) {
+		lruvec_memcg_debug(lruvec, page);
+		return lruvec;
+	}
+	return NULL;
+}
+
 struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
 {
 	struct lruvec *lruvec;
diff --git a/mm/memory.c b/mm/memory.c
index b83f734c4e1d..2b6847f4c03e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3012,6 +3012,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			munlock_vma_page(old_page);
 			unlock_page(old_page);
 		}
+		if (page_copied && PageSwapCache(old_page) &&
+		    !page_mapped(old_page) && trylock_page(old_page)) {
+			try_to_free_idle_swapcache(old_page);
+			unlock_page(old_page);
+		}
 		put_page(old_page);
 	}
 	return page_copied ? VM_FAULT_WRITE : 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2aad85751991..e0dd8937de4e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -40,6 +40,7 @@
 #include <linux/swap_slots.h>
 #include <linux/sort.h>
 #include <linux/completion.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 #include <linux/swapops.h>
@@ -1788,6 +1789,34 @@ int try_to_free_swap(struct page *page)
 	return 1;
 }
 
+void try_to_free_idle_swapcache(struct page *page)
+{
+	struct lruvec *lruvec;
+	swp_entry_t entry;
+
+	if (!PageSwapCache(page))
+		return;
+	if (PageWriteback(page))
+		return;
+	if (!PageLRU(page))
+		return;
+	if (page_mapped(page))
+		return;
+	entry.val = page_private(page);
+	if (__swap_count(entry))
+		return;
+
+	lruvec = trylock_page_lruvec_irq(page);
+	if (!lruvec)
+		return;
+
+	del_page_from_lru_list(page, lruvec);
+	ClearPageActive(page);
+	ClearPageReferenced(page);
+	add_page_to_lru_list_tail(page, lruvec);
+
+	unlock_page_lruvec_irq(lruvec);
+}
 /*
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after
COW, the idle swap cache (neither the page nor the corresponding swap
entry is mapped by any process) is left at its original position in
the LRU list.  It may be in the active list or at the head of the
inactive list, so vmscan may take more overhead or time to reclaim
these actually unused pages.

To help page reclaim, with this patch, after COW we try to move the
idle swap cache to the tail of the inactive LRU list.  To avoid
introducing much overhead to the hot COW code path, all locks are
acquired with try locking only.

To test the patch, we used the pmbench memory accessing benchmark with
a working set larger than the available memory on a 2-socket Intel
server with an NVMe SSD as the swap device.  Test results show that the
pmbench score increases by up to 21.8% with the decreased size of the
swap cache and swapin throughput.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
---
 include/linux/memcontrol.h | 10 ++++++++++
 include/linux/swap.h       |  3 +++
 mm/memcontrol.c            | 12 ++++++++++++
 mm/memory.c                |  5 +++++
 mm/swapfile.c              | 29 +++++++++++++++++++++++++++++
 5 files changed, 59 insertions(+)