[v20,00/20] per memcg lru lock

Message ID 1603968305-8026-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)

Message

Alex Shi Oct. 29, 2020, 10:44 a.m. UTC
This version is just a rebase onto v5.10-rc1, and it moves the lru_lock down
after lists[] in struct lruvec, which resolves a fio.read regression revealed
by Rong Chen -- Intel LKP.
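For reference, here is a minimal sketch of where the lock ends up in struct
lruvec. This is not the full definition from include/linux/mmzone.h, only the
fields relevant here, with the lock placed right after lists[] as described
above (that placement is what resolved the fio.read regression):

struct lruvec {
	struct list_head	lists[NR_LRU_LISTS];
	/* per-lruvec lru_lock for memcg, placed right after lists[] */
	spinlock_t		lru_lock;
	/* ... cost accounting, refault counters, flags, etc. ... */
};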

Many thanks to Hugh Dickins and Alexander Duyck for the line-by-line review.

So now this patchset consists of 3 parts:
1, some code cleanup and minor optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

The current lru_lock is one per node, pgdat->lru_lock, guarding the lru lists,
but the lru lists were moved into memcg a long time ago. Keeping a per-node
lru_lock is clearly unscalable: pages from all memcgs on a node have to
compete with each other for a single lru_lock. This patchset replaces the
per-node lru lock with a per-lruvec (per-memcg, per-node) lru_lock to guard
the lru lists, making it scale with memcgs and gaining performance.
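As a rough before/after sketch of what a caller does (simplified; real call
sites also deal with irq flags, compound pages and rechecks):

	/* before: one lock per node */
	spin_lock_irq(&page_pgdat(page)->lru_lock);
	/* ... touch the node's lru lists ... */
	spin_unlock_irq(&page_pgdat(page)->lru_lock);

	/* after: lock the page's own lruvec (per memcg, per node) */
	lruvec = lock_page_lruvec_irq(page);
	/* ... touch lruvec->lists[] ... */
	unlock_page_lruvec_irq(lruvec);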

Currently lru_lock still guards both the lru list and the page's lru bit,
which is fine. But if we want to take the page's specific lruvec lock, we need
to pin down the page's lruvec/memcg while locking: just taking the lruvec lock
first can be undermined by a memcg charge/migration of the page. To fix this,
we pull out the clearing of the page's lru bit and use it as the pinning
action that blocks memcg changes. That is the reason for the new atomic
function TestClearPageLRU. So isolating a page now needs both actions:
TestClearPageLRU and holding the lru_lock.
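In code the pattern looks roughly like this; the activate_page() hunk further
down in this thread is a real instance of it:

	if (TestClearPageLRU(page)) {
		/* the lru bit is ours: memcg charge/migration can no
		 * longer move the page to another lruvec under us */
		lruvec = lock_page_lruvec_irq(page);
		/* ... operate on the page in lruvec->lists[] ... */
		unlock_page_lruvec_irq(lruvec);
		SetPageLRU(page);	/* only if the page stays on the lru */
	}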

The typical user of this is isolate_migratepages_block() in compaction.c:
we have to clear the lru bit before taking the lru lock, which serializes page
isolation against memcg page charge/migration, since those can change the
page's lruvec and therefore the lru_lock embedded in it.
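A rough sketch of that ordering in the isolation loop follows. It is not the
literal compaction code (which also handles lock contention, irq flags and
various rechecks), and the relock-style helper name is assumed here, based on
Alexander Duyck's relock_page_lruvec patch in this series:

	if (!TestClearPageLRU(page))
		goto isolate_fail;	/* someone else owns the isolation */

	/*
	 * The lru bit is ours, so the page cannot move to another lruvec;
	 * now it is safe to look up and lock its lruvec, reusing the lock
	 * we may already hold from the previous page.
	 */
	lruvec = relock_page_lruvec_irqsave(page, locked_lruvec, &flags);
	/* ... take the page off lruvec->lists[] ... */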

The above solution was suggested by Johannes Weiner and builds on his new
memcg charge path, which led to this patchset. (Hugh Dickins tested it and
contributed much code, from the compaction fix to general code polish, thanks
a lot!)

Daniel Jordan's testing showed a 62% improvement on a modified readtwice case
on his 2-socket * 10-core * 2-HT Broadwell box with v18, which is not much
different from this v20.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
idea 8 years ago, and to the others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.

Thanks for the testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Alex Shi (17):
  mm/memcg: warning on !memcg after readahead page charged
  mm/memcg: bail early from swap accounting if memcg disabled
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: use head for head page in lru_add_page_tail
  mm/thp: Simplify lru_add_page_tail()
  mm/thp: narrow lru locking
  mm/vmscan: remove unnecessary lruvec adding
  mm/memcg: add debug checking in lock_page_memcg
  mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/lru: move lock into lru_note_cost
  mm/vmscan: remove lruvec reget in move_pages_to_lru
  mm/mlock: remove lru_lock on TestClearPageMlocked
  mm/mlock: remove __munlock_isolate_lru_page
  mm/lru: introduce TestClearPageLRU
  mm/compaction: do page isolation first in compaction
  mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  mm/lru: replace pgdat lru_lock with lruvec lock

Alexander Duyck (1):
  mm/lru: introduce the relock_page_lruvec function

Hugh Dickins (2):
  mm: page_idle_get_page() does not need lru_lock
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +--
 include/linux/memcontrol.h                         | 110 +++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmdebug.h                            |  13 ++
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 mm/compaction.c                                    |  94 +++++++---
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  45 +++--
 mm/memcontrol.c                                    |  85 ++++++++-
 mm/mlock.c                                         |  63 ++-----
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   4 -
 mm/rmap.c                                          |   4 +-
 mm/swap.c                                          | 199 ++++++++------------
 mm/vmscan.c                                        | 203 +++++++++++----------
 mm/workingset.c                                    |   2 -
 22 files changed, 523 insertions(+), 378 deletions(-)

Comments

Alex Shi Nov. 4, 2020, 11:55 a.m. UTC | #1
Hi Johannes & all,

Thanks for all the comments and suggestions. Here is a patch on top of v20
that summarizes what you suggested. Is this ok?

Many thanks!
Alex

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c97292834fa..0fe4172c8c14 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
  * Lockless page tracking & accounting
  * Unified hierarchy configuration model
  * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */

 #include <linux/page_counter.h>
@@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
        return lruvec;
 }

+/**
+ * lock_page_lruvec - return lruvec for the locked page.
+ * @page: the page
+ *
+ * This series functions should be used in either conditions:
+ * PageLRU is cleared or unset
+ * or, page->_refcount is zero
+ */
 struct lruvec *lock_page_lruvec(struct page *page)
 {
        struct lruvec *lruvec;
diff --git a/mm/swap.c b/mm/swap.c
index 9fe5ff9a8111..bcc814de35c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -264,6 +264,13 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
        do {
                unsigned long lrusize;

+               /*
+                * Hold lruvec->lru_lock is safe here, since
+                * 1) The pinned lruvec in reclaim, or
+                * 2) From a pre-LRU page during refault (which also holds the
+                *    rcu lock, so would be safe even if the page was on the LRU
+                *    and could move simultaneously to a new lruvec).
+                */
                spin_lock_irq(&lruvec->lru_lock);
                /* Record cost event */
                if (file)
@@ -355,10 +362,12 @@ static void activate_page(struct page *page)
        struct lruvec *lruvec;

        page = compound_head(page);
-       lruvec = lock_page_lruvec_irq(page);
-       if (PageLRU(page))
+       if (TestClearPageLRU(page)) {
+               lruvec = lock_page_lruvec_irq(page);
                __activate_page(page, lruvec);
-       unlock_page_lruvec_irq(lruvec);
+               unlock_page_lruvec_irq(lruvec);
+               SetPageLRU(page);
+       }
 }
 #endif

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ed10ade548d..af03a7f2e1b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1868,6 +1868,10 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
                        continue;
                }

+               /*
+                * All pages were isolated from the same lruvec (and isolation
+                * inhibits memcg migration).
+                */
                VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
                lru = page_lru(page);
                nr_pages = thp_nr_pages(page);
Johannes Weiner Nov. 4, 2020, 4:59 p.m. UTC | #2
On Wed, Nov 04, 2020 at 07:55:39PM +0800, Alex Shi wrote:
> @@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>         return lruvec;
>  }
> 
> +/**
> + * lock_page_lruvec - return lruvec for the locked page.

I would say "lock and return the lruvec for a given page"

> + * @page: the page
> + *
> + * This series functions should be used in either conditions:
> + * PageLRU is cleared or unset
> + * or, page->_refcount is zero

or page is locked

The other changes look good to me, thanks!
Alex Shi Nov. 5, 2020, 5:07 a.m. UTC | #3
On 2020/11/5 at 12:59 AM, Johannes Weiner wrote:
> On Wed, Nov 04, 2020 at 07:55:39PM +0800, Alex Shi wrote:
>> @@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>>         return lruvec;
>>  }
>>
>> +/**
>> + * lock_page_lruvec - return lruvec for the locked page.
> 
> I would say "lock and return the lruvec for a given page"
> 
>> + * @page: the page
>> + *
>> + * This series functions should be used in either conditions:
>> + * PageLRU is cleared or unset
>> + * or, page->_refcount is zero
> 
> or page is locked
> 
> The other changes look good to me, thanks!
> 

Thanks a lot for both comments!
I will pick them up and send them out in v21.

Thanks!
Alex