Message ID: 1595681998-19193-1-git-send-email-alex.shi@linux.alibaba.com
Series: per memcg lru lock
A standard sequence for new page isolation looks like the following:
1, get_page();              # pin the page so it cannot be freed
2, TestClearPageLRU();      # serialize against other isolation, also memcg changes
3, spin_lock on lru_lock;   # serialize lru list access
Step 2 can be optimized away or replaced in scenarios where the page is
unlikely to be accessed by others. (A code sketch of this sequence follows
below, after the quoted cover letter.)

On 2020/7/25 8:59 PM, Alex Shi wrote:
> This new version is based on v5.8-rc6. It includes Hugh Dickins' fix in
> mm/swap.c and the mm/mlock.c fix which Alexander Duyck pointed out, and
> removes 'mm/mlock: reorder isolation sequence during munlock'.
>
> Hi Johannes, Hugh, Alexander & Willy,
>
> Could you give a Reviewed-by, since you addressed many of the issues and
> offered lots of suggestions? Many thanks!
>
> Currently there is one lru_lock per node, pgdat->lru_lock, guarding the
> lru lists, but the lru lists were moved into memcg long ago. Keeping a
> per-node lru_lock is clearly unscalable: pages in all the memcgs on a
> node have to compete with each other for a single lock. This patchset
> uses a per-lruvec (per-memcg, per-node) lru_lock to replace the per-node
> lock, making the lru lists scalable across memcgs and gaining
> performance.
>
> Currently lru_lock guards both the lru list and the page's lru bit, and
> that is fine. But if we want to take a page's specific lruvec lock, we
> need to pin down the page's lruvec/memcg while locking. Just taking the
> lruvec lock first can be undermined by a concurrent memcg charge or
> migration of the page. To fix this, we clear the page's lru bit first
> and use that as the pin which blocks memcg changes; that is the reason
> for the new atomic function TestClearPageLRU. Isolating a page now
> requires both actions: TestClearPageLRU and holding the lru_lock.
>
> The typical user of this is isolate_migratepages_block() in compaction.c:
> we have to clear the lru bit before taking the lru lock, which serializes
> page isolation against memcg page charge/migration, since the latter
> changes the page's lruvec and therefore the lru_lock covering it.
>
> The above solution was suggested by Johannes Weiner and builds on his
> new memcg charge path, which led to this patchset. (Hugh Dickins tested
> it and contributed much code, from the compaction fix to general code
> polish - thanks a lot!)
>
> The patchset consists of 3 parts:
> 1, some code cleanup and minimal optimization, as preparation
> 2, use TestClearPageLRU as the precondition for page isolation
> 3, replace the per-node lru_lock with a per-memcg, per-node lru_lock
>
> Following Daniel Jordan's suggestion, I have run 208 'dd' tasks in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, readtwice performance increased by about 80% with
> concurrent containers.
>
> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed this
> idea 8 years ago, and to the others who commented as well: Daniel Jordan,
> Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
>
> Thanks for the testing support from Intel 0day and Rong Chen, Fengguang
> Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
>
> Alex Shi (19):
>   mm/vmscan: remove unnecessary lruvec adding
>   mm/page_idle: no unlikely double check for idle page counting
>   mm/compaction: correct the comments of compact_defer_shift
>   mm/compaction: rename compact_deferred as compact_should_defer
>   mm/thp: move lru_add_page_tail func to huge_memory.c
>   mm/thp: clean up lru_add_page_tail
>   mm/thp: remove code path which never got into
>   mm/thp: narrow lru locking
>   mm/memcg: add debug checking in lock_page_memcg
>   mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
>   mm/lru: move lru_lock holding in func lru_note_cost_page
>   mm/lru: move lock into lru_note_cost
>   mm/lru: introduce TestClearPageLRU
>   mm/compaction: do page isolation first in compaction
>   mm/thp: add tail pages into lru anyway in split_huge_page()
>   mm/swap: serialize memcg changes in pagevec_lru_move_fn
>   mm/lru: replace pgdat lru_lock with lruvec lock
>   mm/lru: introduce the relock_page_lruvec function
>   mm/pgdat: remove pgdat lru_lock
>
> Hugh Dickins (2):
>   mm/vmscan: use relock for move_pages_to_lru
>   mm/lru: revise the comments of lru_lock
>
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
>  Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
>  Documentation/trace/events-kmem.rst                |   2 +-
>  Documentation/vm/unevictable-lru.rst               |  22 +--
>  include/linux/compaction.h                         |   4 +-
>  include/linux/memcontrol.h                         |  98 ++++++++++
>  include/linux/mm_types.h                           |   2 +-
>  include/linux/mmzone.h                             |   6 +-
>  include/linux/page-flags.h                         |   1 +
>  include/linux/swap.h                               |   4 +-
>  include/trace/events/compaction.h                  |   2 +-
>  mm/compaction.c                                    | 113 ++++++++----
>  mm/filemap.c                                       |   4 +-
>  mm/huge_memory.c                                   |  48 +++--
>  mm/memcontrol.c                                    |  71 ++++++-
>  mm/memory.c                                        |   3 -
>  mm/mlock.c                                         |  43 +++--
>  mm/mmzone.c                                        |   1 +
>  mm/page_alloc.c                                    |   1 -
>  mm/page_idle.c                                     |   8 -
>  mm/rmap.c                                          |   4 +-
>  mm/swap.c                                          | 203 ++++++++-------------
>  mm/swap_state.c                                    |   2 -
>  mm/vmscan.c                                        | 174 ++++++++++--------
>  mm/workingset.c                                    |   2 -
>  25 files changed, 510 insertions(+), 344 deletions(-)
>
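To make the three-step isolation sequence at the top of this mail
concrete, here is a minimal sketch in kernel style. It is illustrative
only, not code from the patchset: the helper name isolate_one_page_sketch
is made up, and lock_page_lruvec_irq()/unlock_page_lruvec_irq() are
assumed to be the lruvec locking helpers this series introduces.

/*
 * Illustrative sketch of the isolation protocol described above.
 * 1) pin the page so it cannot be freed under us;
 * 2) clear PageLRU, which blocks other isolators and memcg moves;
 * 3) with PageLRU clear, the page's lruvec is stable, so lock it.
 */
static bool isolate_one_page_sketch(struct page *page)
{
	struct lruvec *lruvec;

	if (!get_page_unless_zero(page))	/* step 1: pin */
		return false;

	if (!TestClearPageLRU(page)) {		/* step 2: lost to another isolator */
		put_page(page);
		return false;
	}

	lruvec = lock_page_lruvec_irq(page);	/* step 3: lru list access */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	unlock_page_lruvec_irq(lruvec);

	return true;
}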
Are there any comments or suggestions for this patchset?
Any hints will be much appreciated.

Thanks
Alex

On 2020/7/27 1:40 PM, Alex Shi wrote:
> A standard sequence for new page isolation looks like the following:
> 1, get_page();              # pin the page so it cannot be freed
> 2, TestClearPageLRU();      # serialize against other isolation, also memcg changes
> 3, spin_lock on lru_lock;   # serialize lru list access
> Step 2 can be optimized away or replaced in scenarios where the page is
> unlikely to be accessed by others.
>
>
>
> On 2020/7/25 8:59 PM, Alex Shi wrote:
>> This new version is based on v5.8-rc6. It includes Hugh Dickins' fix in
>> mm/swap.c and the mm/mlock.c fix which Alexander Duyck pointed out, and
>> removes 'mm/mlock: reorder isolation sequence during munlock'.
>>
>> Hi Johannes, Hugh, Alexander & Willy,
>>
>> Could you give a Reviewed-by, since you addressed many of the issues and
>> offered lots of suggestions? Many thanks!
On Wed, 29 Jul 2020, Alex Shi wrote:
>
> Are there any comments or suggestions for this patchset?
> Any hints will be much appreciated.

Alex: it is now v5.8-rc7, obviously too late for this patchset to make
v5.9, so I'm currently concentrating on checking some patches headed for
v5.9 (and some bugfix patches of my own that I haven't had time to send):
I'll get back to responding on lru_lock in a week or two's time.

Hugh
On 2020/7/30 2:06 AM, Hugh Dickins wrote:
> On Wed, 29 Jul 2020, Alex Shi wrote:
>>
>> Are there any comments or suggestions for this patchset?
>> Any hints will be much appreciated.
>
> Alex: it is now v5.8-rc7, obviously too late for this patchset to make
> v5.9, so I'm currently concentrating on checking some patches headed for
> v5.9 (and some bugfix patches of my own that I haven't had time to send):
> I'll get back to responding on lru_lock in a week or two's time.

Hi Hugh,

Thanks a lot for the response! It's fine to wait longer, but things would
be more efficient if the review were concentrated... I am still quite new
to the mm area.

Thanks
Alex
On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This new version is based on v5.8-rc6. It includes Hugh Dickins' fix in
> mm/swap.c and the mm/mlock.c fix which Alexander Duyck pointed out, and
> removes 'mm/mlock: reorder isolation sequence during munlock'.
>
> Hi Johannes, Hugh, Alexander & Willy,
>
> Could you give a Reviewed-by, since you addressed many of the issues and
> offered lots of suggestions? Many thanks!
>

I just finished getting a test pass done on the patches. I'm still seeing
a regression on the will-it-scale/page_fault3 test, but it is now only 3%
instead of the 20% it was, so it may just be noise at this point. I'll
try to make sure to get my review feedback wrapped up early next week.

Thanks.

- Alex
On Thu 30-07-20 10:16:13, Alex Shi wrote:
>
>
> On 2020/7/30 2:06 AM, Hugh Dickins wrote:
> > On Wed, 29 Jul 2020, Alex Shi wrote:
> >>
> >> Are there any comments or suggestions for this patchset?
> >> Any hints will be much appreciated.
> >
> > Alex: it is now v5.8-rc7, obviously too late for this patchset to make
> > v5.9, so I'm currently concentrating on checking some patches headed for
> > v5.9 (and some bugfix patches of my own that I haven't had time to send):
> > I'll get back to responding on lru_lock in a week or two's time.
>
> Hi Hugh,
>
> Thanks a lot for the response! It's fine to wait longer, but things
> would be more efficient if the review were concentrated... I am still
> quite new to the mm area.

I am sorry, and I owe you a review, but it is hard to find time for that.
This is a large change and the review will be far from trivial. If this
version is mostly stable, then I would recommend not posting new versions
and simply reminding the people you expect reviews from with a targeted
ping.
On 2020/8/3 11:07 PM, Michal Hocko wrote:
> On Thu 30-07-20 10:16:13, Alex Shi wrote:
>>
>>
>> On 2020/7/30 2:06 AM, Hugh Dickins wrote:
>>> On Wed, 29 Jul 2020, Alex Shi wrote:
>>>>
>>>> Are there any comments or suggestions for this patchset?
>>>> Any hints will be much appreciated.
>>>
>>> Alex: it is now v5.8-rc7, obviously too late for this patchset to make
>>> v5.9, so I'm currently concentrating on checking some patches headed for
>>> v5.9 (and some bugfix patches of my own that I haven't had time to send):
>>> I'll get back to responding on lru_lock in a week or two's time.
>>
>> Hi Hugh,
>>
>> Thanks a lot for the response! It's fine to wait longer, but things
>> would be more efficient if the review were concentrated... I am still
>> quite new to the mm area.
>
> I am sorry, and I owe you a review, but it is hard to find time for that.
> This is a large change and the review will be far from trivial. If this
> version is mostly stable, then I would recommend not posting new versions
> and simply reminding the people you expect reviews from with a targeted
> ping.
>

Hi Michal,

Thanks a lot for the reminder! Except for an update to patch
[PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function
from Alexander, the patchset is stable on 5.8. On linux-next there are
the hpage_nr_pages -> thp_nr_pages function rename and the lru_note_cost
changes, so the patches need an update there.

And I have 3 more patches following this patchset, doing cleanup and
optimization. Are they worth a new patchset, or should I just post the
updates here?

Thanks
Alex
From 6f3ac2a72448291a88f50df836d829a23e7df736 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Jul 2020 22:52:11 +0800
Subject: [PATCH 2/3] mm/mlock: remove __munlock_isolate_lru_page

The function has only one caller; remove it to clean up and simplify the
code.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 46a05e6ec5ba..40a8bb79c65e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -102,23 +102,6 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page,
-				       struct lruvec *lruvec, bool getpage)
-{
-	if (TestClearPageLRU(page)) {
-		if (getpage)
-			get_page(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -300,7 +283,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+			if (TestClearPageLRU(page)) {
+				enum lru_list lru = page_lru(page);
+
+				del_page_from_lru_list(page, lruvec, lru);
 				unlock_page_memcg(page);
 				continue;
 			} else
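A short note on the design choice above: with the helper gone, the
TestClearPageLRU()/del_page_from_lru_list() pairing is visible right at
the only remaining call site, next to the comment explaining that the pin
from follow_page_mask() makes an extra get_page() unnecessary. The removed
helper's getpage argument had also lost its last 'true' user once patch
1/3 switched munlock_vma_page() over to isolate_lru_page().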
From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 1 Aug 2020 22:49:31 +0800
Subject: [PATCH 3/3] mm/swap.c: optimizing __pagevec_lru_add lru_lock

The current relock scheme unlocks and relocks lru_lock every time the
lruvec changes, which causes frequent relocking when, for example, two
memcgs are reading files simultaneously. This patch records the involved
lru_locks and holds each of them only once in the above scenario, which
reduces the lock contention.

Per-cpu data is used instead of local stack data to avoid repeated
INIT_LIST_HEAD calls.

[lkp@intel.com: found a build issue in the original patch, thanks]
Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index d88a6c650a7c..e227fec6983c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -72,6 +72,27 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
+struct pvlvs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct pvlvs, pvlvs);
+
+static int __init pvlvs_init(void) {
+	int i, cpu;
+	struct pvlvs *pvecs;
+
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_possible(cpu))
+			continue;
+		pvecs = per_cpu_ptr(&pvlvs, cpu);
+		for (i = 0; i < PAGEVEC_SIZE; i++)
+			INIT_LIST_HEAD(&pvecs->lists[i]);
+	}
+	return 0;
+}
+subsys_initcall(pvlvs_init);
+
 /*
  * This path almost never happens for VM activity - pages are normally
  * freed via pagevecs. But it gets used by networking.
@@ -963,18 +984,42 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j, total = 0;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
+	struct page *page;
+	struct pvlvs *lvs = this_cpu_ptr(&pvlvs);
 
+	/* Sort the same lruvec pages on a list. */
 	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+		/* Try to find a same lruvec */
+		for (j = 0; j <= total; j++)
+			if (lruvec == lvs->vecs[j])
+				break;
+		/* A new lruvec */
+		if (j > total) {
+			lvs->vecs[total] = lruvec;
+			j = total;
+			total++;
+		}
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+		list_add(&page->lru, &lvs->lists[j]);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
+	for (i = 0; i < total; i++) {
+		spin_lock_irqsave(&lvs->vecs[i]->lru_lock, flags);
+		while (!list_empty(&lvs->lists[i])) {
+			page = lru_to_page(&lvs->lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lvs->vecs[i]);
+		}
+		spin_unlock_irqrestore(&lvs->vecs[i]->lru_lock, flags);
+		lvs->vecs[i] = NULL;
+	}
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
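A rough illustration of what the sorting buys (my numbers, not a claim
from the patch itself): PAGEVEC_SIZE is 15 here, so in the worst case of
pages alternating between two lruvecs the old relock loop could take and
drop a lru_lock up to 15 times per pagevec, while the sorted version
takes each of the two locks exactly once. The linear search through
lvs->vecs is likewise bounded by PAGEVEC_SIZE, so the quadratic-looking
matching loop stays cheap.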
From 0696a2a4a8ca5e9bf62f208126ea4af7727d2edc Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Jul 2020 22:31:03 +0800
Subject: [PATCH 1/3] mm/mlock: remove lru_lock on TestClearPageMlocked in
 munlock_vma_page

In munlock_vma_page() the page must be PageLocked, just like the pages
handled by the split_huge_page() series of functions; PageLocked is
therefore enough to serialize the two paths, and we can relieve the
TestClearPageMlocked/thp_nr_pages sequence of the lru lock, which is not
necessary here.

The other munlock function, __munlock_pagevec(), has no PageLocked
protection and must keep the lru lock.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0448409184e3..46a05e6ec5ba 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -69,9 +69,9 @@ void clear_page_mlock(struct page *page)
 	 *
 	 * See __pagevec_lru_add_fn for more explanation.
 	 */
-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page))
 		putback_lru_page(page);
-	} else {
+	else {
 		/*
 		 * We lost the race. the page already moved to evictable list.
 		 */
@@ -178,7 +178,6 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -186,37 +185,22 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize split tail pages in __split_huge_page_tail() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes thp_nr_pages().
-	 * TestClearPageLRU can't be used here to block page isolation, since
-	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
-	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
-	 * wrong lru list here. So relay on PageLocked to stop lruvec change
-	 * in mem_cgroup_move_account().
+	 * Serialize split tail pages in __split_huge_page_tail() by
+	 * lock_page(); Do TestClearPageMlocked/PageLRU sequence like
+	 * clear_page_mlock().
 	 */
-	lruvec = lock_page_lruvec_irq(page);
-
-	if (!TestClearPageMlocked(page)) {
+	if (!TestClearPageMlocked(page))
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
-	}
+		return 0;
 
 	nr_pages = thp_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, lruvec, true)) {
-		unlock_page_lruvec_irq(lruvec);
+	if (!isolate_lru_page(page))
 		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	unlock_page_lruvec_irq(lruvec);
+	else
+		__munlock_isolation_failed(page);
 
-out:
 	return nr_pages - 1;
 }
 
@@ -305,6 +289,11 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 
 		/* block memcg change in mem_cgroup_move_account */
 		lock_page_memcg(page);
+		/*
+		 * Serialize split tail pages in __split_huge_page_tail() which
+		 * might otherwise copy PageMlocked to part of the tail pages
+		 * before we clear it in the head page.
+		 */
 		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (TestClearPageMlocked(page)) {
 			/*
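Restating the serialization argument of the patch above: munlock_vma_page()
always runs with the page lock held (the BUG_ON(!PageLocked(page)) above
guarantees it), and split_huge_page() likewise requires the head page to
be locked, so lock_page() alone already orders TestClearPageMlocked() here
against the tail-page flag copying in __split_huge_page_tail(); the
lru_lock round trip added no extra protection on this path.
__munlock_pagevec() has no such page-lock guarantee, which is why it keeps
lock_page_memcg() plus relock_page_lruvec_irq().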
From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 4 Aug 2020 16:20:02 +0800
Subject: [PATCH 0/3] optimizations following per memcg lru_lock
The first 2 patches are code cleanups, and the 3rd one is an lru_add optimization.
Alex Shi (3):
mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
mm/mlock: remove __munlock_isolate_lru_page
mm/swap.c: optimizing __pagevec_lru_add lru_lock
mm/mlock.c | 63 +++++++++++++++++++-------------------------------------------
mm/swap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 70 insertions(+), 50 deletions(-)