Message ID: 1595681998-19193-1-git-send-email-alex.shi@linux.alibaba.com
Series: per memcg lru lock
A standard sequence for new page isolation looks like the following:
1, get_page();              # pin the page so it cannot be freed
2, TestClearPageLRU();      # serialize against other isolation, also memcg changes
3, spin_lock on lru_lock;   # serialize lru list access
Step 2 can be optimized away or replaced in scenarios where the page is
unlikely to be accessed by others. (A code sketch of this sequence follows
below, after the quoted cover letter.)

On 2020/7/25 8:59 PM, Alex Shi wrote:
> This new version is based on v5.8-rc6. It includes Hugh Dickins' fix in
> mm/swap.c and the mm/mlock.c fix which Alexander Duyck pointed out, and
> removes 'mm/mlock: reorder isolation sequence during munlock'.
>
> Hi Johannes, Hugh, Alexander & Willy,
>
> Could you give a Reviewed-by, since you addressed many of the issues and
> offered lots of suggestions? Many thanks!
>
> Currently there is one lru_lock per node, pgdat->lru_lock, guarding the
> lru lists, but the lru lists were moved into memcg long ago. Keeping a
> per-node lru_lock is clearly unscalable: pages in all the memcgs on a
> node have to compete with each other for a single lock. This patchset
> uses a per-lruvec (per-memcg, per-node) lru_lock to replace the per-node
> lock, making the lru lists scalable across memcgs and gaining
> performance.
>
> Currently lru_lock guards both the lru list and the page's lru bit, and
> that is fine. But if we want to take a page's specific lruvec lock, we
> need to pin down the page's lruvec/memcg while locking. Just taking the
> lruvec lock first can be undermined by a concurrent memcg charge or
> migration of the page. To fix this, we clear the page's lru bit first
> and use that as the pin which blocks memcg changes; that is the reason
> for the new atomic function TestClearPageLRU. Isolating a page now
> requires both actions: TestClearPageLRU and holding the lru_lock.
>
> The typical user of this is isolate_migratepages_block() in compaction.c:
> we have to clear the lru bit before taking the lru lock, which serializes
> page isolation against memcg page charge/migration, since the latter
> changes the page's lruvec and therefore the lru_lock covering it.
>
> The above solution was suggested by Johannes Weiner and builds on his
> new memcg charge path, which led to this patchset. (Hugh Dickins tested
> it and contributed much code, from the compaction fix to general code
> polish - thanks a lot!)
>
> The patchset consists of 3 parts:
> 1, some code cleanup and minimal optimization, as preparation
> 2, use TestClearPageLRU as the precondition for page isolation
> 3, replace the per-node lru_lock with a per-memcg, per-node lru_lock
>
> Following Daniel Jordan's suggestion, I have run 208 'dd' tasks in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, readtwice performance increased by about 80% with
> concurrent containers.
>
> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed this
> idea 8 years ago, and to the others who commented as well: Daniel Jordan,
> Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
>
> Thanks for the testing support from Intel 0day and Rong Chen, Fengguang
> Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
>
> Alex Shi (19):
>   mm/vmscan: remove unnecessary lruvec adding
>   mm/page_idle: no unlikely double check for idle page counting
>   mm/compaction: correct the comments of compact_defer_shift
>   mm/compaction: rename compact_deferred as compact_should_defer
>   mm/thp: move lru_add_page_tail func to huge_memory.c
>   mm/thp: clean up lru_add_page_tail
>   mm/thp: remove code path which never got into
>   mm/thp: narrow lru locking
>   mm/memcg: add debug checking in lock_page_memcg
>   mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
>   mm/lru: move lru_lock holding in func lru_note_cost_page
>   mm/lru: move lock into lru_note_cost
>   mm/lru: introduce TestClearPageLRU
>   mm/compaction: do page isolation first in compaction
>   mm/thp: add tail pages into lru anyway in split_huge_page()
>   mm/swap: serialize memcg changes in pagevec_lru_move_fn
>   mm/lru: replace pgdat lru_lock with lruvec lock
>   mm/lru: introduce the relock_page_lruvec function
>   mm/pgdat: remove pgdat lru_lock
>
> Hugh Dickins (2):
>   mm/vmscan: use relock for move_pages_to_lru
>   mm/lru: revise the comments of lru_lock
>
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
>  Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
>  Documentation/trace/events-kmem.rst                |   2 +-
>  Documentation/vm/unevictable-lru.rst               |  22 +--
>  include/linux/compaction.h                         |   4 +-
>  include/linux/memcontrol.h                         |  98 ++++++++++
>  include/linux/mm_types.h                           |   2 +-
>  include/linux/mmzone.h                             |   6 +-
>  include/linux/page-flags.h                         |   1 +
>  include/linux/swap.h                               |   4 +-
>  include/trace/events/compaction.h                  |   2 +-
>  mm/compaction.c                                    | 113 ++++++++----
>  mm/filemap.c                                       |   4 +-
>  mm/huge_memory.c                                   |  48 +++--
>  mm/memcontrol.c                                    |  71 ++++++-
>  mm/memory.c                                        |   3 -
>  mm/mlock.c                                         |  43 +++--
>  mm/mmzone.c                                        |   1 +
>  mm/page_alloc.c                                    |   1 -
>  mm/page_idle.c                                     |   8 -
>  mm/rmap.c                                          |   4 +-
>  mm/swap.c                                          | 203 ++++++++-------------
>  mm/swap_state.c                                    |   2 -
>  mm/vmscan.c                                        | 174 ++++++++++--------
>  mm/workingset.c                                    |   2 -
>  25 files changed, 510 insertions(+), 344 deletions(-)
>
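To make the three-step isolation sequence at the top of this mail
concrete, here is a minimal sketch in kernel style. It is illustrative
only, not code from the patchset: the helper name isolate_one_page_sketch
is made up, and lock_page_lruvec_irq()/unlock_page_lruvec_irq() are
assumed to be the lruvec locking helpers this series introduces.

/*
 * Illustrative sketch of the isolation protocol described above.
 * 1) pin the page so it cannot be freed under us;
 * 2) clear PageLRU, which blocks other isolators and memcg moves;
 * 3) with PageLRU clear, the page's lruvec is stable, so lock it.
 */
static bool isolate_one_page_sketch(struct page *page)
{
	struct lruvec *lruvec;

	if (!get_page_unless_zero(page))	/* step 1: pin */
		return false;

	if (!TestClearPageLRU(page)) {		/* step 2: lost to another isolator */
		put_page(page);
		return false;
	}

	lruvec = lock_page_lruvec_irq(page);	/* step 3: lru list access */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	unlock_page_lruvec_irq(lruvec);

	return true;
}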
Are there any comments or suggestions for this patchset?
Any hints will be much appreciated.

Thanks
Alex

On 2020/7/27 1:40 PM, Alex Shi wrote:
> A standard sequence for new page isolation looks like the following:
> 1, get_page();              # pin the page so it cannot be freed
> 2, TestClearPageLRU();      # serialize against other isolation, also memcg changes
> 3, spin_lock on lru_lock;   # serialize lru list access
> Step 2 can be optimized away or replaced in scenarios where the page is
> unlikely to be accessed by others.
>
>
>
> On 2020/7/25 8:59 PM, Alex Shi wrote:
>> This new version is based on v5.8-rc6. It includes Hugh Dickins' fix in
>> mm/swap.c and the mm/mlock.c fix which Alexander Duyck pointed out, and
>> removes 'mm/mlock: reorder isolation sequence during munlock'.
>>
>> Hi Johannes, Hugh, Alexander & Willy,
>>
>> Could you give a Reviewed-by, since you addressed many of the issues and
>> offered lots of suggestions? Many thanks!
On Wed, 29 Jul 2020, Alex Shi wrote:
>
> Are there any comments or suggestions for this patchset?
> Any hints will be much appreciated.

Alex: it is now v5.8-rc7, obviously too late for this patchset to make
v5.9, so I'm currently concentrating on checking some patches headed for
v5.9 (and some bugfix patches of my own that I haven't had time to send):
I'll get back to responding on lru_lock in a week or two's time.

Hugh
On 2020/7/30 2:06 AM, Hugh Dickins wrote:
> On Wed, 29 Jul 2020, Alex Shi wrote:
>>
>> Are there any comments or suggestions for this patchset?
>> Any hints will be much appreciated.
>
> Alex: it is now v5.8-rc7, obviously too late for this patchset to make
> v5.9, so I'm currently concentrating on checking some patches headed for
> v5.9 (and some bugfix patches of my own that I haven't had time to send):
> I'll get back to responding on lru_lock in a week or two's time.

Hi Hugh,

Thanks a lot for the response! It's fine to wait longer, but things would
be more efficient if the review were concentrated... I am still quite new
to the mm area.

Thanks
Alex
On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This new version is based on v5.8-rc6. It includes Hugh Dickins' fix in
> mm/swap.c and the mm/mlock.c fix which Alexander Duyck pointed out, and
> removes 'mm/mlock: reorder isolation sequence during munlock'.
>
> Hi Johannes, Hugh, Alexander & Willy,
>
> Could you give a Reviewed-by, since you addressed many of the issues and
> offered lots of suggestions? Many thanks!
>

I just finished getting a test pass done on the patches. I'm still seeing
a regression on the will-it-scale/page_fault3 test, but it is now only 3%
instead of the 20% it was, so it may just be noise at this point. I'll
try to make sure to get my review feedback wrapped up early next week.

Thanks.

- Alex
On Thu 30-07-20 10:16:13, Alex Shi wrote:
>
>
> On 2020/7/30 2:06 AM, Hugh Dickins wrote:
> > On Wed, 29 Jul 2020, Alex Shi wrote:
> >>
> >> Are there any comments or suggestions for this patchset?
> >> Any hints will be much appreciated.
> >
> > Alex: it is now v5.8-rc7, obviously too late for this patchset to make
> > v5.9, so I'm currently concentrating on checking some patches headed for
> > v5.9 (and some bugfix patches of my own that I haven't had time to send):
> > I'll get back to responding on lru_lock in a week or two's time.
>
> Hi Hugh,
>
> Thanks a lot for the response! It's fine to wait longer, but things
> would be more efficient if the review were concentrated... I am still
> quite new to the mm area.

I am sorry, and I owe you a review, but it is hard to find time for that.
This is a large change and the review will be far from trivial. If this
version is mostly stable, then I would recommend not posting new versions
and simply reminding the people you expect reviews from with a targeted
ping.
On 2020/8/3 11:07 PM, Michal Hocko wrote:
> On Thu 30-07-20 10:16:13, Alex Shi wrote:
>>
>>
>> On 2020/7/30 2:06 AM, Hugh Dickins wrote:
>>> On Wed, 29 Jul 2020, Alex Shi wrote:
>>>>
>>>> Are there any comments or suggestions for this patchset?
>>>> Any hints will be much appreciated.
>>>
>>> Alex: it is now v5.8-rc7, obviously too late for this patchset to make
>>> v5.9, so I'm currently concentrating on checking some patches headed for
>>> v5.9 (and some bugfix patches of my own that I haven't had time to send):
>>> I'll get back to responding on lru_lock in a week or two's time.
>>
>> Hi Hugh,
>>
>> Thanks a lot for the response! It's fine to wait longer, but things
>> would be more efficient if the review were concentrated... I am still
>> quite new to the mm area.
>
> I am sorry, and I owe you a review, but it is hard to find time for that.
> This is a large change and the review will be far from trivial. If this
> version is mostly stable, then I would recommend not posting new versions
> and simply reminding the people you expect reviews from with a targeted
> ping.
>

Hi Michal,

Thanks a lot for the reminder! Except for an update to patch
[PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function
from Alexander, the patchset is stable on 5.8. On linux-next there are
the hpage_nr_pages -> thp_nr_pages function rename and the lru_note_cost
changes, so the patches need an update there.

And I have 3 more patches following this patchset, doing cleanup and
optimization. Are they worth a new patchset, or should I just post the
updates here?

Thanks
Alex
From 6f3ac2a72448291a88f50df836d829a23e7df736 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Jul 2020 22:52:11 +0800
Subject: [PATCH 2/3] mm/mlock: remove __munlock_isolate_lru_page

The function has only one caller; remove it to clean up and simplify the
code.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 46a05e6ec5ba..40a8bb79c65e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -102,23 +102,6 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page,
-				       struct lruvec *lruvec, bool getpage)
-{
-	if (TestClearPageLRU(page)) {
-		if (getpage)
-			get_page(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -300,7 +283,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+			if (TestClearPageLRU(page)) {
+				enum lru_list lru = page_lru(page);
+
+				del_page_from_lru_list(page, lruvec, lru);
 				unlock_page_memcg(page);
 				continue;
 			} else
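A short note on the design choice above: with the helper gone, the
TestClearPageLRU()/del_page_from_lru_list() pairing is visible right at
the only remaining call site, next to the comment explaining that the pin
from follow_page_mask() makes an extra get_page() unnecessary. The removed
helper's getpage argument had also lost its last 'true' user once patch
1/3 switched munlock_vma_page() over to isolate_lru_page().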
From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 1 Aug 2020 22:49:31 +0800
Subject: [PATCH 3/3] mm/swap.c: optimizing __pagevec_lru_add lru_lock

The current relock scheme unlocks and relocks lru_lock every time the
lruvec changes, which causes frequent relocking when, for example, two
memcgs are reading files simultaneously. This patch records the involved
lru_locks and holds each of them only once in the above scenario, which
reduces the lock contention.

Per-cpu data is used instead of local stack data to avoid repeated
INIT_LIST_HEAD calls.

[lkp@intel.com: found a build issue in the original patch, thanks]
Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index d88a6c650a7c..e227fec6983c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -72,6 +72,27 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
+struct pvlvs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct pvlvs, pvlvs);
+
+static int __init pvlvs_init(void) {
+	int i, cpu;
+	struct pvlvs *pvecs;
+
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_possible(cpu))
+			continue;
+		pvecs = per_cpu_ptr(&pvlvs, cpu);
+		for (i = 0; i < PAGEVEC_SIZE; i++)
+			INIT_LIST_HEAD(&pvecs->lists[i]);
+	}
+	return 0;
+}
+subsys_initcall(pvlvs_init);
+
 /*
  * This path almost never happens for VM activity - pages are normally
  * freed via pagevecs. But it gets used by networking.
@@ -963,18 +984,42 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j, total = 0;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
+	struct page *page;
+	struct pvlvs *lvs = this_cpu_ptr(&pvlvs);
 
+	/* Sort the same lruvec pages on a list. */
 	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+		/* Try to find a same lruvec */
+		for (j = 0; j <= total; j++)
+			if (lruvec == lvs->vecs[j])
+				break;
+		/* A new lruvec */
+		if (j > total) {
+			lvs->vecs[total] = lruvec;
+			j = total;
+			total++;
+		}
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+		list_add(&page->lru, &lvs->lists[j]);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
+	for (i = 0; i < total; i++) {
+		spin_lock_irqsave(&lvs->vecs[i]->lru_lock, flags);
+		while (!list_empty(&lvs->lists[i])) {
+			page = lru_to_page(&lvs->lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lvs->vecs[i]);
+		}
+		spin_unlock_irqrestore(&lvs->vecs[i]->lru_lock, flags);
+		lvs->vecs[i] = NULL;
+	}
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
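A rough illustration of what the sorting buys (my numbers, not a claim
from the patch itself): PAGEVEC_SIZE is 15 here, so in the worst case of
pages alternating between two lruvecs the old relock loop could take and
drop a lru_lock up to 15 times per pagevec, while the sorted version
takes each of the two locks exactly once. The linear search through
lvs->vecs is likewise bounded by PAGEVEC_SIZE, so the quadratic-looking
matching loop stays cheap.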
From 0696a2a4a8ca5e9bf62f208126ea4af7727d2edc Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Jul 2020 22:31:03 +0800
Subject: [PATCH 1/3] mm/mlock: remove lru_lock on TestClearPageMlocked in
 munlock_vma_page

In munlock_vma_page() the page must be PageLocked, just like the pages
handled by the split_huge_page() series of functions; PageLocked is
therefore enough to serialize the two paths, and we can relieve the
TestClearPageMlocked/thp_nr_pages sequence of the lru lock, which is not
necessary here.

The other munlock function, __munlock_pagevec(), has no PageLocked
protection and must keep the lru lock.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0448409184e3..46a05e6ec5ba 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -69,9 +69,9 @@ void clear_page_mlock(struct page *page)
 	 *
 	 * See __pagevec_lru_add_fn for more explanation.
 	 */
-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page))
 		putback_lru_page(page);
-	} else {
+	else {
 		/*
 		 * We lost the race. the page already moved to evictable list.
 		 */
@@ -178,7 +178,6 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -186,37 +185,22 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize split tail pages in __split_huge_page_tail() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes thp_nr_pages().
-	 * TestClearPageLRU can't be used here to block page isolation, since
-	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
-	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
-	 * wrong lru list here. So relay on PageLocked to stop lruvec change
-	 * in mem_cgroup_move_account().
+	 * Serialize split tail pages in __split_huge_page_tail() by
+	 * lock_page(); Do TestClearPageMlocked/PageLRU sequence like
+	 * clear_page_mlock().
 	 */
-	lruvec = lock_page_lruvec_irq(page);
-
-	if (!TestClearPageMlocked(page)) {
+	if (!TestClearPageMlocked(page))
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
-	}
+		return 0;
 
 	nr_pages = thp_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, lruvec, true)) {
-		unlock_page_lruvec_irq(lruvec);
+	if (!isolate_lru_page(page))
 		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	unlock_page_lruvec_irq(lruvec);
+	else
+		__munlock_isolation_failed(page);
 
-out:
 	return nr_pages - 1;
 }
 
@@ -305,6 +289,11 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 
 		/* block memcg change in mem_cgroup_move_account */
 		lock_page_memcg(page);
+		/*
+		 * Serialize split tail pages in __split_huge_page_tail() which
+		 * might otherwise copy PageMlocked to part of the tail pages
+		 * before we clear it in the head page.
+		 */
 		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (TestClearPageMlocked(page)) {
 			/*
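Restating the serialization argument of the patch above: munlock_vma_page()
always runs with the page lock held (the BUG_ON(!PageLocked(page)) above
guarantees it), and split_huge_page() likewise requires the head page to
be locked, so lock_page() alone already orders TestClearPageMlocked() here
against the tail-page flag copying in __split_huge_page_tail(); the
lru_lock round trip added no extra protection on this path.
__munlock_pagevec() has no such page-lock guarantee, which is why it keeps
lock_page_memcg() plus relock_page_lruvec_irq().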
From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 4 Aug 2020 16:20:02 +0800
Subject: [PATCH 0/3] optimizations following per memcg lru_lock
The first 2 patches are code cleanups, and the 3rd one is an lru_add optimization.
Alex Shi (3):
mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
mm/mlock: remove __munlock_isolate_lru_page
mm/swap.c: optimizing __pagevec_lru_add lru_lock
mm/mlock.c | 63 +++++++++++++++++++-------------------------------------------
mm/swap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 70 insertions(+), 50 deletions(-)