[00/14] per memcg lru_lock

Message ID 1566294517-86418-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)

Message

Alex Shi Aug. 20, 2019, 9:48 a.m. UTC
This patchset moves lru_lock into the lruvec, giving each lruvec its
own lru_lock and thus each memcg its own lru_lock.
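
The core of the change is small; as a sketch (field layout is
illustrative only, not the exact diff):

        /* before: one lru_lock for the whole node */
        struct pglist_data {
                /* ... */
                spinlock_t lru_lock;
                /* ... */
        };

        /* after: one lru_lock per lruvec, i.e. per memcg per node */
        struct lruvec {
                struct list_head lists[NR_LRU_LISTS];
                spinlock_t lru_lock;
                /* ... */
        };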

With this patch series, the per-memcg lru_lock eases lru_lock
contention a lot.

In some data centers, containers are widely used to deploy different
kinds of services; multiple memcgs then share the per-node
pgdat->lru_lock, which causes heavy lock contention during lru operations.

On my 2-socket, 6-cores-per-socket E5-2630 platform, 24 containers run
aim9 simultaneously with this mmtests config:
        # AIM9
        export AIM9_TESTTIME=180
        export AIM9_TESTLIST=page_test,brk_test

A perf lock report shows heavy contention on lru_lock in a 20-second snapshot:
                        Name   acquired  contended   avg wait (ns) total wait (ns)   max wait (ns)   min wait (ns)
        &(ptlock_ptr(pag...         22          0               0       0               0               0
        ...
        &(&pgdat->lru_lo...          9          7           12728       89096           26656            1597
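
(For reference, such a snapshot can be collected with something like the
following; the exact flags may vary with the perf version:)

        perf lock record -a -- sleep 20
        perf lock report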

With this patch series, lruvec->lru_lock shows no contention
        &(&lruvec->lru_l...          8          0               0       0               0               0

and aim9 page_test/brk_test performance increased by 5%~50%.
Detailed results are in aim9-pft.compare.log if needed;
all containers' numbers increased and are pretty steady.

$for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i page_test ============"; cat aim9-pft.compare.log | grep "^$i.*page_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
========= Max page_test ============
5.3-rc4          34729.6
5.3-rc4+lru_lock 36128.3
========= Min page_test ============
5.3-rc4          33644.2
5.3-rc4+lru_lock 35349.7
========= Hmean page_test ============
5.3-rc4          34355.4
5.3-rc4+lru_lock 35810.9
========= Stddev page_test ============
5.3-rc4          319.757
5.3-rc4+lru_lock 223.324
========= CoeffVar page_test ============
5.3-rc4          0.93125
5.3-rc4+lru_lock 0.623333
========= BHmean-50 page_test ============
5.3-rc4          34579.2
5.3-rc4+lru_lock 35977.1
========= BHmean-95 page_test ============
5.3-rc4          34421.7
5.3-rc4+lru_lock 35853.6
========= BHmean-99 page_test ============
5.3-rc4          34421.7
5.3-rc4+lru_lock 35853.6

$for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i brk_test ============"; cat aim9-pft.compare.log | grep "^$i.*brk_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
========= Max brk_test ============
5.3-rc4          96647.7
5.3-rc4+lru_lock 98960.3
========= Min brk_test ============
5.3-rc4          91800.8
5.3-rc4+lru_lock 96817.6
========= Hmean brk_test ============
5.3-rc4          95470
5.3-rc4+lru_lock 97769.6
========= Stddev brk_test ============
5.3-rc4          1253.52
5.3-rc4+lru_lock 596.593
========= CoeffVar brk_test ============
5.3-rc4          1.31375
5.3-rc4+lru_lock 0.609583
========= BHmean-50 brk_test ============
5.3-rc4          96141.4
5.3-rc4+lru_lock 98194
========= BHmean-95 brk_test ============
5.3-rc4          95818.5
5.3-rc4+lru_lock 97857.2
========= BHmean-99 brk_test ============
5.3-rc4          95818.5
5.3-rc4+lru_lock 97857.2

Alex Shi (14):
  mm/lru: move pgdat lru_lock into lruvec
  lru/memcg: move the lruvec->pgdat sync out lru_lock
  lru/memcg: using per lruvec lock in un/lock_page_lru
  lru/compaction: use per lruvec lock in isolate_migratepages_block
  lru/huge_page: use per lruvec lock in __split_huge_page
  lru/mlock: using per lruvec lock in munlock
  lru/swap: using per lruvec lock in page_cache_release
  lru/swap: use lruvec lock in activate_page
  lru/swap: use per lruvec lock in pagevec_lru_move_fn
  lru/swap: use per lruvec lock in release_pages
  lru/vmscan: using per lruvec lock in lists shrinking.
  lru/vmscan: use per lruvec lock in check_move_unevictable_pages
  lru/vmscan: using per lruvec lru_lock in get_scan_count
  mm/lru: fix the comments of lru_lock

 include/linux/memcontrol.h | 24 ++++++++++----
 include/linux/mm_types.h   |  2 +-
 include/linux/mmzone.h     |  6 ++--
 mm/compaction.c            | 48 +++++++++++++++++-----------
 mm/filemap.c               |  4 +--
 mm/huge_memory.c           |  9 ++++--
 mm/memcontrol.c            | 24 ++++++--------
 mm/mlock.c                 | 35 ++++++++++----------
 mm/mmzone.c                |  1 +
 mm/page_alloc.c            |  1 -
 mm/page_idle.c             |  4 +--
 mm/rmap.c                  |  2 +-
 mm/swap.c                  | 79 +++++++++++++++++++++++++---------------------
 mm/vmscan.c                | 63 ++++++++++++++++++------------------
 14 files changed, 166 insertions(+), 136 deletions(-)

Comments

Michal Hocko Aug. 20, 2019, 10:45 a.m. UTC | #1
On Tue 20-08-19 17:48:23, Alex Shi wrote:
> This patchset moves lru_lock into the lruvec, giving each lruvec its
> own lru_lock and thus each memcg its own lru_lock.
> 
> With this patch series, the per-memcg lru_lock eases lru_lock
> contention a lot.
> 
> In some data centers, containers are widely used to deploy different
> kinds of services; multiple memcgs then share the per-node
> pgdat->lru_lock, which causes heavy lock contention during lru operations.

Having some real-world workload numbers would be more than useful
for a non-trivial change like this. I believe Googlers have tried
something like this in the past but then didn't really have a good
example of workloads that benefit. I might misremember though. Cc Hugh.
Shakeel Butt Aug. 20, 2019, 4:48 p.m. UTC | #2
On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > This patchset moves lru_lock into the lruvec, giving each lruvec its
> > own lru_lock and thus each memcg its own lru_lock.
> >
> > With this patch series, the per-memcg lru_lock eases lru_lock
> > contention a lot.
> >
> > In some data centers, containers are widely used to deploy different
> > kinds of services; multiple memcgs then share the per-node
> > pgdat->lru_lock, which causes heavy lock contention during lru operations.
>
> Having some real-world workload numbers would be more than useful
> for a non-trivial change like this. I believe Googlers have tried
> something like this in the past but then didn't really have a good
> example of workloads that benefit. I might misremember though. Cc Hugh.
>

We, at Google, have been using per-memcg lru locks for more than 7
years. Per-memcg lru locks are really beneficial for providing
performance isolation if there are multiple distinct jobs/memcgs
running on large machines. We are planning to upstream our internal
implementation. I will let Hugh comment on that.

thanks,
Shakeel
Hugh Dickins Aug. 20, 2019, 6:24 p.m. UTC | #3
On Tue, 20 Aug 2019, Shakeel Butt wrote:
> On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > > This patchset moves lru_lock into the lruvec, giving each lruvec its
> > > own lru_lock and thus each memcg its own lru_lock.
> > >
> > > With this patch series, the per-memcg lru_lock eases lru_lock
> > > contention a lot.
> > >
> > > In some data centers, containers are widely used to deploy different
> > > kinds of services; multiple memcgs then share the per-node
> > > pgdat->lru_lock, which causes heavy lock contention during lru operations.
> >
> > Having some real-world workload numbers would be more than useful
> > for a non-trivial change like this. I believe Googlers have tried
> > something like this in the past but then didn't really have a good
> > example of workloads that benefit. I might misremember though. Cc Hugh.
> >
> 
> We, at Google, have been using per-memcg lru locks for more than 7
> years. Per-memcg lru locks are really beneficial for providing
> performance isolation if there are multiple distinct jobs/memcgs
> running on large machines. We are planning to upstream our internal
> implementation. I will let Hugh comment on that.

Thanks for the Cc Michal.  As Shakeel says, Google prodkernel has been
using our per-memcg lru locks for 7 years or so.  Yes, we did not come
up with supporting performance data at the time of posting, nor since:
I see Alex has done much better on that (though I haven't even glanced
to see if +s are persuasive).

https://lkml.org/lkml/2012/2/20/434
was how ours was back then; some parts of that went in, then attached
lrulock417.tar is how it was the last time I rebased, to v4.17.

I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
and/or mmotm.  Then compare with what Alex has, to see if there's any
good reason to prefer one to the other: if no good reason to prefer ours,
I doubt we shall bother to repost, but just use it as basis for helping
to review or improve Alex's.

Hugh
Alex Shi Aug. 21, 2019, 1:21 a.m. UTC | #4
> 
> Thanks for the Cc Michal.  As Shakeel says, Google prodkernel has been
> using our per-memcg lru locks for 7 years or so.  Yes, we did not come
> up with supporting performance data at the time of posting, nor since:
> I see Alex has done much better on that (though I haven't even glanced
> to see if +s are persuasive).
> 
> https://lkml.org/lkml/2012/2/20/434
> was how ours was back then; some parts of that went in, then attached
> lrulock417.tar is how it was the last time I rebased, to v4.17.
> 
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm.  Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.
> 

Thanks to you all! Very glad to see we are working on the same point. :)
Beyond the per-memcg lru_lock, there is much room for lru and page
replacement tuning. Anyway, hope to see your updates and more review
comments soon.

Thanks
Alex
Alex Shi Aug. 21, 2019, 2 a.m. UTC | #5
On 2019/8/21 2:24 AM, Hugh Dickins wrote:
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm.  Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.

For your review, my patchset is pretty straight and simple. It just uses a per-lruvec lru_lock to replace the pgdat lru_lock where necessary, just that.
We could talk more after I'm back at work. :)

Thanks a lot!
Alex
Daniel Jordan Aug. 21, 2019, 6 p.m. UTC | #6
Hi Alex,

On 8/20/19 5:48 AM, Alex Shi wrote:
> In some data centers, containers are widely used to deploy different
> kinds of services; multiple memcgs then share the per-node
> pgdat->lru_lock, which causes heavy lock contention during lru operations.
> 
> On my 2-socket, 6-cores-per-socket E5-2630 platform, 24 containers run
> aim9 simultaneously with this mmtests config:
>          # AIM9
>          export AIM9_TESTTIME=180
>          export AIM9_TESTLIST=page_test,brk_test
> 
> A perf lock report shows heavy contention on lru_lock in a 20-second snapshot:
>                          Name   acquired  contended   avg wait (ns) total wait (ns)   max wait (ns)   min wait (ns)
>          &(ptlock_ptr(pag...         22          0               0       0               0               0
>          ...
>          &(&pgdat->lru_lo...          9          7           12728       89096           26656            1597

This is system-wide, right, not per container?  Even per container, 89 usec isn't much contention over 20 seconds.  You may want to give this a try:

   https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].

It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?

> With this patch series, lruvec->lru_lock shows no contention
>          &(&lruvec->lru_l...          8          0               0       0               0               0
> 
> and aim9 page_test/brk_test performance increased by 5%~50%.

Where does the 50% number come in?  The numbers below seem to show only a ~4% boost.

> Detailed results are in aim9-pft.compare.log if needed;
> all containers' numbers increased and are pretty steady.
> 
> $for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i page_test ============"; cat aim9-pft.compare.log | grep "^$i.*page_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
> ========= Max page_test ============
> 5.3-rc4          34729.6
> 5.3-rc4+lru_lock 36128.3
> ========= Min page_test ============
> 5.3-rc4          33644.2
> 5.3-rc4+lru_lock 35349.7
> ========= Hmean page_test ============
> 5.3-rc4          34355.4
> 5.3-rc4+lru_lock 35810.9
> ========= Stddev page_test ============
> 5.3-rc4          319.757
> 5.3-rc4+lru_lock 223.324
> ========= CoeffVar page_test ============
> 5.3-rc4          0.93125
> 5.3-rc4+lru_lock 0.623333
> ========= BHmean-50 page_test ============
> 5.3-rc4          34579.2
> 5.3-rc4+lru_lock 35977.1
> ========= BHmean-95 page_test ============
> 5.3-rc4          34421.7
> 5.3-rc4+lru_lock 35853.6
> ========= BHmean-99 page_test ============
> 5.3-rc4          34421.7
> 5.3-rc4+lru_lock 35853.6
> 
> $for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i brk_test ============"; cat aim9-pft.compare.log | grep "^$i.*brk_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
> ========= Max brk_test ============
> 5.3-rc4          96647.7
> 5.3-rc4+lru_lock 98960.3
> ========= Min brk_test ============
> 5.3-rc4          91800.8
> 5.3-rc4+lru_lock 96817.6
> ========= Hmean brk_test ============
> 5.3-rc4          95470
> 5.3-rc4+lru_lock 97769.6
> ========= Stddev brk_test ============
> 5.3-rc4          1253.52
> 5.3-rc4+lru_lock 596.593
> ========= CoeffVar brk_test ============
> 5.3-rc4          1.31375
> 5.3-rc4+lru_lock 0.609583
> ========= BHmean-50 brk_test ============
> 5.3-rc4          96141.4
> 5.3-rc4+lru_lock 98194
> ========= BHmean-95 brk_test ============
> 5.3-rc4          95818.5
> 5.3-rc4+lru_lock 97857.2
> ========= BHmean-99 brk_test ============
> 5.3-rc4          95818.5
> 5.3-rc4+lru_lock 97857.2

[1] https://lore.kernel.org/linux-mm/CABdVr8R2y9B+2zzSAT_Ve=BQCa+F+E9_kVH+C28DGpkeQitiog@mail.gmail.com/
Alex Shi Aug. 22, 2019, 11:56 a.m. UTC | #7
On 2019/8/22 2:00 AM, Daniel Jordan wrote:
>>
> 
> This is system-wide, right, not per container?  Even per container, 89 usec isn't much contention over 20 seconds.  You may want to give this a try:

Yes, perf lock shows the host-wide info.
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].

Thanks for sharing. This patchset cannot help the [1] case, since it only relieves per-container lock contention for now. Yes, the readtwice case could be more sensitive to these lru_lock changes in containers. I may try it in a container with some tuning. But anyway, aim9 is also pretty good at showing the problem and the solution. :)
> 
> It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?

We repeatedly create and delete a lot of different containers according to server load/usage, so a normal workload causes lots of page allocations/removals. aim9 reflects part of these scenarios. I don't know the DB scenario yet.

> 
>> With this patch series, lruvec->lru_lock shows no contention
>>          &(&lruvec->lru_l...          8          0               0       0               0               0
>>
>> and aim9 page_test/brk_test performance increased by 5%~50%.
> 
> Where does the 50% number come in?  The numbers below seem to show only a ~4% boost.

the Stddev/CoeffVar case improves by about 50%. One container's mmtests result is as follows:

Stddev    page_test      245.15 (   0.00%)      189.29 (  22.79%)
Stddev    brk_test      1258.60 (   0.00%)      629.16 (  50.01%)
CoeffVar  page_test        0.71 (   0.00%)        0.53 (  26.05%)
CoeffVar  brk_test         1.32 (   0.00%)        0.64 (  51.14%)
Daniel Jordan Aug. 22, 2019, 3:20 p.m. UTC | #8
On 8/22/19 7:56 AM, Alex Shi wrote:
> On 2019/8/22 2:00 AM, Daniel Jordan wrote:
>>    https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
> 
> Thanks for sharing. This patchset cannot help the [1] case, since it only relieves per-container lock contention for now.

I should've been clearer.  [1] is meant as an example of someone suffering from lru_lock during reclaim.  Wouldn't your series help per-memcg reclaim?

> Yes, the readtwice case could be more sensitive to these lru_lock changes in containers. I may try it in a container with some tuning. But anyway, aim9 is also pretty good at showing the problem and the solution. :)
>>
>> It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?
> 
> We repeatedly create and delete a lot of different containers according to server load/usage, so a normal workload causes lots of page allocations/removals. 

I think numbers from that scenario would help your case.

> aim9 reflects part of these scenarios. I don't know the DB scenario yet.

We see it during DB shutdown when each DB process frees its memory (zap_pte_range -> mark_page_accessed).  But that's a different thing, clearly Not This Series.

>>> With this patch series, lruvec->lru_lock shows no contention
>>>           &(&lruvec->lru_l...          8          0               0       0               0               0
>>>
>>> and aim9 page_test/brk_test performance increased by 5%~50%.
>>
>> Where does the 50% number come in?  The numbers below seem to show only a ~4% boost.
> 
> the Stddev/CoeffVar case improves by about 50%. One container's mmtests result is as follows:
> 
> Stddev    page_test      245.15 (   0.00%)      189.29 (  22.79%)
> Stddev    brk_test      1258.60 (   0.00%)      629.16 (  50.01%)
> CoeffVar  page_test        0.71 (   0.00%)        0.53 (  26.05%)
> CoeffVar  brk_test         1.32 (   0.00%)        0.64 (  51.14%)

Aha.  50% decrease in stdev.
Hugh Dickins Aug. 24, 2019, 1:59 a.m. UTC | #9
On Wed, 21 Aug 2019, Alex Shi wrote:
> On 2019/8/21 2:24 AM, Hugh Dickins wrote:
> > I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> > and/or mmotm.  Then compare with what Alex has, to see if there's any
> > good reason to prefer one to the other: if no good reason to prefer ours,
> > I doubt we shall bother to repost, but just use it as basis for helping
> > to review or improve Alex's.
> 
> For your review, my patchset is pretty straight and simple.
> It just uses a per-lruvec lru_lock to replace the pgdat lru_lock
> where necessary, just that.  We could talk more after I'm back at work. :)

Sorry to be the bearer of bad news, Alex, but when you said "straight and
simple", I feared that your patchset would turn out to be fundamentally
too simple.

And that is so. I have only to see the
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
line in isolate_migratepages_block() in mm/compaction.c, and check
that mem_cgroup_page_lruvec() is little changed in mm/memcontrol.c.

The central problem with per-memcg lru_lock is that you do not know
for sure what lock to take (which memcg a page belongs to) until you
have taken the expected lock, and then checked whether page->memcg
is still the same - backing out and trying again if not.
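
Schematically, the safe pattern looks something like this (just a
sketch to illustrate, with a hypothetical helper name - not code from
either patchset):

        struct lruvec *lock_page_lruvec(struct page *page)
        {
                struct lruvec *lruvec;

                for (;;) {
                        /* guess the lruvec from the page's current memcg */
                        lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
                        spin_lock_irq(&lruvec->lru_lock);
                        /* recheck: page->memcg may have changed meanwhile */
                        if (lruvec == mem_cgroup_page_lruvec(page, page_pgdat(page)))
                                return lruvec;  /* still the right lock */
                        /* raced with a memcg change: back out and retry */
                        spin_unlock_irq(&lruvec->lru_lock);
                }
        }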

Fix that central problem, and you end up with a more complicated
patchset, much like ours.  It's true that when ours was first developed,
the memcg situation was more complicated in several ways, and perhaps
some aspects of our patchset could be simplified now (though I've not
identified any).  Johannes in particular has done a great deal of
simplifying work in memcg over the last few years, but there are still
situations in which a page's memcg can change (move_charge_at_immigrate
and swapin readahead spring to mind - or perhaps the latter is only an
issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
reparenting will be brought back one day).

I did not review your patchset in detail, and wasn't able to get very
far in testing it.  At first I was put off by set_task_reclaim_state
warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
itself, not from your patchset or mine (but I've not yet investigated
what's responsible for them).  Within a minute of starting swapping
load, kcompactd compact_lock_irqsave() in isolate_migratepages_block()
would deadlock, and I did not get further.  (Though I did also notice
that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
froze in booting - tiresomely, one has to keep both the memcg and
no-memcg locking to cope with that case, and I guess you had not.)
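
(To illustrate that last point - a sketch only, with a made-up name for
the node lruvec field: when memcg is disabled at boot, any lookup helper
has to keep a fallback path to the per-node lruvec and its lock:)

        static struct lruvec *page_lruvec(struct page *page)
        {
                struct pglist_data *pgdat = page_pgdat(page);

                if (mem_cgroup_disabled())
                        return &pgdat->lruvec;  /* no memcg: per-node lruvec */
                return mem_cgroup_page_lruvec(page, pgdat);
        }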

Rather than duplicating effort, I would advise you to give our patchset
a try, and if it works for you, help towards getting that one merged:
but of course, it's up to you.

I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
spam the list with patches yet, because I do not have any stats or
argument in support of the series, as Andrew asked for years ago and
Michal asks again now.  But aside from that I consider it ready, and
will let Shakeel take it over from here, while I get back to what I
diverted from (but of course I'll try to answer questions on it).

Thanks,
Hugh
Konstantin Khlebnikov Aug. 26, 2019, 8:39 a.m. UTC | #10
On 22/08/2019 18.20, Daniel Jordan wrote:
> On 8/22/19 7:56 AM, Alex Shi wrote:
>> On 2019/8/22 2:00 AM, Daniel Jordan wrote:
>>>    https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>>> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this 
>>> lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
>>
>> Thanks for sharing. This patchset cannot help the [1] case, since it only relieves per-container lock contention for now.
> 
> I should've been clearer.  [1] is meant as an example of someone suffering from lru_lock during reclaim.  Wouldn't your series help 
> per-memcg reclaim?
> 
>> Yes, the readtwice case could be more sensitive to these lru_lock changes in containers. I may try it in a container
>> with some tuning. But anyway, aim9 is also pretty good at showing the problem and the solution. :)
>>>
>>> It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?
>>
>> We repeatedly create and delete a lot of different containers according to server load/usage, so a normal workload
>> causes lots of page allocations/removals. 
> 
> I think numbers from that scenario would help your case.
> 
>> aim9 reflects part of these scenarios. I don't know the DB scenario yet.
> 
> We see it during DB shutdown when each DB process frees its memory (zap_pte_range -> mark_page_accessed).  But that's a different thing, 
> clearly Not This Series.
> 
>>>> With this patch series, lruvec->lru_lock shows no contention
>>>>           &(&lruvec->lru_l...          8          0               0       0               0               0
>>>>
>>>> and aim9 page_test/brk_test performance increased by 5%~50%.
>>>
>>> Where does the 50% number come in?  The numbers below seem to show only a ~4% boost.
>>
>> the Stddev/CoeffVar case improves by about 50%. One container's mmtests result is as follows:
>>
>> Stddev    page_test      245.15 (   0.00%)      189.29 (  22.79%)
>> Stddev    brk_test      1258.60 (   0.00%)      629.16 (  50.01%)
>> CoeffVar  page_test        0.71 (   0.00%)        0.53 (  26.05%)
>> CoeffVar  brk_test         1.32 (   0.00%)        0.64 (  51.14%)
> 
> Aha.  50% decrease in stdev.
> 

After splitting the lru locks, the present per-cpu pagevecs work
not so well because they mix pages from different cgroups.

pagevec_lru_move_fn and friends need a better implementation:
either sorting pages or splitting the vectors on a per-lruvec basis.
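
Roughly along these lines - an untested sketch with simplified
signatures, batching runs of pages that share an lruvec under one lock
(a real version would also need the recheck-after-lock dance Hugh
describes above):

        static void pagevec_lru_move_fn(struct pagevec *pvec,
                                        void (*move_fn)(struct page *, struct lruvec *))
        {
                struct lruvec *lruvec = NULL;
                unsigned long flags = 0;
                int i;

                for (i = 0; i < pagevec_count(pvec); i++) {
                        struct page *page = pvec->pages[i];
                        struct lruvec *new;

                        /* switch locks only when the lruvec changes */
                        new = mem_cgroup_page_lruvec(page, page_pgdat(page));
                        if (new != lruvec) {
                                if (lruvec)
                                        spin_unlock_irqrestore(&lruvec->lru_lock, flags);
                                lruvec = new;
                                spin_lock_irqsave(&lruvec->lru_lock, flags);
                        }
                        move_fn(page, lruvec);
                }
                if (lruvec)
                        spin_unlock_irqrestore(&lruvec->lru_lock, flags);
                release_pages(pvec->pages, pagevec_count(pvec));
                pagevec_reinit(pvec);
        }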
Alex Shi Aug. 26, 2019, 2:22 p.m. UTC | #11
On 2019/8/26 4:39 PM, Konstantin Khlebnikov wrote:
> After splitting the lru locks, the present per-cpu pagevecs work
> not so well because they mix pages from different cgroups.
> 
> pagevec_lru_move_fn and friends need a better implementation:
> either sorting pages or splitting the vectors on a per-lruvec basis.

Right, this should be the next step to improve. Maybe we could try a per-lruvec pagevec?

Thanks
Alex
Alex Shi Aug. 26, 2019, 2:25 p.m. UTC | #12
On 2019/8/22 11:20 PM, Daniel Jordan wrote:
>>
>>>    https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>>> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
>>
>> Thanks for sharing. This patchset cannot help the [1] case, since it only relieves per-container lock contention for now.
> 
> I should've been clearer.  [1] is meant as an example of someone suffering from lru_lock during reclaim.  Wouldn't your series help per-memcg reclaim?

Yes, I got your point. Since aim9 doesn't show much improvement there, I am trying this case in containers.

Thanks
Alex
Alex Shi Aug. 26, 2019, 2:35 p.m. UTC | #13
On 2019/8/24 9:59 AM, Hugh Dickins wrote:
> On Wed, 21 Aug 2019, Alex Shi wrote:
>> On 2019/8/21 2:24 AM, Hugh Dickins wrote:
>>> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
>>> and/or mmotm.  Then compare with what Alex has, to see if there's any
>>> good reason to prefer one to the other: if no good reason to prefer ours,
>>> I doubt we shall bother to repost, but just use it as basis for helping
>>> to review or improve Alex's.
>>
>> For your review, my patchset is pretty straight and simple.
>> It just uses a per-lruvec lru_lock to replace the pgdat lru_lock
>> where necessary, just that.  We could talk more after I'm back at work. :)
> 
> Sorry to be the bearer of bad news, Alex, but when you said "straight and
> simple", I feared that your patchset would turn out to be fundamentally
> too simple.
> 
> And that is so. I have only to see the
> 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> line in isolate_migratepages_block() in mm/compaction.c, and check
> that mem_cgroup_page_lruvec() is little changed in mm/memcontrol.c.
> 
> The central problem with per-memcg lru_lock is that you do not know
> for sure what lock to take (which memcg a page belongs to) until you
> have taken the expected lock, and then checked whether page->memcg
> is still the same - backing out and trying again if not.
> 
> Fix that central problem, and you end up with a more complicated
> patchset, much like ours.  It's true that when ours was first developed,
> the memcg situation was more complicated in several ways, and perhaps
> some aspects of our patchset could be simplified now (though I've not
> identified any).  Johannes in particular has done a great deal of
> simplifying work in memcg over the last few years, but there are still
> situations in which a page's memcg can change (move_charge_at_immigrate
> and swapin readahead spring to mind - or perhaps the latter is only an
> issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
> reparenting will be brought back one day).
> 
> I did not review your patchset in detail, and wasn't able to get very
> far in testing it.  At first I was put off by set_task_reclaim_state
> warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
> itself, not from your patchset or mine (but I've not yet investigated
> what's responsible for them).  Within a minute of starting swapping
> load, kcompactd compact_lock_irqsave() in isolate_migratepages_block()
> would deadlock, and I did not get further.  (Though I did also notice
> that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
> froze in booting - tiresomely, one has to keep both the memcg and
> no-memcg locking to cope with that case, and I guess you had not.)
> 
> Rather than duplicating effort, I would advise you to give our patchset
> a try, and if it works for you, help towards getting that one merged:
> but of course, it's up to you.

Thanks a lot for all the info and reminders! Yes, the page->memcg change would be a problem. I will study your patchset and try to merge it.

> 
> I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
> spam the list with patches yet, because I do not have any stats or
> argument in support of the series, as Andrew asked for years ago and
> Michal asks again now.  But aside from that I consider it ready, and
> will let Shakeel take it over from here, while I get back to what I
> diverted from (but of course I'll try to answer questions on it).
> 
I will try to look into them. Thanks for your kind offer. :)

Thanks!
Alex