Message ID: 1566294517-86418-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
Series: per memcg lru_lock
On Tue 20-08-19 17:48:23, Alex Shi wrote:
> This patchset moves lru_lock into lruvec, giving each lruvec its own
> lru_lock, and thus a lru_lock for each memcg.
>
> Per-memcg lru_lock would ease the lru_lock contention a lot in this
> patch series.
>
> In some data centers, containers are widely used to deploy different
> kinds of services, so multiple memcgs share the per-node
> pgdat->lru_lock, which causes heavy lock contention during lru
> operations.

Having some real world workload numbers would be more than useful for a
non-trivial change like this. I believe googlers have tried something
like this in the past but then didn't really have a good example of
workloads that benefit. I might misremember though. Cc Hugh.
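As background for the cover letter quoted above: structurally, "move
lru_lock into lruvec" amounts roughly to the sketch below. The field
layout is illustrative only, not code from the posted series or from
Google's kernel.

/*
 * Illustrative sketch: the lock that used to live in the per-node
 * pgdat moves into struct lruvec, so each (memcg, node) pair gets
 * its own lock instead of all memcgs sharing one lock per node.
 */
struct lruvec {
	struct list_head		lists[NR_LRU_LISTS];
	struct zone_reclaim_stat	reclaim_stat;
	spinlock_t			lru_lock;	/* replaces pgdat->lru_lock */
#ifdef CONFIG_MEMCG
	struct pglist_data		*pgdat;		/* back-pointer to the node */
#endif
};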
On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > This patchset moves lru_lock into lruvec, giving each lruvec its own
> > lru_lock, and thus a lru_lock for each memcg.
> >
> > Per-memcg lru_lock would ease the lru_lock contention a lot in this
> > patch series.
> >
> > In some data centers, containers are widely used to deploy different
> > kinds of services, so multiple memcgs share the per-node
> > pgdat->lru_lock, which causes heavy lock contention during lru
> > operations.
>
> Having some real world workload numbers would be more than useful
> for a non-trivial change like this. I believe googlers have tried
> something like this in the past but then didn't really have a good
> example of workloads that benefit. I might misremember though. Cc Hugh.

We, at Google, have been using per-memcg lru locks for more than 7
years. Per-memcg lru locks are really beneficial for providing
performance isolation if there are multiple distinct jobs/memcgs
running on large machines. We are planning to upstream our internal
implementation. I will let Hugh comment on that.

thanks,
Shakeel
On Tue, 20 Aug 2019, Shakeel Butt wrote:
> On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > > This patchset moves lru_lock into lruvec, giving each lruvec its own
> > > lru_lock, and thus a lru_lock for each memcg.
> > >
> > > Per-memcg lru_lock would ease the lru_lock contention a lot in this
> > > patch series.
> > >
> > > In some data centers, containers are widely used to deploy different
> > > kinds of services, so multiple memcgs share the per-node
> > > pgdat->lru_lock, which causes heavy lock contention during lru
> > > operations.
> >
> > Having some real world workload numbers would be more than useful
> > for a non-trivial change like this. I believe googlers have tried
> > something like this in the past but then didn't really have a good
> > example of workloads that benefit. I might misremember though. Cc Hugh.
>
> We, at Google, have been using per-memcg lru locks for more than 7
> years. Per-memcg lru locks are really beneficial for providing
> performance isolation if there are multiple distinct jobs/memcgs
> running on large machines. We are planning to upstream our internal
> implementation. I will let Hugh comment on that.

Thanks for the Cc Michal. As Shakeel says, Google prodkernel has been
using our per-memcg lru locks for 7 years or so. Yes, we did not come
up with supporting performance data at the time of posting, nor since:
I see Alex has done much better on that (though I haven't even glanced
to see if the +s are persuasive).

https://lkml.org/lkml/2012/2/20/434
was how ours was back then; some parts of that went in, then the
attached lrulock417.tar is how it was the last time I rebased, to v4.17.

I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
and/or mmotm. Then compare with what Alex has, to see if there's any
good reason to prefer one to the other: if no good reason to prefer ours,
I doubt we shall bother to repost, but just use it as basis for helping
to review or improve Alex's.

Hugh
> Thanks for the Cc Michal. As Shakeel says, Google prodkernel has been
> using our per-memcg lru locks for 7 years or so. Yes, we did not come
> up with supporting performance data at the time of posting, nor since:
> I see Alex has done much better on that (though I haven't even glanced
> to see if the +s are persuasive).
>
> https://lkml.org/lkml/2012/2/20/434
> was how ours was back then; some parts of that went in, then the
> attached lrulock417.tar is how it was the last time I rebased, to v4.17.
>
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm. Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.

Thanks to you all! Very glad to see we are working on the same point. :)
And it is not only per-memcg lru_lock: there is much room for lru and
page replacement tuning. Anyway, hope to see your update and more
review comments soon.

Thanks
Alex
On 2019/8/21 2:24 AM, Hugh Dickins wrote:
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm. Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.

For your review, my patchset is pretty straightforward and simple.
It just uses a per-lruvec lru_lock to replace the pgdat lru_lock where
necessary, just that. We could talk more after I am back at work. :)

Thanks a lot!
Alex
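To make the "replace necessary pgdat lru_lock" description concrete, a
typical call site would change roughly as follows; this is an
illustrative hunk, not one copied from the posted series.

/* Before: every memcg on the node serializes on the single per-node lock. */
spin_lock_irq(&pgdat->lru_lock);
del_page_from_lru_list(page, lruvec, page_lru(page));
spin_unlock_irq(&pgdat->lru_lock);

/* After: only pages in the same lruvec (same memcg and node) contend. */
spin_lock_irq(&lruvec->lru_lock);
del_page_from_lru_list(page, lruvec, page_lru(page));
spin_unlock_irq(&lruvec->lru_lock);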
Hi Alex,

On 8/20/19 5:48 AM, Alex Shi wrote:
> In some data centers, containers are widely used to deploy different
> kinds of services, so multiple memcgs share the per-node
> pgdat->lru_lock, which causes heavy lock contention during lru
> operations.
>
> On my 2 socket * 6 cores E5-2630 platform, 24 containers run aim9
> simultaneously with mmtests' config:
> # AIM9
> export AIM9_TESTTIME=180
> export AIM9_TESTLIST=page_test,brk_test
>
> perf lock report shows much contention on lru_lock in a 20 second snapshot:
>   Name                  acquired  contended  avg wait (ns)  total wait (ns)  max wait (ns)  min wait (ns)
>   &(ptlock_ptr(pag...         22          0              0                0              0              0
>   ...
>   &(&pgdat->lru_lo...          9          7          12728            89096          26656           1597

This is system-wide right, not per container? Even per container, 89
usec isn't much contention over 20 seconds. You may want to give this a
try:

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

It's also synthetic but it stresses lru_lock more than just anon
alloc/free. It hits the page activate path, which is where we see this
lock in our database, and if enough memory is configured lru_lock also
gets stressed during reclaim, similar to [1].

It'd be better though, as Michal suggests, to use the real workload
that's causing problems. Where are you seeing contention?

> With this patch series, lruvec->lru_lock shows no contention:
>   &(&lruvec->lru_l...          8          0              0                0              0              0
>
> and aim9 page_test/brk_test performance increased 5%~50%.

Where does the 50% number come in? The numbers below seem to only show
~4% boost.

> BTW, detailed results are in aim9-pft.compare.log if needed.
> All containers' data increased and are pretty steady.
>
> $ for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do \
>     echo "========= $i page_test ============"; \
>     grep "^$i.*page_test" aim9-pft.compare.log | \
>     awk 'BEGIN {a=b=0;} { a+=$3; b+=$6 } END { print "5.3-rc4 " a/24; print "5.3-rc4+lru_lock " b/24}'; \
>   done
>
>                        5.3-rc4     5.3-rc4+lru_lock
> Max       page_test    34729.6     36128.3
> Min       page_test    33644.2     35349.7
> Hmean     page_test    34355.4     35810.9
> Stddev    page_test      319.757     223.324
> CoeffVar  page_test        0.93125     0.623333
> BHmean-50 page_test    34579.2     35977.1
> BHmean-95 page_test    34421.7     35853.6
> BHmean-99 page_test    34421.7     35853.6
>
> $ for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do \
>     echo "========= $i brk_test ============"; \
>     grep "^$i.*brk_test" aim9-pft.compare.log | \
>     awk 'BEGIN {a=b=0;} { a+=$3; b+=$6 } END { print "5.3-rc4 " a/24; print "5.3-rc4+lru_lock " b/24}'; \
>   done
>
>                        5.3-rc4     5.3-rc4+lru_lock
> Max       brk_test     96647.7     98960.3
> Min       brk_test     91800.8     96817.6
> Hmean     brk_test     95470       97769.6
> Stddev    brk_test      1253.52      596.593
> CoeffVar  brk_test         1.31375      0.609583
> BHmean-50 brk_test     96141.4     98194
> BHmean-95 brk_test     95818.5     97857.2
> BHmean-99 brk_test     95818.5     97857.2

[1] https://lore.kernel.org/linux-mm/CABdVr8R2y9B+2zzSAT_Ve=BQCa+F+E9_kVH+C28DGpkeQitiog@mail.gmail.com/
On 2019/8/22 2:00 AM, Daniel Jordan wrote:
> This is system-wide right, not per container? Even per container, 89
> usec isn't much contention over 20 seconds. You may want to give this
> a try:

Yes, perf lock shows the host info.

> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> It's also synthetic but it stresses lru_lock more than just anon
> alloc/free. It hits the page activate path, which is where we see this
> lock in our database, and if enough memory is configured lru_lock also
> gets stressed during reclaim, similar to [1].

Thanks for sharing. This patchset cannot help the [1] case, since it
just relieves the per-container lock contention for now.

Yes, the readtwice case could be more sensitive to this lru_lock change
in containers. I may try to use it in a container with some tuning. But
anyway, aim9 is also pretty good at showing the problem and the
solution. :)

> It'd be better though, as Michal suggests, to use the real workload
> that's causing problems. Where are you seeing contention?

We repeatedly create and delete a lot of different containers according
to server load/usage, so a normal workload can cause lots of page
alloc/remove. aim9 reflects part of those scenarios. I don't know the
DB scenario yet.

> > With this patch series, lruvec->lru_lock shows no contention:
> >   &(&lruvec->lru_l...          8          0              0                0              0              0
> >
> > and aim9 page_test/brk_test performance increased 5%~50%.
>
> Where does the 50% number come in? The numbers below seem to only show
> ~4% boost.

The Stddev/CoeffVar case has about a 50% improvement. One container's
mmtests results are as follows:

Stddev    page_test     245.15 (  0.00%)     189.29 ( 22.79%)
Stddev    brk_test     1258.60 (  0.00%)     629.16 ( 50.01%)
CoeffVar  page_test       0.71 (  0.00%)       0.53 ( 26.05%)
CoeffVar  brk_test        1.32 (  0.00%)       0.64 ( 51.14%)
On 8/22/19 7:56 AM, Alex Shi wrote:
> On 2019/8/22 2:00 AM, Daniel Jordan wrote:
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> >
> > It's also synthetic but it stresses lru_lock more than just anon
> > alloc/free. It hits the page activate path, which is where we see this
> > lock in our database, and if enough memory is configured lru_lock also
> > gets stressed during reclaim, similar to [1].
>
> Thanks for sharing. This patchset cannot help the [1] case, since it
> just relieves the per-container lock contention for now.

I should've been clearer. [1] is meant as an example of someone
suffering from lru_lock during reclaim. Wouldn't your series help
per-memcg reclaim?

> Yes, the readtwice case could be more sensitive to this lru_lock change
> in containers. I may try to use it in a container with some tuning. But
> anyway, aim9 is also pretty good at showing the problem and the
> solution. :)
>
> > It'd be better though, as Michal suggests, to use the real workload
> > that's causing problems. Where are you seeing contention?
>
> We repeatedly create and delete a lot of different containers according
> to server load/usage, so a normal workload can cause lots of page
> alloc/remove.

I think numbers from that scenario would help your case.

> aim9 reflects part of those scenarios. I don't know the DB scenario yet.

We see it during DB shutdown when each DB process frees its memory
(zap_pte_range -> mark_page_accessed). But that's a different thing,
clearly Not This Series.

> > > With this patch series, lruvec->lru_lock shows no contention:
> > >   &(&lruvec->lru_l...          8          0              0                0              0              0
> > >
> > > and aim9 page_test/brk_test performance increased 5%~50%.
> >
> > Where does the 50% number come in? The numbers below seem to only
> > show ~4% boost.
>
> The Stddev/CoeffVar case has about a 50% improvement. One container's
> mmtests results are as follows:
>
> Stddev    page_test     245.15 (  0.00%)     189.29 ( 22.79%)
> Stddev    brk_test     1258.60 (  0.00%)     629.16 ( 50.01%)
> CoeffVar  page_test       0.71 (  0.00%)       0.53 ( 26.05%)
> CoeffVar  brk_test        1.32 (  0.00%)       0.64 ( 51.14%)

Aha. 50% decrease in stdev.
On Wed, 21 Aug 2019, Alex Shi wrote:
> On 2019/8/21 2:24 AM, Hugh Dickins wrote:
> > I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> > and/or mmotm. Then compare with what Alex has, to see if there's any
> > good reason to prefer one to the other: if no good reason to prefer ours,
> > I doubt we shall bother to repost, but just use it as basis for helping
> > to review or improve Alex's.
>
> For your review, my patchset is pretty straightforward and simple.
> It just uses a per-lruvec lru_lock to replace the pgdat lru_lock where
> necessary, just that. We could talk more after I am back at work. :)

Sorry to be the bearer of bad news, Alex, but when you said "straight
and simple", I feared that your patchset would turn out to be
fundamentally too simple. And that is so. I have only to see the

	lruvec = mem_cgroup_page_lruvec(page, pgdat);

line in isolate_migratepages_block() in mm/compaction.c, and check that
mem_cgroup_page_lruvec() is little changed in mm/memcontrol.c.

The central problem with per-memcg lru_lock is that you do not know for
sure what lock to take (which memcg a page belongs to) until you have
taken the expected lock, and then checked whether page->memcg is still
the same - backing out and trying again if not.

Fix that central problem, and you end up with a more complicated
patchset, much like ours. It's true that when ours was first developed,
the memcg situation was more complicated in several ways, and perhaps
some aspects of our patchset could be simplified now (though I've not
identified any). Johannes in particular has done a great deal of
simplifying work in memcg over the last few years, but there are still
situations in which a page's memcg can change (move_charge_at_immigrate
and swapin readahead spring to mind - or perhaps the latter is only an
issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
reparenting will be brought back one day).

I did not review your patchset in detail, and wasn't able to get very
far in testing it. At first I was put off by set_task_reclaim_state
warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
itself, not from your patchset or mine (but I've not yet investigated
what's responsible for them). Within a minute of starting swapping
load, kcompactd's compact_lock_irqsave() in isolate_migratepages_block()
would deadlock, and I did not get further. (Though I did also notice
that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
froze while booting - tiresomely, one has to keep both the memcg and
no-memcg locking to cope with that case, and I guess you had not.)

Rather than duplicating effort, I would advise you to give our patchset
a try, and if it works for you, help towards getting that one merged:
but of course, it's up to you.

I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
spam the list with patches yet, because I do not have any stats or
argument in support of the series, as Andrew asked for years ago and
Michal asks again now. But aside from that I consider it ready, and
will let Shakeel take it over from here, while I get back to what I
diverted from (but of course I'll try to answer questions on it).

Thanks,
Hugh
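The retry pattern Hugh describes - take the lock you expect, then
recheck which memcg/lruvec the page belongs to and back out if it
changed - looks roughly like the sketch below. The helper name is
hypothetical and the details are illustrative; a real version must also
keep the no-memcg (cgroup_disable=memory) locking working, as noted
above.

/*
 * Sketch of "take the expected lock, then recheck": the page may be
 * recharged to another memcg between computing its lruvec and taking
 * that lruvec's lock, so verify the mapping still holds afterwards.
 */
static struct lruvec *lock_page_lruvec_irq(struct page *page)
{
	struct pglist_data *pgdat = page_pgdat(page);
	struct lruvec *lruvec;

	for (;;) {
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
		spin_lock_irq(&lruvec->lru_lock);

		/* page->memcg may have changed before the lock was taken */
		if (lruvec == mem_cgroup_page_lruvec(page, pgdat))
			return lruvec;		/* lock still matches the page */

		spin_unlock_irq(&lruvec->lru_lock);	/* raced: back out, retry */
	}
}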
On 22/08/2019 18.20, Daniel Jordan wrote:
> On 8/22/19 7:56 AM, Alex Shi wrote:
> > On 2019/8/22 2:00 AM, Daniel Jordan wrote:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> > > It's also synthetic but it stresses lru_lock more than just anon
> > > alloc/free. It hits the page activate path, which is where we see this
> > > lock in our database, and if enough memory is configured lru_lock also
> > > gets stressed during reclaim, similar to [1].
> >
> > Thanks for sharing. This patchset cannot help the [1] case, since it
> > just relieves the per-container lock contention for now.
>
> I should've been clearer. [1] is meant as an example of someone
> suffering from lru_lock during reclaim. Wouldn't your series help
> per-memcg reclaim?
>
> > Yes, the readtwice case could be more sensitive to this lru_lock change
> > in containers. I may try to use it in a container with some tuning.
> > But anyway, aim9 is also pretty good at showing the problem and the
> > solution. :)
> > >
> > > It'd be better though, as Michal suggests, to use the real workload
> > > that's causing problems. Where are you seeing contention?
> >
> > We repeatedly create and delete a lot of different containers according
> > to server load/usage, so a normal workload can cause lots of page
> > alloc/remove.
>
> I think numbers from that scenario would help your case.
>
> > aim9 reflects part of those scenarios. I don't know the DB scenario yet.
>
> We see it during DB shutdown when each DB process frees its memory
> (zap_pte_range -> mark_page_accessed). But that's a different thing,
> clearly Not This Series.
>
> > > > With this patch series, lruvec->lru_lock shows no contention:
> > > >   &(&lruvec->lru_l...          8          0              0                0              0              0
> > > >
> > > > and aim9 page_test/brk_test performance increased 5%~50%.
> > >
> > > Where does the 50% number come in? The numbers below seem to only
> > > show ~4% boost.
> >
> > The Stddev/CoeffVar case has about a 50% improvement. One container's
> > mmtests results are as follows:
> >
> > Stddev    page_test     245.15 (  0.00%)     189.29 ( 22.79%)
> > Stddev    brk_test     1258.60 (  0.00%)     629.16 ( 50.01%)
> > CoeffVar  page_test       0.71 (  0.00%)       0.53 ( 26.05%)
> > CoeffVar  brk_test        1.32 (  0.00%)       0.64 ( 51.14%)
>
> Aha. 50% decrease in stdev.

After splitting the lru locks, the present per-cpu page vectors do not
work so well because they mix pages from different cgroups.

pagevec_lru_move_fn and friends need a better implementation: either
sorting pages or splitting the vectors on a per-lruvec basis.
On 2019/8/26 4:39 PM, Konstantin Khlebnikov wrote:
> After splitting the lru locks, the present per-cpu page vectors do not
> work so well because they mix pages from different cgroups.
>
> pagevec_lru_move_fn and friends need a better implementation: either
> sorting pages or splitting the vectors on a per-lruvec basis.

Right, this should be the next step of improvement. Maybe we could try
per-lruvec pagevecs?

Thanks
Alex
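One possible reading of "sorting pages or splitting the vectors on a
per-lruvec basis" is to keep the pagevec as-is and only drop and retake
the lru_lock when consecutive pages belong to different lruvecs. A
rough sketch under that assumption follows; the function name, callback
signature, and locking details are illustrative rather than taken from
any posted patch, and the page->memcg recheck Hugh raises elsewhere in
the thread would still be needed on top of this.

/*
 * Rough sketch of batching by lruvec: walk the pagevec and re-take the
 * lru_lock only when the page's lruvec differs from the one we hold.
 */
static void pagevec_lru_move_fn_per_lruvec(struct pagevec *pvec,
		void (*move_fn)(struct page *page, struct lruvec *lruvec))
{
	struct lruvec *locked = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct lruvec *lruvec =
			mem_cgroup_page_lruvec(page, page_pgdat(page));

		if (lruvec != locked) {
			if (locked)
				spin_unlock_irqrestore(&locked->lru_lock, flags);
			spin_lock_irqsave(&lruvec->lru_lock, flags);
			locked = lruvec;
		}
		(*move_fn)(page, lruvec);
	}
	if (locked)
		spin_unlock_irqrestore(&locked->lru_lock, flags);

	release_pages(pvec->pages, pagevec_count(pvec));
	pagevec_reinit(pvec);
}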
On 2019/8/22 11:20 PM, Daniel Jordan wrote:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> > > It's also synthetic but it stresses lru_lock more than just anon
> > > alloc/free. It hits the page activate path, which is where we see this
> > > lock in our database, and if enough memory is configured lru_lock also
> > > gets stressed during reclaim, similar to [1].
> >
> > Thanks for sharing. This patchset cannot help the [1] case, since it
> > just relieves the per-container lock contention for now.
>
> I should've been clearer. [1] is meant as an example of someone
> suffering from lru_lock during reclaim. Wouldn't your series help
> per-memcg reclaim?

Yes, I got your point. Since aim9 doesn't show much improvement there,
I am trying this case in containers.

Thanks
Alex
On 2019/8/24 9:59 AM, Hugh Dickins wrote:
> On Wed, 21 Aug 2019, Alex Shi wrote:
> > On 2019/8/21 2:24 AM, Hugh Dickins wrote:
> > > I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> > > and/or mmotm. Then compare with what Alex has, to see if there's any
> > > good reason to prefer one to the other: if no good reason to prefer ours,
> > > I doubt we shall bother to repost, but just use it as basis for helping
> > > to review or improve Alex's.
> >
> > For your review, my patchset is pretty straightforward and simple.
> > It just uses a per-lruvec lru_lock to replace the pgdat lru_lock where
> > necessary, just that. We could talk more after I am back at work. :)
>
> Sorry to be the bearer of bad news, Alex, but when you said "straight
> and simple", I feared that your patchset would turn out to be
> fundamentally too simple. And that is so. I have only to see the
>
> 	lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
> line in isolate_migratepages_block() in mm/compaction.c, and check that
> mem_cgroup_page_lruvec() is little changed in mm/memcontrol.c.
>
> The central problem with per-memcg lru_lock is that you do not know for
> sure what lock to take (which memcg a page belongs to) until you have
> taken the expected lock, and then checked whether page->memcg is still
> the same - backing out and trying again if not.
>
> Fix that central problem, and you end up with a more complicated
> patchset, much like ours. It's true that when ours was first developed,
> the memcg situation was more complicated in several ways, and perhaps
> some aspects of our patchset could be simplified now (though I've not
> identified any). Johannes in particular has done a great deal of
> simplifying work in memcg over the last few years, but there are still
> situations in which a page's memcg can change (move_charge_at_immigrate
> and swapin readahead spring to mind - or perhaps the latter is only an
> issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
> reparenting will be brought back one day).
>
> I did not review your patchset in detail, and wasn't able to get very
> far in testing it. At first I was put off by set_task_reclaim_state
> warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
> itself, not from your patchset or mine (but I've not yet investigated
> what's responsible for them). Within a minute of starting swapping
> load, kcompactd's compact_lock_irqsave() in isolate_migratepages_block()
> would deadlock, and I did not get further. (Though I did also notice
> that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
> froze while booting - tiresomely, one has to keep both the memcg and
> no-memcg locking to cope with that case, and I guess you had not.)
>
> Rather than duplicating effort, I would advise you to give our patchset
> a try, and if it works for you, help towards getting that one merged:
> but of course, it's up to you.

Thanks a lot for all the info and reminders! Yes, the page->memcg
change would be a problem. I will study your patchset and try to merge
them.

> I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
> spam the list with patches yet, because I do not have any stats or
> argument in support of the series, as Andrew asked for years ago and
> Michal asks again now. But aside from that I consider it ready, and
> will let Shakeel take it over from here, while I get back to what I
> diverted from (but of course I'll try to answer questions on it).

I will try to look into them. Thanks for your kind offer. :)

Thanks!
Alex