Message ID | 1605860847-47445-1-git-send-email-alex.shi@linux.alibaba.com |
---|---
State | New, archived |
Series | [next] mm/swap.c: reduce lock contention in lru_cache_add |
On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> The current relock logic changes the lru_lock whenever a new lruvec is
> found, so if two memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately and waiting
> for each other because of the fairness of the ticket spinlock.
>
> This patch sorts the pages by lruvec and takes each lru_lock only once
> in the above scenario, which reduces the fairness-induced waiting for
> lock re-acquisition. With it, vm-scalability/case-lru-file-readtwice
> gets a ~5% performance gain on my 2P*20core*HT machine.

But what happens when all or most of the pages belong to the same
lruvec? This sounds like the common case - won't it suffer?
On 2020/11/21 7:19 AM, Andrew Morton wrote:
> On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>> [...]
>
> But what happens when all or most of the pages belong to the same
> lruvec? This sounds like the common case - won't it suffer?

Hi Andrew,

My testing shows no regression in that situation; as with the original
CentOS 7 kernel, most of the time in lru-sensitive cases is spent on the
lru_lock.

Thanks
Alex
On 11/20/20 9:27 AM, Alex Shi wrote:
> The current relock logic changes the lru_lock whenever a new lruvec is
> found, so if two memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately and waiting
> for each other because of the fairness of the ticket spinlock.
>
> This patch sorts the pages by lruvec and takes each lru_lock only once
> in the above scenario, which reduces the fairness-induced waiting for
> lock re-acquisition. With it, vm-scalability/case-lru-file-readtwice
> gets a ~5% performance gain on my 2P*20core*HT machine.

Hm, once you sort the pages like this, it's a shame not to splice them
instead of more list_del() + list_add() iterations. update_lru_size()
could be also called once?

> [... rest of the patch quoted without further comment; see the diff at
> the end of this page ...]
On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
> On 11/20/20 9:27 AM, Alex Shi wrote:
>> [...]
>
> Hm, once you sort the pages like this, it's a shame not to splice them
> instead of more list_del() + list_add() iterations. update_lru_size()
> could be also called once?

Yes, it looks like a good idea to use splice instead of
list_del/list_add, but pages may be on different LRU lists within the
same lruvec, and may also come from different zones. That could mean
five separate passes for the different lists, and more for the zones...
so I gave up on that attempt.
On Fri, Nov 20, 2020 at 04:27:27PM +0800, Alex Shi wrote:
> The current relock logic changes the lru_lock whenever a new lruvec is
> found, so if two memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately and waiting
> for each other because of the fairness of the ticket spinlock.
>
> This patch sorts the pages by lruvec and takes each lru_lock only once
> in the above scenario, which reduces the fairness-induced waiting for
> lock re-acquisition. With it, vm-scalability/case-lru-file-readtwice
> gets a ~5% performance gain on my 2P*20core*HT machine.
>
> [... sort_page_lruvec() additions trimmed; see the diff at the end of
> this page ...]
>
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -	int i;
> -	struct lruvec *lruvec = NULL;
> +	int i, nr_lruvec;
>  	unsigned long flags = 0;
> +	struct page *page;
> +	struct lruvecs lruvecs;
>
> -	for (i = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);

Simply looping over the pvec multiple times (15 at most) for the
different lruvecs would be better, because 1) it requires no extra data
structures and therefore has better cache locality (theoretically
faster), and 2) it only loops once when !CONFIG_MEMCG and !CONFIG_NUMA,
and therefore has no impact on Android and Chrome OS.

> [... remainder of the quoted patch trimmed ...]
On 2020/11/26 12:52 PM, Yu Zhao wrote:
>>  */
>>  void __pagevec_lru_add(struct pagevec *pvec)
>>  {
>> -	int i;
>> -	struct lruvec *lruvec = NULL;
>> +	int i, nr_lruvec;
>>  	unsigned long flags = 0;
>> +	struct page *page;
>> +	struct lruvecs lruvecs;
>>
>> -	for (i = 0; i < pagevec_count(pvec); i++) {
>> -		struct page *page = pvec->pages[i];
>> +	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>
> Simply looping over the pvec multiple times (15 at most) for the
> different lruvecs would be better, because 1) it requires no extra data
> structures and therefore has better cache locality (theoretically
> faster), and 2) it only loops once when !CONFIG_MEMCG and !CONFIG_NUMA,
> and therefore has no impact on Android and Chrome OS.

With multiple memcgs it does help a lot; I got a 30% gain on the
readtwice case. But yes, without MEMCG and NUMA it's good to keep the
old behavior. So would you like to send a proposal for this?

Thanks
Alex
On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
> On 2020/11/26 12:52 PM, Yu Zhao wrote:
> >> [...]
> > Simply looping over the pvec multiple times (15 at most) for the
> > different lruvecs would be better, because 1) it requires no extra
> > data structures and therefore has better cache locality
> > (theoretically faster), and 2) it only loops once when !CONFIG_MEMCG
> > and !CONFIG_NUMA, and therefore has no impact on Android and Chrome OS.
>
> With multiple memcgs it does help a lot; I got a 30% gain on the
> readtwice case. But yes, without MEMCG and NUMA it's good to keep the
> old behavior. So would you like to send a proposal for this?

Oh, no, I'm not against your idea. I was saying it doesn't seem
necessary to sort -- a nested loop would just do the job given
pagevec is small.

diff --git a/mm/swap.c b/mm/swap.c
index cb3794e13b48..1d238edc2907 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
+		if (!page)
+			continue;
+
 		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+
+		for (j = i; j < pagevec_count(pvec); j++) {
+			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
+			    page_memcg(pvec->pages[j]) != page_memcg(page))
+				continue;
+
+			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
+			pvec->pages[j] = NULL;
+		}
 	}
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
On 2020/11/26 3:24 PM, Yu Zhao wrote:
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.
>
> [...]
>
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
> +				continue;
> +
> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
> +			pvec->pages[j] = NULL;
> +		}

I have to say your method is better than mine, and it could be reused
at all the relock_page_lruvec() call sites. I expect this could speed up
LRU performance a lot!

> 	}
> 	if (lruvec)
> 		unlock_page_lruvec_irqrestore(lruvec, flags);
On 11/26/20 4:12 AM, Alex Shi wrote:
> On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
>> On 11/20/20 9:27 AM, Alex Shi wrote:
>>> [...]
>>
>> Hm, once you sort the pages like this, it's a shame not to splice them
>> instead of more list_del() + list_add() iterations. update_lru_size()
>> could be also called once?
>
> Yes, it looks like a good idea to use splice instead of
> list_del/list_add, but pages may be on different LRU lists within the
> same lruvec, and may also come from different zones. That could mean
> five separate passes for the different lists, and more for the zones...

Hmm, zones wouldn't affect splicing (there's a per-node lru these days),
but would affect accounting. And yeah, there are 5 lru lists, and we
probably need to be under lru lock to stabilize where a page belongs, so
pre-sorting without lock wouldn't be safe? Bummer.

> so I gave up on that attempt.
On 11/26/20 8:24 AM, Yu Zhao wrote:
> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>> On 2020/11/26 12:52 PM, Yu Zhao wrote:
>> [...]
>>
>> With multiple memcgs it does help a lot; I got a 30% gain on the
>> readtwice case. But yes, without MEMCG and NUMA it's good to keep the
>> old behavior. So would you like to send a proposal for this?
>
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.

Yeah that could work. The worst case doesn't look nice (O(n^2)) but it
should be rather rare to have pages from really multiple memcgs/nodes?

Maybe with the following change? Avoids going through the nulls if we
got lucky (or have !MEMCG !NUMA).

> diff --git a/mm/swap.c b/mm/swap.c
> index cb3794e13b48..1d238edc2907 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>   */
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -	int i;
> +	int i, j;

	int i, j, n;

>  	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
>

	n = pagevec_count(pvec);

>  	for (i = 0; i < pagevec_count(pvec); i++) {

	for (i = 0; n; i++) {

>  		struct page *page = pvec->pages[i];
>
> +		if (!page)
> +			continue;
> +
>  		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> -		__pagevec_lru_add_fn(page, lruvec);

			n--;

> +
> +		for (j = i; j < pagevec_count(pvec); j++) {
> +			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> +			    page_memcg(pvec->pages[j]) != page_memcg(page))
> +				continue;
> +
> +			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
> +			pvec->pages[j] = NULL;

			n--;

> +		}
>  	}
>  	if (lruvec)
>  		unlock_page_lruvec_irqrestore(lruvec, flags);
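To make the proposal easier to follow at this point in the thread, here is a minimal consolidated sketch: Yu Zhao's nested loop with Vlastimil's early-exit counter folded in (decremented once per page placed), plus a NULL check in the inner loop for slots already handled on an earlier pass. This exact combination was not posted in the thread; it assumes the helpers used by the quoted diffs (relock_page_lruvec_irqsave(), page_memcg(), page_to_nid(), __pagevec_lru_add_fn()) from the per-memcg lru_lock series this patch applies on top of.

```c
/* Consolidated sketch only; not a patch posted in the thread. */
void __pagevec_lru_add(struct pagevec *pvec)
{
	int i, j, n;
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;

	n = pagevec_count(pvec);
	for (i = 0; n; i++) {
		struct page *page = pvec->pages[i];

		/* Slots cleared by an earlier inner pass are skipped. */
		if (!page)
			continue;

		/* Switch locks only when this page's lruvec differs. */
		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);

		/* Place every remaining page that maps to the same lruvec. */
		for (j = i; j < pagevec_count(pvec); j++) {
			if (!pvec->pages[j])
				continue;
			if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
			    page_memcg(pvec->pages[j]) != page_memcg(page))
				continue;

			__pagevec_lru_add_fn(pvec->pages[j], lruvec);
			pvec->pages[j] = NULL;
			n--;	/* lets the outer loop stop early */
		}
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);

	/*
	 * The unchanged tail of the function (release_pages() and
	 * pagevec_reinit()) is outside the quoted diff context; with the
	 * slots cleared above it would have to tolerate NULL entries.
	 */
}
```

With !CONFIG_MEMCG and !CONFIG_NUMA the nid/memcg comparison never differs, so the first inner pass places every page, n reaches zero, and the lock is taken exactly once, which is the point made above about Android and Chrome OS.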
On 11/26/20 12:22 PM, Vlastimil Babka wrote:
> On 11/26/20 8:24 AM, Yu Zhao wrote:
>> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>>> [...]
>>
>> Oh, no, I'm not against your idea. I was saying it doesn't seem
>> necessary to sort -- a nested loop would just do the job given
>> pagevec is small.
>
> Yeah that could work. The worst case doesn't look nice (O(n^2)) but it
> should be rather rare to have pages from really multiple memcgs/nodes?

However, Matthew wanted to increase pagevec size [1] and once 15^2
becomes 63^2, it starts to be somewhat more worrying.

[1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/

> Maybe with the following change? Avoids going through the nulls if we
> got lucky (or have !MEMCG !NUMA).
>
> [... suggested change quoted in full in the previous message ...]
On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
> However, Matthew wanted to increase pagevec size [1] and once 15^2
> becomes 63^2, it starts to be somewhat more worrying.
>
> [1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/

Well, Tim wanted it ;-)

I would suggest that rather than an insertion sort (or was it a bubble
sort?), we should be using a Shell sort. It's ideal for these kinds of
smallish arrays.

https://en.wikipedia.org/wiki/Shellsort
On 2020/11/26 11:55 PM, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
>> However, Matthew wanted to increase pagevec size [1] and once 15^2
>> becomes 63^2, it starts to be somewhat more worrying.
>>
>> [1] https://lore.kernel.org/linux-mm/20201105172651.2455-1-willy@infradead.org/
>
> Well, Tim wanted it ;-)
>
> I would suggest that rather than an insertion sort (or was it a bubble
> sort?), we should be using a Shell sort. It's ideal for these kinds of
> smallish arrays.
>
> https://en.wikipedia.org/wiki/Shellsort

That looks perfectly good! I'm going to look into it. :)

Thanks!
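For reference, a Shell sort over a pagevec keyed by lruvec could look roughly like the sketch below. This is only an illustration of the suggestion, not code from the thread: the gap sequence, the page_lruvec() shorthand, and the use of the lruvec pointer value as the sort key are assumptions made for the example.

```c
/*
 * Illustrative sketch only (not posted in the thread): group a pagevec's
 * pages by lruvec with a Shell sort before any lru_lock is taken.  The
 * relative order of different lruvecs does not matter, so the lruvec
 * pointer value serves as the sort key; page_lruvec() is a hypothetical
 * shorthand for the lookup the patch already does in sort_page_lruvec().
 */
static struct lruvec *page_lruvec(struct page *page)
{
	return mem_cgroup_page_lruvec(page, page_pgdat(page));
}

static void pagevec_sort_by_lruvec(struct pagevec *pvec)
{
	/* Any gap sequence ending in 1 is correct; this suits small arrays. */
	static const int gaps[] = { 7, 3, 1 };
	int n = pagevec_count(pvec);
	int g, i, j;

	for (g = 0; g < ARRAY_SIZE(gaps); g++) {
		int gap = gaps[g];

		for (i = gap; i < n; i++) {
			struct page *page = pvec->pages[i];
			unsigned long key = (unsigned long)page_lruvec(page);

			/* Gapped insertion: shift larger keys to the right. */
			for (j = i; j >= gap &&
			     (unsigned long)page_lruvec(pvec->pages[j - gap]) > key;
			     j -= gap)
				pvec->pages[j] = pvec->pages[j - gap];
			pvec->pages[j] = page;
		}
	}
}
```

After such a sort, pages sharing a lruvec sit next to each other, so __pagevec_lru_add() could take each lru_lock once per run; a real version would likely cache the keys in a small array rather than recompute page_lruvec() in the comparison.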
diff --git a/mm/swap.c b/mm/swap.c
index 490553f3f9ef..c787b38bf9c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 	trace_mm_lru_insertion(page, lru);
 }
 
+struct lruvecs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+
+/* Sort pvec pages on their lruvec */
+int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
+{
+	int i, j, nr_lruvec;
+	struct page *page;
+	struct lruvec *lruvec = NULL;
+
+	lruvecs->vecs[0] = NULL;
+	for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+		/* Try to find a same lruvec */
+		for (j = 0; j <= nr_lruvec; j++)
+			if (lruvec == lruvecs->vecs[j])
+				break;
+
+		/* A new lruvec */
+		if (j > nr_lruvec) {
+			INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
+			lruvecs->vecs[nr_lruvec] = lruvec;
+			j = nr_lruvec++;
+			lruvecs->vecs[nr_lruvec] = 0;
+		}
+
+		list_add_tail(&page->lru, &lruvecs->lists[j]);
+	}
+
+	return nr_lruvec;
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them. Reinitialises the caller's pagevec.
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
-	struct lruvec *lruvec = NULL;
+	int i, nr_lruvec;
 	unsigned long flags = 0;
+	struct page *page;
+	struct lruvecs lruvecs;
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+	nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+	for (i = 0; i < nr_lruvec; i++) {
+		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
+		while (!list_empty(&lruvecs.lists[i])) {
+			page = lru_to_page(&lruvecs.lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
+		}
+		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
The current relock logic changes the lru_lock whenever a new lruvec is
found, so if two memcgs are reading files or allocating pages at the
same time, they can end up holding the lru_lock alternately and waiting
for each other because of the fairness of the ticket spinlock.

This patch sorts the pages by lruvec and takes each lru_lock only once
in the above scenario, which reduces the fairness-induced waiting for
lock re-acquisition. With it, vm-scalability/case-lru-file-readtwice
gets a ~5% performance gain on my 2P*20core*HT machine.

Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 8 deletions(-)