Message ID: 20191107205334.158354-3-hannes@cmpxchg.org (mailing list archive)
State: New, archived
Series: mm: fix page aging across multiple cgroups
On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > We use refault information to determine whether the cache workingset > is stable or transitioning, and dynamically adjust the inactive:active > file LRU ratio so as to maximize protection from one-off cache during > stable periods, and minimize IO during transitions. > > With cgroups and their nested LRU lists, we currently don't do this > correctly. While recursive cgroup reclaim establishes a relative LRU > order among the pages of all involved cgroups, refaults only affect > the local LRU order in the cgroup in which they are occuring. As a > result, cache transitions can take longer in a cgrouped system as the > active pages of sibling cgroups aren't challenged when they should be. > > [ Right now, this is somewhat theoretical, because the siblings, under > continued regular reclaim pressure, should eventually run out of > inactive pages - and since inactive:active *size* balancing is also > done on a cgroup-local level, we will challenge the active pages > eventually in most cases. But the next patch will move that relative > size enforcement to the reclaim root as well, and then this patch > here will be necessary to propagate refault pressure to siblings. ] > > This patch moves refault detection to the root of reclaim. Instead of > remembering the cgroup owner of an evicted page, remember the cgroup > that caused the reclaim to happen. When refaults later occur, they'll > correctly influence the cross-cgroup LRU order that reclaim follows. I spent some time thinking about the idea of calculating refault distance using target_memcg's inactive_age and then activating refaulted page in (possibly) another memcg and I am still having trouble convincing myself that this should work correctly. However I also was unable to convince myself otherwise... We use refault distance to calculate the deficit in inactive LRU space and then activate the refaulted page if that distance is less that active+inactive LRU size. However making that decision based on LRU sizes of one memcg and then activating the page in another one seems very counterintuitive to me. Maybe that's just me though... > > I.e. if global reclaim kicked out pages in some subgroup A/B/C, the > refault of those pages will challenge the global LRU order, and not > just the local order down inside C. 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > include/linux/memcontrol.h | 5 +++ > include/linux/swap.h | 2 +- > mm/vmscan.c | 32 ++++++++--------- > mm/workingset.c | 72 +++++++++++++++++++++++++++++--------- > 4 files changed, 77 insertions(+), 34 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 5b86287fa069..a7a0a1a5c8d5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -901,6 +901,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, > return &pgdat->__lruvec; > } > > +static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) > +{ > + return NULL; > +} > + > static inline bool mm_match_cgroup(struct mm_struct *mm, > struct mem_cgroup *memcg) > { > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 063c0c1e112b..1e99f7ac1d7e 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -307,7 +307,7 @@ struct vma_swap_readahead { > }; > > /* linux/mm/workingset.c */ > -void *workingset_eviction(struct page *page); > +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg); > void workingset_refault(struct page *page, void *shadow); > void workingset_activation(struct page *page); > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index e8dd601e1fad..527617ee9b73 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -853,7 +853,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping) > * gets returned with a refcount of 0. > */ > static int __remove_mapping(struct address_space *mapping, struct page *page, > - bool reclaimed) > + bool reclaimed, struct mem_cgroup *target_memcg) > { > unsigned long flags; > int refcount; > @@ -925,7 +925,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > */ > if (reclaimed && page_is_file_cache(page) && > !mapping_exiting(mapping) && !dax_mapping(mapping)) > - shadow = workingset_eviction(page); > + shadow = workingset_eviction(page, target_memcg); > __delete_from_page_cache(page, shadow); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > @@ -948,7 +948,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > */ > int remove_mapping(struct address_space *mapping, struct page *page) > { > - if (__remove_mapping(mapping, page, false)) { > + if (__remove_mapping(mapping, page, false, NULL)) { > /* > * Unfreezing the refcount with 1 rather than 2 effectively > * drops the pagecache ref for us without requiring another > @@ -1426,7 +1426,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > count_vm_event(PGLAZYFREED); > count_memcg_page_event(page, PGLAZYFREED); > - } else if (!mapping || !__remove_mapping(mapping, page, true)) > + } else if (!mapping || !__remove_mapping(mapping, page, true, > + sc->target_mem_cgroup)) > goto keep_locked; > > unlock_page(page); > @@ -2189,6 +2190,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file, > enum lru_list inactive_lru = file * LRU_FILE; > unsigned long inactive, active; > unsigned long inactive_ratio; > + struct lruvec *target_lruvec; > unsigned long refaults; > unsigned long gb; > > @@ -2200,8 +2202,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file, > * is being established. Disable active list protection to get > * rid of the stale workingset quickly. 
> */ > - refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); > - if (file && lruvec->refaults != refaults) { > + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); > + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); > + if (file && target_lruvec->refaults != refaults) { > inactive_ratio = 0; > } else { > gb = (inactive + active) >> (30 - PAGE_SHIFT); > @@ -2973,19 +2976,14 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > sc->gfp_mask = orig_mask; > } > > -static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat) > +static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) > { > - struct mem_cgroup *memcg; > - > - memcg = mem_cgroup_iter(root_memcg, NULL, NULL); > - do { > - unsigned long refaults; > - struct lruvec *lruvec; > + struct lruvec *target_lruvec; > + unsigned long refaults; > > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); > - lruvec->refaults = refaults; > - } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL))); > + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); > + target_lruvec->refaults = refaults; > } > > /* > diff --git a/mm/workingset.c b/mm/workingset.c > index e8212123c1c3..f0885d9f41cd 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > *workingsetp = workingset; > } > > +static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) > +{ > + /* > + * Reclaiming a cgroup means reclaiming all its children in a > + * round-robin fashion. That means that each cgroup has an LRU > + * order that is composed of the LRU orders of its child > + * cgroups; and every page has an LRU position not just in the > + * cgroup that owns it, but in all of that group's ancestors. > + * > + * So when the physical inactive list of a leaf cgroup ages, > + * the virtual inactive lists of all its parents, including > + * the root cgroup's, age as well. > + */ > + do { > + struct lruvec *lruvec; > + > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + atomic_long_inc(&lruvec->inactive_age); > + } while (memcg && (memcg = parent_mem_cgroup(memcg))); > +} > + > /** > * workingset_eviction - note the eviction of a page from memory > + * @target_memcg: the cgroup that is causing the reclaim > * @page: the page being evicted > * > * Returns a shadow entry to be stored in @page->mapping->i_pages in place > * of the evicted @page so that a later refault can be detected. 
> */ > -void *workingset_eviction(struct page *page) > +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) > { > struct pglist_data *pgdat = page_pgdat(page); > - struct mem_cgroup *memcg = page_memcg(page); > - int memcgid = mem_cgroup_id(memcg); > unsigned long eviction; > struct lruvec *lruvec; > + int memcgid; > > /* Page is fully exclusive and pins page->mem_cgroup */ > VM_BUG_ON_PAGE(PageLRU(page), page); > VM_BUG_ON_PAGE(page_count(page), page); > VM_BUG_ON_PAGE(!PageLocked(page), page); > > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - eviction = atomic_long_inc_return(&lruvec->inactive_age); > + advance_inactive_age(page_memcg(page), pgdat); > + > + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > + /* XXX: target_memcg can be NULL, go through lruvec */ > + memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); > + eviction = atomic_long_read(&lruvec->inactive_age); > return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); > } > > @@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page) > * @shadow: shadow entry of the evicted page > * > * Calculates and evaluates the refault distance of the previously > - * evicted page in the context of the node it was allocated in. > + * evicted page in the context of the node and the memcg whose memory > + * pressure caused the eviction. > */ > void workingset_refault(struct page *page, void *shadow) > { > + struct mem_cgroup *eviction_memcg; > + struct lruvec *eviction_lruvec; > unsigned long refault_distance; > struct pglist_data *pgdat; > unsigned long active_file; > @@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow) > * would be better if the root_mem_cgroup existed in all > * configurations instead. > */ > - memcg = mem_cgroup_from_id(memcgid); > - if (!mem_cgroup_disabled() && !memcg) > + eviction_memcg = mem_cgroup_from_id(memcgid); > + if (!mem_cgroup_disabled() && !eviction_memcg) > goto out; > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - refault = atomic_long_read(&lruvec->inactive_age); > - active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES); > + eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); > + refault = atomic_long_read(&eviction_lruvec->inactive_age); > + active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); > > /* > * Calculate the refault distance > @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow) > */ > refault_distance = (refault - eviction) & EVICTION_MASK; > > + /* > + * The activation decision for this page is made at the level > + * where the eviction occurred, as that is where the LRU order > + * during page reclaim is being determined. > + * > + * However, the cgroup that will own the page is the one that > + * is actually experiencing the refault event. > + */ > + memcg = get_mem_cgroup_from_mm(current->mm); > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + > inc_lruvec_state(lruvec, WORKINGSET_REFAULT); > > /* > @@ -310,10 +349,10 @@ void workingset_refault(struct page *page, void *shadow) > * the memory was available to the page cache. 
> */ > if (refault_distance > active_file) > - goto out; > + goto out_memcg; > > SetPageActive(page); > - atomic_long_inc(&lruvec->inactive_age); > + advance_inactive_age(memcg, pgdat); > inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); > > /* Page was active prior to eviction */ > @@ -321,6 +360,9 @@ void workingset_refault(struct page *page, void *shadow) > SetPageWorkingset(page); > inc_lruvec_state(lruvec, WORKINGSET_RESTORE); > } > + > +out_memcg: > + mem_cgroup_put(memcg); > out: > rcu_read_unlock(); > } > @@ -332,7 +374,6 @@ void workingset_refault(struct page *page, void *shadow) > void workingset_activation(struct page *page) > { > struct mem_cgroup *memcg; > - struct lruvec *lruvec; > > rcu_read_lock(); > /* > @@ -345,8 +386,7 @@ void workingset_activation(struct page *page) > memcg = page_memcg_rcu(page); > if (!mem_cgroup_disabled() && !memcg) > goto out; > - lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page)); > - atomic_long_inc(&lruvec->inactive_age); > + advance_inactive_age(memcg, page_pgdat(page)); > out: > rcu_read_unlock(); > } > -- > 2.24.0 >
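For reference, here is a minimal standalone sketch of the activation decision Suren is reasoning about above (this is not the kernel code; the helper name, the simplified mask and the numbers are invented for illustration). A refaulting page is activated when the distance it travelled since eviction, measured in inactive_age increments of the lruvec that drove the eviction, fits within that lruvec's active file list:

/*
 * Standalone sketch, not kernel code: the refault-distance test from the
 * patch above, with made-up numbers and a simplified eviction mask.
 */
#include <stdio.h>

#define EVICTION_MASK 0xffffffffUL	/* the kernel packs more state into the shadow entry */

static int should_activate(unsigned long eviction, unsigned long refault,
			    unsigned long active_file)
{
	/* how far the inactive list moved between eviction and refault */
	unsigned long refault_distance = (refault - eviction) & EVICTION_MASK;

	/*
	 * The page would have stayed resident had the inactive list been
	 * active_file pages bigger, so challenge the active list.
	 */
	return refault_distance <= active_file;
}

int main(void)
{
	unsigned long eviction = 1000;		/* inactive_age when the page was evicted */
	unsigned long refault = 1800;		/* inactive_age when it refaulted */
	unsigned long active_file = 1200;	/* pages on the active file list */

	printf("activate: %s\n",
	       should_activate(eviction, refault, active_file) ? "yes" : "no");
	return 0;
}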
On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote: > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > We use refault information to determine whether the cache workingset > > is stable or transitioning, and dynamically adjust the inactive:active > > file LRU ratio so as to maximize protection from one-off cache during > > stable periods, and minimize IO during transitions. > > > > With cgroups and their nested LRU lists, we currently don't do this > > correctly. While recursive cgroup reclaim establishes a relative LRU > > order among the pages of all involved cgroups, refaults only affect > > the local LRU order in the cgroup in which they are occuring. As a > > result, cache transitions can take longer in a cgrouped system as the > > active pages of sibling cgroups aren't challenged when they should be. > > > > [ Right now, this is somewhat theoretical, because the siblings, under > > continued regular reclaim pressure, should eventually run out of > > inactive pages - and since inactive:active *size* balancing is also > > done on a cgroup-local level, we will challenge the active pages > > eventually in most cases. But the next patch will move that relative > > size enforcement to the reclaim root as well, and then this patch > > here will be necessary to propagate refault pressure to siblings. ] > > > > This patch moves refault detection to the root of reclaim. Instead of > > remembering the cgroup owner of an evicted page, remember the cgroup > > that caused the reclaim to happen. When refaults later occur, they'll > > correctly influence the cross-cgroup LRU order that reclaim follows. > > I spent some time thinking about the idea of calculating refault > distance using target_memcg's inactive_age and then activating > refaulted page in (possibly) another memcg and I am still having > trouble convincing myself that this should work correctly. However I > also was unable to convince myself otherwise... We use refault > distance to calculate the deficit in inactive LRU space and then > activate the refaulted page if that distance is less that > active+inactive LRU size. However making that decision based on LRU > sizes of one memcg and then activating the page in another one seems > very counterintuitive to me. Maybe that's just me though... It's not activating in a random, unrelated memcg - it's the parental relationship that makes it work. If you have a cgroup tree root | A / \ B1 B2 and reclaim is driven by a limit in A, we are reclaiming the pages in B1 and B2 as if they were on a single LRU list A (it's approximated by the round-robin reclaim and has some caveats, but that's the idea). So when a page that belongs to B2 gets evicted, it gets evicted from virtual LRU list A. When it refaults later, we make the (in)active size and distance comparisons against virtual LRU list A as well. The pages on the physical LRU list B2 are not just ordered relative to its B2 peers, they are also ordered relative to the pages in B1. And that of course is necessary if we want fair competition between them under shared reclaim pressure from A.
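The hierarchical aging that makes this "virtual LRU list" work can be pictured with a toy model (again not kernel code; the struct layout and names are invented for the sketch): every eviction in a leaf advances the clock of all its ancestors, so pages in B1 and B2 end up measured against one shared clock at A.

/*
 * Toy model of advance_inactive_age() from the patch: aging a leaf's
 * inactive list also ages every ancestor's virtual inactive list.
 * Not kernel code; all names here are illustrative.
 */
#include <stdio.h>

struct memcg {
	const char *name;
	struct memcg *parent;
	unsigned long inactive_age;
};

static void advance_inactive_age(struct memcg *memcg)
{
	for (; memcg; memcg = memcg->parent)
		memcg->inactive_age++;
}

int main(void)
{
	struct memcg root = { "root", NULL,  0 };
	struct memcg a    = { "A",    &root, 0 };
	struct memcg b1   = { "B1",   &a,    0 };
	struct memcg b2   = { "B2",   &a,    0 };

	advance_inactive_age(&b2);	/* an eviction in B2 ... */
	advance_inactive_age(&b1);	/* ... and one in B1 */

	/* A's virtual list saw both events, each sibling saw only its own */
	printf("A=%lu B1=%lu B2=%lu root=%lu\n",
	       a.inactive_age, b1.inactive_age, b2.inactive_age, root.inactive_age);
	return 0;
}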
On Tue, Nov 12, 2019 at 9:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote: > > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > We use refault information to determine whether the cache workingset > > > is stable or transitioning, and dynamically adjust the inactive:active > > > file LRU ratio so as to maximize protection from one-off cache during > > > stable periods, and minimize IO during transitions. > > > > > > With cgroups and their nested LRU lists, we currently don't do this > > > correctly. While recursive cgroup reclaim establishes a relative LRU > > > order among the pages of all involved cgroups, refaults only affect > > > the local LRU order in the cgroup in which they are occuring. As a > > > result, cache transitions can take longer in a cgrouped system as the > > > active pages of sibling cgroups aren't challenged when they should be. > > > > > > [ Right now, this is somewhat theoretical, because the siblings, under > > > continued regular reclaim pressure, should eventually run out of > > > inactive pages - and since inactive:active *size* balancing is also > > > done on a cgroup-local level, we will challenge the active pages > > > eventually in most cases. But the next patch will move that relative > > > size enforcement to the reclaim root as well, and then this patch > > > here will be necessary to propagate refault pressure to siblings. ] > > > > > > This patch moves refault detection to the root of reclaim. Instead of > > > remembering the cgroup owner of an evicted page, remember the cgroup > > > that caused the reclaim to happen. When refaults later occur, they'll > > > correctly influence the cross-cgroup LRU order that reclaim follows. > > > > I spent some time thinking about the idea of calculating refault > > distance using target_memcg's inactive_age and then activating > > refaulted page in (possibly) another memcg and I am still having > > trouble convincing myself that this should work correctly. However I > > also was unable to convince myself otherwise... We use refault > > distance to calculate the deficit in inactive LRU space and then > > activate the refaulted page if that distance is less that > > active+inactive LRU size. However making that decision based on LRU > > sizes of one memcg and then activating the page in another one seems > > very counterintuitive to me. Maybe that's just me though... > > It's not activating in a random, unrelated memcg - it's the parental > relationship that makes it work. > > If you have a cgroup tree > > root > | > A > / \ > B1 B2 > > and reclaim is driven by a limit in A, we are reclaiming the pages in > B1 and B2 as if they were on a single LRU list A (it's approximated by > the round-robin reclaim and has some caveats, but that's the idea). > > So when a page that belongs to B2 gets evicted, it gets evicted from > virtual LRU list A. When it refaults later, we make the (in)active > size and distance comparisons against virtual LRU list A as well. > > The pages on the physical LRU list B2 are not just ordered relative to > its B2 peers, they are also ordered relative to the pages in B1. And > that of course is necessary if we want fair competition between them > under shared reclaim pressure from A. Thanks for clarification. 
The testcase in your description, where group B has a large inactive cache that does not get reclaimed while its sibling group A has to drop its active cache, gave me the impression that sibling cgroups (B1 and B2 in your reply above) can cause memory pressure on each other. Maybe that's not a legitimate case, and B1 would not cause pressure in B2 without causing pressure in their shared parent A? It now makes more sense to me and I just want to confirm that this is the case.
On Tue, Nov 12, 2019 at 10:45:44AM -0800, Suren Baghdasaryan wrote: > On Tue, Nov 12, 2019 at 9:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote: > > > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > We use refault information to determine whether the cache workingset > > > > is stable or transitioning, and dynamically adjust the inactive:active > > > > file LRU ratio so as to maximize protection from one-off cache during > > > > stable periods, and minimize IO during transitions. > > > > > > > > With cgroups and their nested LRU lists, we currently don't do this > > > > correctly. While recursive cgroup reclaim establishes a relative LRU > > > > order among the pages of all involved cgroups, refaults only affect > > > > the local LRU order in the cgroup in which they are occuring. As a > > > > result, cache transitions can take longer in a cgrouped system as the > > > > active pages of sibling cgroups aren't challenged when they should be. > > > > > > > > [ Right now, this is somewhat theoretical, because the siblings, under > > > > continued regular reclaim pressure, should eventually run out of > > > > inactive pages - and since inactive:active *size* balancing is also > > > > done on a cgroup-local level, we will challenge the active pages > > > > eventually in most cases. But the next patch will move that relative > > > > size enforcement to the reclaim root as well, and then this patch > > > > here will be necessary to propagate refault pressure to siblings. ] > > > > > > > > This patch moves refault detection to the root of reclaim. Instead of > > > > remembering the cgroup owner of an evicted page, remember the cgroup > > > > that caused the reclaim to happen. When refaults later occur, they'll > > > > correctly influence the cross-cgroup LRU order that reclaim follows. > > > > > > I spent some time thinking about the idea of calculating refault > > > distance using target_memcg's inactive_age and then activating > > > refaulted page in (possibly) another memcg and I am still having > > > trouble convincing myself that this should work correctly. However I > > > also was unable to convince myself otherwise... We use refault > > > distance to calculate the deficit in inactive LRU space and then > > > activate the refaulted page if that distance is less that > > > active+inactive LRU size. However making that decision based on LRU > > > sizes of one memcg and then activating the page in another one seems > > > very counterintuitive to me. Maybe that's just me though... > > > > It's not activating in a random, unrelated memcg - it's the parental > > relationship that makes it work. > > > > If you have a cgroup tree > > > > root > > | > > A > > / \ > > B1 B2 > > > > and reclaim is driven by a limit in A, we are reclaiming the pages in > > B1 and B2 as if they were on a single LRU list A (it's approximated by > > the round-robin reclaim and has some caveats, but that's the idea). > > > > So when a page that belongs to B2 gets evicted, it gets evicted from > > virtual LRU list A. When it refaults later, we make the (in)active > > size and distance comparisons against virtual LRU list A as well. > > > > The pages on the physical LRU list B2 are not just ordered relative to > > its B2 peers, they are also ordered relative to the pages in B1. And > > that of course is necessary if we want fair competition between them > > under shared reclaim pressure from A. 
> > Thanks for clarification. The testcase in your description when group > B has a large inactive cache which does not get reclaimed while its > sibling group A has to drop its active cache got me under the > impression that sibling cgroups (in your reply above B1 and B2) can > cause memory pressure in each other. Maybe that's not a legit case and > B1 would not cause pressure in B2 without causing pressure in their > shared parent A? It now makes more sense to me and I want to confirm > that is the case. Yes. I'm sorry if this was misleading. They should only cause pressure onto each other by causing pressure on A; and then reclaim in A treats them as one combined pool of pages.
On Tue, Nov 12, 2019 at 10:59 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Nov 12, 2019 at 10:45:44AM -0800, Suren Baghdasaryan wrote: > > On Tue, Nov 12, 2019 at 9:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Sun, Nov 10, 2019 at 06:01:18PM -0800, Suren Baghdasaryan wrote: > > > > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > > > We use refault information to determine whether the cache workingset > > > > > is stable or transitioning, and dynamically adjust the inactive:active > > > > > file LRU ratio so as to maximize protection from one-off cache during > > > > > stable periods, and minimize IO during transitions. > > > > > > > > > > With cgroups and their nested LRU lists, we currently don't do this > > > > > correctly. While recursive cgroup reclaim establishes a relative LRU > > > > > order among the pages of all involved cgroups, refaults only affect > > > > > the local LRU order in the cgroup in which they are occuring. As a > > > > > result, cache transitions can take longer in a cgrouped system as the > > > > > active pages of sibling cgroups aren't challenged when they should be. > > > > > > > > > > [ Right now, this is somewhat theoretical, because the siblings, under > > > > > continued regular reclaim pressure, should eventually run out of > > > > > inactive pages - and since inactive:active *size* balancing is also > > > > > done on a cgroup-local level, we will challenge the active pages > > > > > eventually in most cases. But the next patch will move that relative > > > > > size enforcement to the reclaim root as well, and then this patch > > > > > here will be necessary to propagate refault pressure to siblings. ] > > > > > > > > > > This patch moves refault detection to the root of reclaim. Instead of > > > > > remembering the cgroup owner of an evicted page, remember the cgroup > > > > > that caused the reclaim to happen. When refaults later occur, they'll > > > > > correctly influence the cross-cgroup LRU order that reclaim follows. > > > > > > > > I spent some time thinking about the idea of calculating refault > > > > distance using target_memcg's inactive_age and then activating > > > > refaulted page in (possibly) another memcg and I am still having > > > > trouble convincing myself that this should work correctly. However I > > > > also was unable to convince myself otherwise... We use refault > > > > distance to calculate the deficit in inactive LRU space and then > > > > activate the refaulted page if that distance is less that > > > > active+inactive LRU size. However making that decision based on LRU > > > > sizes of one memcg and then activating the page in another one seems > > > > very counterintuitive to me. Maybe that's just me though... > > > > > > It's not activating in a random, unrelated memcg - it's the parental > > > relationship that makes it work. > > > > > > If you have a cgroup tree > > > > > > root > > > | > > > A > > > / \ > > > B1 B2 > > > > > > and reclaim is driven by a limit in A, we are reclaiming the pages in > > > B1 and B2 as if they were on a single LRU list A (it's approximated by > > > the round-robin reclaim and has some caveats, but that's the idea). > > > > > > So when a page that belongs to B2 gets evicted, it gets evicted from > > > virtual LRU list A. When it refaults later, we make the (in)active > > > size and distance comparisons against virtual LRU list A as well. 
> > > > > > The pages on the physical LRU list B2 are not just ordered relative to > > > its B2 peers, they are also ordered relative to the pages in B1. And > > > that of course is necessary if we want fair competition between them > > > under shared reclaim pressure from A. > > > > Thanks for clarification. The testcase in your description when group > > B has a large inactive cache which does not get reclaimed while its > > sibling group A has to drop its active cache got me under the > > impression that sibling cgroups (in your reply above B1 and B2) can > > cause memory pressure in each other. Maybe that's not a legit case and > > B1 would not cause pressure in B2 without causing pressure in their > > shared parent A? It now makes more sense to me and I want to confirm > > that is the case. > > Yes. I'm sorry if this was misleading. They should only cause pressure > onto each other by causing pressure on A; and then reclaim in A treats > them as one combined pool of pages. Reviewed-by: Suren Baghdasaryan <surenb@google.com>
On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > We use refault information to determine whether the cache workingset > is stable or transitioning, and dynamically adjust the inactive:active > file LRU ratio so as to maximize protection from one-off cache during > stable periods, and minimize IO during transitions. > > With cgroups and their nested LRU lists, we currently don't do this > correctly. While recursive cgroup reclaim establishes a relative LRU > order among the pages of all involved cgroups, refaults only affect > the local LRU order in the cgroup in which they are occuring. As a > result, cache transitions can take longer in a cgrouped system as the > active pages of sibling cgroups aren't challenged when they should be. > > [ Right now, this is somewhat theoretical, because the siblings, under > continued regular reclaim pressure, should eventually run out of > inactive pages - and since inactive:active *size* balancing is also > done on a cgroup-local level, we will challenge the active pages > eventually in most cases. But the next patch will move that relative > size enforcement to the reclaim root as well, and then this patch > here will be necessary to propagate refault pressure to siblings. ] > > This patch moves refault detection to the root of reclaim. Instead of > remembering the cgroup owner of an evicted page, remember the cgroup > that caused the reclaim to happen. When refaults later occur, they'll > correctly influence the cross-cgroup LRU order that reclaim follows. Can you please explain how "they'll correctly influence"? I see that if the refaulted page was evicted due to pressure in some ancestor, then that's ancestor's refault distance and active file size will be used to decide to activate the refaulted page but how that is influencing cross-cgroup LRUs? > > I.e. if global reclaim kicked out pages in some subgroup A/B/C, the > refault of those pages will challenge the global LRU order, and not > just the local order down inside C. 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > include/linux/memcontrol.h | 5 +++ > include/linux/swap.h | 2 +- > mm/vmscan.c | 32 ++++++++--------- > mm/workingset.c | 72 +++++++++++++++++++++++++++++--------- > 4 files changed, 77 insertions(+), 34 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 5b86287fa069..a7a0a1a5c8d5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -901,6 +901,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, > return &pgdat->__lruvec; > } > > +static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) > +{ > + return NULL; > +} > + > static inline bool mm_match_cgroup(struct mm_struct *mm, > struct mem_cgroup *memcg) > { > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 063c0c1e112b..1e99f7ac1d7e 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -307,7 +307,7 @@ struct vma_swap_readahead { > }; > > /* linux/mm/workingset.c */ > -void *workingset_eviction(struct page *page); > +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg); > void workingset_refault(struct page *page, void *shadow); > void workingset_activation(struct page *page); > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index e8dd601e1fad..527617ee9b73 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -853,7 +853,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping) > * gets returned with a refcount of 0. > */ > static int __remove_mapping(struct address_space *mapping, struct page *page, > - bool reclaimed) > + bool reclaimed, struct mem_cgroup *target_memcg) > { > unsigned long flags; > int refcount; > @@ -925,7 +925,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > */ > if (reclaimed && page_is_file_cache(page) && > !mapping_exiting(mapping) && !dax_mapping(mapping)) > - shadow = workingset_eviction(page); > + shadow = workingset_eviction(page, target_memcg); > __delete_from_page_cache(page, shadow); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > @@ -948,7 +948,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > */ > int remove_mapping(struct address_space *mapping, struct page *page) > { > - if (__remove_mapping(mapping, page, false)) { > + if (__remove_mapping(mapping, page, false, NULL)) { > /* > * Unfreezing the refcount with 1 rather than 2 effectively > * drops the pagecache ref for us without requiring another > @@ -1426,7 +1426,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > count_vm_event(PGLAZYFREED); > count_memcg_page_event(page, PGLAZYFREED); > - } else if (!mapping || !__remove_mapping(mapping, page, true)) > + } else if (!mapping || !__remove_mapping(mapping, page, true, > + sc->target_mem_cgroup)) > goto keep_locked; > > unlock_page(page); > @@ -2189,6 +2190,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file, > enum lru_list inactive_lru = file * LRU_FILE; > unsigned long inactive, active; > unsigned long inactive_ratio; > + struct lruvec *target_lruvec; > unsigned long refaults; > unsigned long gb; > > @@ -2200,8 +2202,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file, > * is being established. Disable active list protection to get > * rid of the stale workingset quickly. 
> */ > - refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); > - if (file && lruvec->refaults != refaults) { > + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); > + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); > + if (file && target_lruvec->refaults != refaults) { > inactive_ratio = 0; > } else { > gb = (inactive + active) >> (30 - PAGE_SHIFT); > @@ -2973,19 +2976,14 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > sc->gfp_mask = orig_mask; > } > > -static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat) > +static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) > { > - struct mem_cgroup *memcg; > - > - memcg = mem_cgroup_iter(root_memcg, NULL, NULL); > - do { > - unsigned long refaults; > - struct lruvec *lruvec; > + struct lruvec *target_lruvec; > + unsigned long refaults; > > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); > - lruvec->refaults = refaults; > - } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL))); > + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); > + target_lruvec->refaults = refaults; > } > > /* > diff --git a/mm/workingset.c b/mm/workingset.c > index e8212123c1c3..f0885d9f41cd 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > *workingsetp = workingset; > } > > +static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) > +{ > + /* > + * Reclaiming a cgroup means reclaiming all its children in a > + * round-robin fashion. That means that each cgroup has an LRU > + * order that is composed of the LRU orders of its child > + * cgroups; and every page has an LRU position not just in the > + * cgroup that owns it, but in all of that group's ancestors. > + * > + * So when the physical inactive list of a leaf cgroup ages, > + * the virtual inactive lists of all its parents, including > + * the root cgroup's, age as well. > + */ > + do { > + struct lruvec *lruvec; > + > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + atomic_long_inc(&lruvec->inactive_age); > + } while (memcg && (memcg = parent_mem_cgroup(memcg))); > +} > + > /** > * workingset_eviction - note the eviction of a page from memory > + * @target_memcg: the cgroup that is causing the reclaim > * @page: the page being evicted > * > * Returns a shadow entry to be stored in @page->mapping->i_pages in place > * of the evicted @page so that a later refault can be detected. 
> */ > -void *workingset_eviction(struct page *page) > +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) > { > struct pglist_data *pgdat = page_pgdat(page); > - struct mem_cgroup *memcg = page_memcg(page); > - int memcgid = mem_cgroup_id(memcg); > unsigned long eviction; > struct lruvec *lruvec; > + int memcgid; > > /* Page is fully exclusive and pins page->mem_cgroup */ > VM_BUG_ON_PAGE(PageLRU(page), page); > VM_BUG_ON_PAGE(page_count(page), page); > VM_BUG_ON_PAGE(!PageLocked(page), page); > > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - eviction = atomic_long_inc_return(&lruvec->inactive_age); > + advance_inactive_age(page_memcg(page), pgdat); > + > + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > + /* XXX: target_memcg can be NULL, go through lruvec */ > + memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); > + eviction = atomic_long_read(&lruvec->inactive_age); > return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); > } > > @@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page) > * @shadow: shadow entry of the evicted page > * > * Calculates and evaluates the refault distance of the previously > - * evicted page in the context of the node it was allocated in. > + * evicted page in the context of the node and the memcg whose memory > + * pressure caused the eviction. > */ > void workingset_refault(struct page *page, void *shadow) > { > + struct mem_cgroup *eviction_memcg; > + struct lruvec *eviction_lruvec; > unsigned long refault_distance; > struct pglist_data *pgdat; > unsigned long active_file; > @@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow) > * would be better if the root_mem_cgroup existed in all > * configurations instead. > */ > - memcg = mem_cgroup_from_id(memcgid); > - if (!mem_cgroup_disabled() && !memcg) > + eviction_memcg = mem_cgroup_from_id(memcgid); > + if (!mem_cgroup_disabled() && !eviction_memcg) > goto out; > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - refault = atomic_long_read(&lruvec->inactive_age); > - active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES); > + eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); > + refault = atomic_long_read(&eviction_lruvec->inactive_age); > + active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); > > /* > * Calculate the refault distance > @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow) > */ > refault_distance = (refault - eviction) & EVICTION_MASK; > > + /* > + * The activation decision for this page is made at the level > + * where the eviction occurred, as that is where the LRU order > + * during page reclaim is being determined. > + * > + * However, the cgroup that will own the page is the one that > + * is actually experiencing the refault event. > + */ > + memcg = get_mem_cgroup_from_mm(current->mm); Why not page_memcg(page)? page is locked. > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + > inc_lruvec_state(lruvec, WORKINGSET_REFAULT); > > /* > @@ -310,10 +349,10 @@ void workingset_refault(struct page *page, void *shadow) > * the memory was available to the page cache. 
> */ > if (refault_distance > active_file) > - goto out; > + goto out_memcg; > > SetPageActive(page); > - atomic_long_inc(&lruvec->inactive_age); > + advance_inactive_age(memcg, pgdat); > inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); > > /* Page was active prior to eviction */ > @@ -321,6 +360,9 @@ void workingset_refault(struct page *page, void *shadow) > SetPageWorkingset(page); > inc_lruvec_state(lruvec, WORKINGSET_RESTORE); > } > + > +out_memcg: > + mem_cgroup_put(memcg); > out: > rcu_read_unlock(); > } > @@ -332,7 +374,6 @@ void workingset_refault(struct page *page, void *shadow) > void workingset_activation(struct page *page) > { > struct mem_cgroup *memcg; > - struct lruvec *lruvec; > > rcu_read_lock(); > /* > @@ -345,8 +386,7 @@ void workingset_activation(struct page *page) > memcg = page_memcg_rcu(page); > if (!mem_cgroup_disabled() && !memcg) > goto out; > - lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page)); > - atomic_long_inc(&lruvec->inactive_age); > + advance_inactive_age(memcg, page_pgdat(page)); > out: > rcu_read_unlock(); > } > -- > 2.24.0 >
On Thu, Nov 14, 2019 at 03:47:59PM -0800, Shakeel Butt wrote: > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > We use refault information to determine whether the cache workingset > > is stable or transitioning, and dynamically adjust the inactive:active > > file LRU ratio so as to maximize protection from one-off cache during > > stable periods, and minimize IO during transitions. > > > > With cgroups and their nested LRU lists, we currently don't do this > > correctly. While recursive cgroup reclaim establishes a relative LRU > > order among the pages of all involved cgroups, refaults only affect > > the local LRU order in the cgroup in which they are occuring. As a > > result, cache transitions can take longer in a cgrouped system as the > > active pages of sibling cgroups aren't challenged when they should be. > > > > [ Right now, this is somewhat theoretical, because the siblings, under > > continued regular reclaim pressure, should eventually run out of > > inactive pages - and since inactive:active *size* balancing is also > > done on a cgroup-local level, we will challenge the active pages > > eventually in most cases. But the next patch will move that relative > > size enforcement to the reclaim root as well, and then this patch > > here will be necessary to propagate refault pressure to siblings. ] > > > > This patch moves refault detection to the root of reclaim. Instead of > > remembering the cgroup owner of an evicted page, remember the cgroup > > that caused the reclaim to happen. When refaults later occur, they'll > > correctly influence the cross-cgroup LRU order that reclaim follows. > > Can you please explain how "they'll correctly influence"? I see that > if the refaulted page was evicted due to pressure in some ancestor, > then that's ancestor's refault distance and active file size will be > used to decide to activate the refaulted page but how that is > influencing cross-cgroup LRUs? I take it the next patch answered your question: Activating a page inside a cgroup has an effect on how it's reclaimed relative to pages in sibling cgroups. So the influence part isn't new with this change - it's about recognizing that an activation has an effect on a wider scope than just the local cgroup, and considering that scope when making the decision whether to activate or not. > > @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow) > > */ > > refault_distance = (refault - eviction) & EVICTION_MASK; > > > > + /* > > + * The activation decision for this page is made at the level > > + * where the eviction occurred, as that is where the LRU order > > + * during page reclaim is being determined. > > + * > > + * However, the cgroup that will own the page is the one that > > + * is actually experiencing the refault event. > > + */ > > + memcg = get_mem_cgroup_from_mm(current->mm); > > Why not page_memcg(page)? page is locked. Nice catch! Indeed, the page is charged and locked at this point, so we don't have to do another lookup and refcounting dance here. Delta patch: --- From 8984f37f3e88b1b39c7d6470b313730093b24474 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Fri, 15 Nov 2019 09:14:04 -0500 Subject: [PATCH] mm: vmscan: detect file thrashing at the reclaim root fix Shakeel points out that the page is locked and already charged by the time we call workingset_refault(). Instead of doing another cgroup lookup and reference from current->mm we can simply use page_memcg(). 
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/workingset.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/mm/workingset.c b/mm/workingset.c index f0885d9f41cd..474186b76ced 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -338,7 +338,7 @@ void workingset_refault(struct page *page, void *shadow) * However, the cgroup that will own the page is the one that * is actually experiencing the refault event. */ - memcg = get_mem_cgroup_from_mm(current->mm); + memcg = page_memcg(page); lruvec = mem_cgroup_lruvec(memcg, pgdat); inc_lruvec_state(lruvec, WORKINGSET_REFAULT); @@ -349,7 +349,7 @@ void workingset_refault(struct page *page, void *shadow) * the memory was available to the page cache. */ if (refault_distance > active_file) - goto out_memcg; + goto out; SetPageActive(page); advance_inactive_age(memcg, pgdat); @@ -360,9 +360,6 @@ void workingset_refault(struct page *page, void *shadow) SetPageWorkingset(page); inc_lruvec_state(lruvec, WORKINGSET_RESTORE); } - -out_memcg: - mem_cgroup_put(memcg); out: rcu_read_unlock(); }
On Fri, Nov 15, 2019 at 8:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Nov 14, 2019 at 03:47:59PM -0800, Shakeel Butt wrote: > > On Thu, Nov 7, 2019 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > We use refault information to determine whether the cache workingset > > > is stable or transitioning, and dynamically adjust the inactive:active > > > file LRU ratio so as to maximize protection from one-off cache during > > > stable periods, and minimize IO during transitions. > > > > > > With cgroups and their nested LRU lists, we currently don't do this > > > correctly. While recursive cgroup reclaim establishes a relative LRU > > > order among the pages of all involved cgroups, refaults only affect > > > the local LRU order in the cgroup in which they are occuring. As a > > > result, cache transitions can take longer in a cgrouped system as the > > > active pages of sibling cgroups aren't challenged when they should be. > > > > > > [ Right now, this is somewhat theoretical, because the siblings, under > > > continued regular reclaim pressure, should eventually run out of > > > inactive pages - and since inactive:active *size* balancing is also > > > done on a cgroup-local level, we will challenge the active pages > > > eventually in most cases. But the next patch will move that relative > > > size enforcement to the reclaim root as well, and then this patch > > > here will be necessary to propagate refault pressure to siblings. ] > > > > > > This patch moves refault detection to the root of reclaim. Instead of > > > remembering the cgroup owner of an evicted page, remember the cgroup > > > that caused the reclaim to happen. When refaults later occur, they'll > > > correctly influence the cross-cgroup LRU order that reclaim follows. > > > > Can you please explain how "they'll correctly influence"? I see that > > if the refaulted page was evicted due to pressure in some ancestor, > > then that's ancestor's refault distance and active file size will be > > used to decide to activate the refaulted page but how that is > > influencing cross-cgroup LRUs? > > I take it the next patch answered your question: Activating a page > inside a cgroup has an effect on how it's reclaimed relative to pages > in sibling cgroups. So the influence part isn't new with this change - > it's about recognizing that an activation has an effect on a wider > scope than just the local cgroup, and considering that scope when > making the decision whether to activate or not. > Thanks for the clarification. > > > @@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow) > > > */ > > > refault_distance = (refault - eviction) & EVICTION_MASK; > > > > > > + /* > > > + * The activation decision for this page is made at the level > > > + * where the eviction occurred, as that is where the LRU order > > > + * during page reclaim is being determined. > > > + * > > > + * However, the cgroup that will own the page is the one that > > > + * is actually experiencing the refault event. > > > + */ > > > + memcg = get_mem_cgroup_from_mm(current->mm); > > > > Why not page_memcg(page)? page is locked. > > Nice catch! Indeed, the page is charged and locked at this point, so > we don't have to do another lookup and refcounting dance here. 
> > Delta patch: > > --- > > From 8984f37f3e88b1b39c7d6470b313730093b24474 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <hannes@cmpxchg.org> > Date: Fri, 15 Nov 2019 09:14:04 -0500 > Subject: [PATCH] mm: vmscan: detect file thrashing at the reclaim root fix > > Shakeel points out that the page is locked and already charged by the > time we call workingset_refault(). Instead of doing another cgroup > lookup and reference from current->mm we can simply use page_memcg(). > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> For the complete patch: Reviewed-by: Shakeel Butt <shakeelb@google.com> > --- > mm/workingset.c | 7 ++----- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/mm/workingset.c b/mm/workingset.c > index f0885d9f41cd..474186b76ced 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -338,7 +338,7 @@ void workingset_refault(struct page *page, void *shadow) > * However, the cgroup that will own the page is the one that > * is actually experiencing the refault event. > */ > - memcg = get_mem_cgroup_from_mm(current->mm); > + memcg = page_memcg(page); > lruvec = mem_cgroup_lruvec(memcg, pgdat); > > inc_lruvec_state(lruvec, WORKINGSET_REFAULT); > @@ -349,7 +349,7 @@ void workingset_refault(struct page *page, void *shadow) > * the memory was available to the page cache. > */ > if (refault_distance > active_file) > - goto out_memcg; > + goto out; > > SetPageActive(page); > advance_inactive_age(memcg, pgdat); > @@ -360,9 +360,6 @@ void workingset_refault(struct page *page, void *shadow) > SetPageWorkingset(page); > inc_lruvec_state(lruvec, WORKINGSET_RESTORE); > } > - > -out_memcg: > - mem_cgroup_put(memcg); > out: > rcu_read_unlock(); > } > -- > 2.24.0 >
Hello, Johannes. When I tested my patchset on v5.5, I found that my patchset doesn't work as intended. I tracked down the issue and this patch would be the reason of unintended work. I don't fully understand the patchset so I could be wrong. Please let me ask some questions. On Thu, Nov 07, 2019 at 12:53:33PM -0800, Johannes Weiner wrote: ...snip... > -static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat) > +static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) > { > - struct mem_cgroup *memcg; > - > - memcg = mem_cgroup_iter(root_memcg, NULL, NULL); > - do { > - unsigned long refaults; > - struct lruvec *lruvec; > + struct lruvec *target_lruvec; > + unsigned long refaults; > > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); > - lruvec->refaults = refaults; > - } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL))); > + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); > + target_lruvec->refaults = refaults; Is it correct to just snapshot the refault for the target memcg? I think that we need to snapshot the refault for all the child memcgs since we have traversed all the child memcgs with the refault count that is aggregration of all the child memcgs. If next reclaim happens from the child memcg, workingset transition that is already considered could be considered again. > } > > /* > diff --git a/mm/workingset.c b/mm/workingset.c > index e8212123c1c3..f0885d9f41cd 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > *workingsetp = workingset; > } > > +static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) > +{ > + /* > + * Reclaiming a cgroup means reclaiming all its children in a > + * round-robin fashion. That means that each cgroup has an LRU > + * order that is composed of the LRU orders of its child > + * cgroups; and every page has an LRU position not just in the > + * cgroup that owns it, but in all of that group's ancestors. > + * > + * So when the physical inactive list of a leaf cgroup ages, > + * the virtual inactive lists of all its parents, including > + * the root cgroup's, age as well. > + */ > + do { > + struct lruvec *lruvec; > + > + lruvec = mem_cgroup_lruvec(memcg, pgdat); > + atomic_long_inc(&lruvec->inactive_age); > + } while (memcg && (memcg = parent_mem_cgroup(memcg))); > +} > + > /** > * workingset_eviction - note the eviction of a page from memory > + * @target_memcg: the cgroup that is causing the reclaim > * @page: the page being evicted > * > * Returns a shadow entry to be stored in @page->mapping->i_pages in place > * of the evicted @page so that a later refault can be detected. 
> */ > -void *workingset_eviction(struct page *page) > +void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) > { > struct pglist_data *pgdat = page_pgdat(page); > - struct mem_cgroup *memcg = page_memcg(page); > - int memcgid = mem_cgroup_id(memcg); > unsigned long eviction; > struct lruvec *lruvec; > + int memcgid; > > /* Page is fully exclusive and pins page->mem_cgroup */ > VM_BUG_ON_PAGE(PageLRU(page), page); > VM_BUG_ON_PAGE(page_count(page), page); > VM_BUG_ON_PAGE(!PageLocked(page), page); > > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - eviction = atomic_long_inc_return(&lruvec->inactive_age); > + advance_inactive_age(page_memcg(page), pgdat); > + > + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > + /* XXX: target_memcg can be NULL, go through lruvec */ > + memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); > + eviction = atomic_long_read(&lruvec->inactive_age); > return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); > } > > @@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page) > * @shadow: shadow entry of the evicted page > * > * Calculates and evaluates the refault distance of the previously > - * evicted page in the context of the node it was allocated in. > + * evicted page in the context of the node and the memcg whose memory > + * pressure caused the eviction. > */ > void workingset_refault(struct page *page, void *shadow) > { > + struct mem_cgroup *eviction_memcg; > + struct lruvec *eviction_lruvec; > unsigned long refault_distance; > struct pglist_data *pgdat; > unsigned long active_file; > @@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow) > * would be better if the root_mem_cgroup existed in all > * configurations instead. > */ > - memcg = mem_cgroup_from_id(memcgid); > - if (!mem_cgroup_disabled() && !memcg) > + eviction_memcg = mem_cgroup_from_id(memcgid); > + if (!mem_cgroup_disabled() && !eviction_memcg) > goto out; > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - refault = atomic_long_read(&lruvec->inactive_age); > - active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES); > + eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); > + refault = atomic_long_read(&eviction_lruvec->inactive_age); > + active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); Do we need to use the aggregation LRU count of all the child memcgs? AFAIU, refault here is the aggregation counter of all the related memcgs. Without using the aggregation count for LRU, active_file could be so small than the refault distance and refault cannot happen correctly. Thanks.
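For context, the mechanism Joonsoo's first question is about can be reduced to a small sketch (not kernel code; the field and function names are simplified): each reclaim cycle snapshots the WORKINGSET_ACTIVATE counter, and the next cycle disables active-list protection if activations happened in the meantime. This is why a lruvec whose snapshot is never updated could react twice to the same transition.

/*
 * Simplified sketch of the refaults snapshot/compare cycle, not kernel
 * code. The bookkeeping is reduced to the bare mechanism.
 */
#include <stdio.h>
#include <stdbool.h>

struct lruvec {
	unsigned long workingset_activate;	/* running activation counter */
	unsigned long refaults;			/* snapshot taken at the end of reclaim */
};

static bool workingset_transitioning(const struct lruvec *target)
{
	/* activations since the last snapshot => disable active list protection */
	return target->refaults != target->workingset_activate;
}

static void snapshot_refaults(struct lruvec *target)
{
	target->refaults = target->workingset_activate;
}

int main(void)
{
	struct lruvec lru = { .workingset_activate = 10, .refaults = 10 };

	lru.workingset_activate += 3;	/* some refaulting pages got activated */
	printf("transitioning: %d\n", workingset_transitioning(&lru));

	snapshot_refaults(&lru);	/* reclaim finished, remember the new level */
	printf("transitioning after snapshot: %d\n", workingset_transitioning(&lru));
	return 0;
}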
On Wed, Feb 12, 2020 at 07:28:19PM +0900, Joonsoo Kim wrote:
> Hello, Johannes.
>
> When I tested my patchset on v5.5, I found that my patchset doesn't
> work as intended. I tracked down the issue and this patch would be the
> reason of unintended work. I don't fully understand the patchset so I
> could be wrong. Please let me ask some questions.
>
> On Thu, Nov 07, 2019 at 12:53:33PM -0800, Johannes Weiner wrote:
> ...snip...
> > -static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
> > +static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
> >  {
> > -	struct mem_cgroup *memcg;
> > -
> > -	memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
> > -	do {
> > -		unsigned long refaults;
> > -		struct lruvec *lruvec;
> > +	struct lruvec *target_lruvec;
> > +	unsigned long refaults;
> >
> > -		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > -		refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> > -		lruvec->refaults = refaults;
> > -	} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
> > +	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> > +	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> > +	target_lruvec->refaults = refaults;
>
> Is it correct to just snapshot the refaults for the target memcg? I
> think that we need to snapshot the refaults for all the child memcgs,
> since we have traversed all the child memcgs with a refault count that
> is the aggregation of all the child memcgs. If the next reclaim happens
> from a child memcg, a workingset transition that has already been
> considered could be considered again.

Good catch, you're right! We have to update all cgroups in the tree,
like we used to. However, we need to use lruvec_page_state() instead
of _local, because we do recursive comparisons in shrink_node()! So
it's not a clean revert of that hunk.

Does this patch here fix the problem you are seeing?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c82e9831003f..e7431518db13 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2993,12 +2993,17 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 
 static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 {
-	struct lruvec *target_lruvec;
-	unsigned long refaults;
+	struct mem_cgroup *memcg;
 
-	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
-	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
-	target_lruvec->refaults = refaults;
+	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
+	do {
+		unsigned long refaults;
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
+		lruvec->refaults = refaults;
+	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
 }
 
 /*

> > @@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow)
> >  	 * would be better if the root_mem_cgroup existed in all
> >  	 * configurations instead.
> >  	 */
> > -	memcg = mem_cgroup_from_id(memcgid);
> > -	if (!mem_cgroup_disabled() && !memcg)
> > +	eviction_memcg = mem_cgroup_from_id(memcgid);
> > +	if (!mem_cgroup_disabled() && !eviction_memcg)
> >  		goto out;
> > -	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > -	refault = atomic_long_read(&lruvec->inactive_age);
> > -	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
> > +	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> > +	refault = atomic_long_read(&eviction_lruvec->inactive_age);
> > +	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
>
> Do we need to use the aggregated LRU count of all the child memcgs?
> AFAIU, refault here is the aggregated counter of all the related
> memcgs. Without using the aggregated count for the LRU, active_file
> could be much smaller than the refault distance, and the refault
> would not be detected correctly.

lruvec_page_state() *is* aggregated for all child memcgs (as opposed
to lruvec_page_state_local()), so that comparison looks correct to me.
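For readers following the local vs. recursive distinction, a minimal,
self-contained toy model (illustrative only; the names are made up and this
is not how the kernel implements its per-lruvec statistics): a recursive
counter is visible at every ancestor of the cgroup being charged, while a
local counter is visible only at the cgroup itself.

/* Toy model of recursive vs. local per-cgroup statistics. */
struct toy_cgroup {
	struct toy_cgroup *parent;
	long stat;		/* recursive: includes all descendants */
	long stat_local;	/* local: this cgroup only */
};

static void toy_mod_stat(struct toy_cgroup *cg, long delta)
{
	cg->stat_local += delta;
	/* Propagate upwards so every ancestor sees the subtree total. */
	for (; cg; cg = cg->parent)
		cg->stat += delta;
}

With this model, a count charged to a leaf is reflected in every ancestor's
recursive value, which is why reading NR_ACTIVE_FILE or WORKINGSET_ACTIVATE
through the eviction (or reclaim-root) lruvec covers the whole subtree that
reclaim actually scanned.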
On Thu, Feb 13, 2020 at 3:18 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Feb 12, 2020 at 07:28:19PM +0900, Joonsoo Kim wrote:
> > Hello, Johannes.
> >
> > When I tested my patchset on v5.5, I found that my patchset doesn't
> > work as intended. I tracked down the issue and this patch would be the
> > reason of unintended work. I don't fully understand the patchset so I
> > could be wrong. Please let me ask some questions.
> >
> > On Thu, Nov 07, 2019 at 12:53:33PM -0800, Johannes Weiner wrote:
> > ...snip...
> > > -static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
> > > +static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
> > >  {
> > > -	struct mem_cgroup *memcg;
> > > -
> > > -	memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
> > > -	do {
> > > -		unsigned long refaults;
> > > -		struct lruvec *lruvec;
> > > +	struct lruvec *target_lruvec;
> > > +	unsigned long refaults;
> > >
> > > -		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > -		refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
> > > -		lruvec->refaults = refaults;
> > > -	} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
> > > +	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> > > +	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> > > +	target_lruvec->refaults = refaults;
> >
> > Is it correct to just snapshot the refaults for the target memcg? I
> > think that we need to snapshot the refaults for all the child memcgs,
> > since we have traversed all the child memcgs with a refault count that
> > is the aggregation of all the child memcgs. If the next reclaim happens
> > from a child memcg, a workingset transition that has already been
> > considered could be considered again.
>
> Good catch, you're right! We have to update all cgroups in the tree,
> like we used to. However, we need to use lruvec_page_state() instead
> of _local, because we do recursive comparisons in shrink_node()! So
> it's not a clean revert of that hunk.
>
> Does this patch here fix the problem you are seeing?

I found that my problem comes from my own mistake. Sorry for bothering
you! Anyway, the following hunk looks correct to me.

Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c82e9831003f..e7431518db13 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2993,12 +2993,17 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>
>  static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
>  {
> -	struct lruvec *target_lruvec;
> -	unsigned long refaults;
> +	struct mem_cgroup *memcg;
>
> -	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> -	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> -	target_lruvec->refaults = refaults;
> +	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
> +	do {
> +		unsigned long refaults;
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +		refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
> +		lruvec->refaults = refaults;
> +	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
>  }
>
>  /*
>
> > > @@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow)
> > >  	 * would be better if the root_mem_cgroup existed in all
> > >  	 * configurations instead.
> > >  	 */
> > > -	memcg = mem_cgroup_from_id(memcgid);
> > > -	if (!mem_cgroup_disabled() && !memcg)
> > > +	eviction_memcg = mem_cgroup_from_id(memcgid);
> > > +	if (!mem_cgroup_disabled() && !eviction_memcg)
> > >  		goto out;
> > > -	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > -	refault = atomic_long_read(&lruvec->inactive_age);
> > > -	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
> > > +	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> > > +	refault = atomic_long_read(&eviction_lruvec->inactive_age);
> > > +	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> >
> > Do we need to use the aggregated LRU count of all the child memcgs?
> > AFAIU, refault here is the aggregated counter of all the related
> > memcgs. Without using the aggregated count for the LRU, active_file
> > could be much smaller than the refault distance, and the refault
> > would not be detected correctly.
>
> lruvec_page_state() *is* aggregated for all child memcgs (as opposed
> to lruvec_page_state_local()), so that comparison looks correct to me.

Thanks for informing me of this. I had checked lruvec_page_state() but
not mod_lruvec_state(), so I missed that the counter is an aggregated
value.

Thanks.
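For context on what the per-lruvec refaults snapshot discussed above is
ultimately used for, a simplified sketch of the reclaim-side check (the
helper name is made up; in the kernel this logic lives inside
inactive_list_is_low() in mm/vmscan.c): a difference between the current
recursive WORKINGSET_ACTIVATE count and the snapshot taken at the end of the
previous reclaim cycle means refaulted pages are being re-activated, i.e. the
file workingset is transitioning, and active list protection should be
dropped.

/*
 * Simplified sketch, not the kernel function: compare the current
 * recursive activation count against the value stored by
 * snapshot_refaults() to decide whether a workingset transition
 * is in progress for this lruvec.
 */
static bool workingset_transitioning(struct lruvec *lruvec)
{
	unsigned long refaults;

	refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
	return lruvec->refaults != refaults;
}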
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b86287fa069..a7a0a1a5c8d5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -901,6 +901,11 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1e112b..1e99f7ac1d7e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
-void *workingset_eviction(struct page *page);
+void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8dd601e1fad..527617ee9b73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -853,7 +853,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
  * gets returned with a refcount of 0.
  */
 static int __remove_mapping(struct address_space *mapping, struct page *page,
-			    bool reclaimed)
+			    bool reclaimed, struct mem_cgroup *target_memcg)
 {
 	unsigned long flags;
 	int refcount;
@@ -925,7 +925,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	 */
 	if (reclaimed && page_is_file_cache(page) &&
 	    !mapping_exiting(mapping) && !dax_mapping(mapping))
-		shadow = workingset_eviction(page);
+		shadow = workingset_eviction(page, target_memcg);
 	__delete_from_page_cache(page, shadow);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
@@ -948,7 +948,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page, false)) {
+	if (__remove_mapping(mapping, page, false, NULL)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1426,7 +1426,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			count_vm_event(PGLAZYFREED);
 			count_memcg_page_event(page, PGLAZYFREED);
-		} else if (!mapping || !__remove_mapping(mapping, page, true))
+		} else if (!mapping || !__remove_mapping(mapping, page, true,
+							 sc->target_mem_cgroup))
 			goto keep_locked;
 
 		unlock_page(page);
@@ -2189,6 +2190,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	enum lru_list inactive_lru = file * LRU_FILE;
 	unsigned long inactive, active;
 	unsigned long inactive_ratio;
+	struct lruvec *target_lruvec;
 	unsigned long refaults;
 	unsigned long gb;
 
@@ -2200,8 +2202,9 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	 * is being established. Disable active list protection to get
 	 * rid of the stale workingset quickly.
 	 */
-	refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
-	if (file && lruvec->refaults != refaults) {
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
+	if (file && target_lruvec->refaults != refaults) {
 		inactive_ratio = 0;
 	} else {
 		gb = (inactive + active) >> (30 - PAGE_SHIFT);
@@ -2973,19 +2976,14 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	sc->gfp_mask = orig_mask;
 }
 
-static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
+static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 {
-	struct mem_cgroup *memcg;
-
-	memcg = mem_cgroup_iter(root_memcg, NULL, NULL);
-	do {
-		unsigned long refaults;
-		struct lruvec *lruvec;
+	struct lruvec *target_lruvec;
+	unsigned long refaults;
 
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
-		lruvec->refaults = refaults;
-	} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
+	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
+	target_lruvec->refaults = refaults;
 }
 
 /*
diff --git a/mm/workingset.c b/mm/workingset.c
index e8212123c1c3..f0885d9f41cd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -213,28 +213,53 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
+static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+{
+	/*
+	 * Reclaiming a cgroup means reclaiming all its children in a
+	 * round-robin fashion. That means that each cgroup has an LRU
+	 * order that is composed of the LRU orders of its child
+	 * cgroups; and every page has an LRU position not just in the
+	 * cgroup that owns it, but in all of that group's ancestors.
+	 *
+	 * So when the physical inactive list of a leaf cgroup ages,
+	 * the virtual inactive lists of all its parents, including
+	 * the root cgroup's, age as well.
+	 */
+	do {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		atomic_long_inc(&lruvec->inactive_age);
+	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+}
+
 /**
  * workingset_eviction - note the eviction of a page from memory
+ * @target_memcg: the cgroup that is causing the reclaim
  * @page: the page being evicted
  *
  * Returns a shadow entry to be stored in @page->mapping->i_pages in place
  * of the evicted @page so that a later refault can be detected.
  */
-void *workingset_eviction(struct page *page)
+void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 {
 	struct pglist_data *pgdat = page_pgdat(page);
-	struct mem_cgroup *memcg = page_memcg(page);
-	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
+	int memcgid;
 
 	/* Page is fully exclusive and pins page->mem_cgroup */
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	eviction = atomic_long_inc_return(&lruvec->inactive_age);
+	advance_inactive_age(page_memcg(page), pgdat);
+
+	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	/* XXX: target_memcg can be NULL, go through lruvec */
+	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
+	eviction = atomic_long_read(&lruvec->inactive_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
 
@@ -244,10 +269,13 @@ void *workingset_eviction(struct page *page)
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the node it was allocated in.
+ * evicted page in the context of the node and the memcg whose memory
+ * pressure caused the eviction.
  */
 void workingset_refault(struct page *page, void *shadow)
 {
+	struct mem_cgroup *eviction_memcg;
+	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
 	struct pglist_data *pgdat;
 	unsigned long active_file;
@@ -277,12 +305,12 @@ void workingset_refault(struct page *page, void *shadow)
 	 * would be better if the root_mem_cgroup existed in all
 	 * configurations instead.
 	 */
-	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg)
+	eviction_memcg = mem_cgroup_from_id(memcgid);
+	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	refault = atomic_long_read(&lruvec->inactive_age);
-	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
+	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
+	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
 
 	/*
 	 * Calculate the refault distance
@@ -302,6 +330,17 @@ void workingset_refault(struct page *page, void *shadow)
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
+	/*
+	 * The activation decision for this page is made at the level
+	 * where the eviction occurred, as that is where the LRU order
+	 * during page reclaim is being determined.
+	 *
+	 * However, the cgroup that will own the page is the one that
+	 * is actually experiencing the refault event.
+	 */
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
 
 	/*
@@ -310,10 +349,10 @@ void workingset_refault(struct page *page, void *shadow)
 	 * the memory was available to the page cache.
 	 */
 	if (refault_distance > active_file)
-		goto out;
+		goto out_memcg;
 
 	SetPageActive(page);
-	atomic_long_inc(&lruvec->inactive_age);
+	advance_inactive_age(memcg, pgdat);
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -321,6 +360,9 @@ void workingset_refault(struct page *page, void *shadow)
 		SetPageWorkingset(page);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
+
+out_memcg:
+	mem_cgroup_put(memcg);
 out:
 	rcu_read_unlock();
 }
@@ -332,7 +374,6 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -345,8 +386,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	atomic_long_inc(&lruvec->inactive_age);
+	advance_inactive_age(memcg, page_pgdat(page));
 out:
 	rcu_read_unlock();
 }
We use refault information to determine whether the cache workingset
is stable or transitioning, and dynamically adjust the inactive:active
file LRU ratio so as to maximize protection from one-off cache during
stable periods, and minimize IO during transitions.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect
the local LRU order in the cgroup in which they are occurring. As a
result, cache transitions can take longer in a cgrouped system as the
active pages of sibling cgroups aren't challenged when they should be.

[ Right now, this is somewhat theoretical, because the siblings, under
  continued regular reclaim pressure, should eventually run out of
  inactive pages - and since inactive:active *size* balancing is also
  done on a cgroup-local level, we will challenge the active pages
  eventually in most cases. But the next patch will move that relative
  size enforcement to the reclaim root as well, and then this patch
  here will be necessary to propagate refault pressure to siblings. ]

This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup
that caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.

I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not
just the local order down inside C.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  5 +++
 include/linux/swap.h       |  2 +-
 mm/vmscan.c                | 32 ++++++++---------
 mm/workingset.c            | 72 +++++++++++++++++++++++++++++---------
 4 files changed, 77 insertions(+), 34 deletions(-)
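As a rough illustration of the intended effect, with made-up numbers: global
reclaim (target: the root cgroup) evicts a page from A/B/C while the root
lruvec's inactive_age reads 10,000, so the shadow entry records the root's
memcg ID and that value. Reclaim then evicts another 3,000 file pages from
C's siblings, each eviction advancing the root's inactive_age through
advance_inactive_age(), so when the page refaults the counter reads 13,000
and refault_distance is 3,000. Since the root lruvec's recursive
NR_ACTIVE_FILE, say 50,000 pages across the whole hierarchy, exceeds that
distance, the page is activated in the refaulting cgroup - and because the
resulting WORKINGSET_ACTIVATE counts are also visible recursively at the
reclaim root, the next reclaim cycle sees the transition there and relaxes
active-list protection for the whole subtree, so sibling cgroups' active
pages are challenged as well.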