Message ID | 1554348617-12897-1-git-send-email-huangzhaoyang@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm:workingset use real time to judge activity of the file page |
[Fixup email for Pavel and add Johannes] On Thu 04-04-19 11:30:17, Zhaoyang Huang wrote: > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > In previous implementation, the number of refault pages is used > for judging the refault period of each page, which is not precised as > eviction of other files will be affect a lot on current cache. > We introduce the timestamp into the workingset's entry and refault ratio > to measure the file page's activity. It helps to decrease the affection > of other files(average refault ratio can reflect the view of whole system > 's memory). > The patch is tested on an Android system, which can be described as > comparing the launch time of an application between a huge memory > consumption. The result is launch time decrease 50% and the page fault > during the test decrease 80%. > > Signed-off-by: Zhaoyang Huang <huangzhaoyang@gmail.com> > --- > include/linux/mmzone.h | 2 ++ > mm/workingset.c | 24 +++++++++++++++++------- > 2 files changed, 19 insertions(+), 7 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 32699b2..c38ba0a 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -240,6 +240,8 @@ struct lruvec { > atomic_long_t inactive_age; > /* Refaults at the time of last reclaim cycle */ > unsigned long refaults; > + atomic_long_t refaults_ratio; > + atomic_long_t prev_fault; > #ifdef CONFIG_MEMCG > struct pglist_data *pgdat; > #endif > diff --git a/mm/workingset.c b/mm/workingset.c > index 40ee02c..6361853 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -159,7 +159,7 @@ > NODES_SHIFT + \ > MEM_CGROUP_ID_SHIFT) > #define EVICTION_MASK (~0UL >> EVICTION_SHIFT) > - > +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3) > /* > * Eviction timestamps need to be able to cover the full range of > * actionable refaults. 
However, bits are tight in the radix tree > @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction) > eviction >>= bucket_order; > eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; > eviction = (eviction << NODES_SHIFT) | pgdat->node_id; > + eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES); > eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT); > > return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY); > } > > static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > - unsigned long *evictionp) > + unsigned long *evictionp, unsigned long *prev_jiffp) > { > unsigned long entry = (unsigned long)shadow; > int memcgid, nid; > + unsigned long prev_jiff; > > entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT; > + entry >>= EVICTION_JIFFIES; > + prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES; > nid = entry & ((1UL << NODES_SHIFT) - 1); > entry >>= NODES_SHIFT; > memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1); > @@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > *memcgidp = memcgid; > *pgdat = NODE_DATA(nid); > *evictionp = entry << bucket_order; > + *prev_jiffp = prev_jiff; > } > > /** > @@ -242,8 +247,12 @@ bool workingset_refault(void *shadow) > unsigned long refault; > struct pglist_data *pgdat; > int memcgid; > + unsigned long refault_ratio; > + unsigned long prev_jiff; > + unsigned long avg_refault_time; > + unsigned long refault_time; > > - unpack_shadow(shadow, &memcgid, &pgdat, &eviction); > + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff); > > rcu_read_lock(); > /* > @@ -288,10 +297,11 @@ bool workingset_refault(void *shadow) > * list is not a problem. > */ > refault_distance = (refault - eviction) & EVICTION_MASK; > - > inc_lruvec_state(lruvec, WORKINGSET_REFAULT); > - > - if (refault_distance <= active_file) { > + lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies; > + refault_time = jiffies - prev_jiff; > + avg_refault_time = refault_distance / lruvec->refaults_ratio; > + if (refault_time <= avg_refault_time) { > inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); > rcu_read_unlock(); > return true; > @@ -521,7 +531,7 @@ static int __init workingset_init(void) > * some more pages at runtime, so keep working with up to > * double the initial memory by using totalram_pages as-is. > */ > - timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT; > + timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES; > max_order = fls_long(totalram_pages - 1); > if (max_order > timestamp_bits) > bucket_order = max_order - timestamp_bits; > -- > 1.9.1
On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> In previous implementation, the number of refault pages is used
> for judging the refault period of each page, which is not precised as
> eviction of other files will be affect a lot on current cache.
> We introduce the timestamp into the workingset's entry and refault ratio
> to measure the file page's activity. It helps to decrease the affection
> of other files(average refault ratio can reflect the view of whole system
> 's memory).

I don't understand what exactly you're saying here, can you please
elaborate?

The reason it's using distances instead of absolute time is because
the ordering of the LRU is relative and not based on absolute time.

E.g. if a page is accessed every 500ms, it depends on all other pages
to determine whether this page is at the head or the tail of the LRU.

So when you refault, in order to determine the relative position of
the refaulted page in the LRU, you have to compare it to how fast that
LRU is moving. The absolute refault time, or the average time between
refaults, is not comparable to what's already in memory.

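To make the distance comparison concrete, here is a minimal userspace toy
of the rule described above (a sketch with invented helper names such as
evict() and refault_is_active(), not the kernel implementation): eviction
records the current value of a per-LRU counter, and activation on refault
depends only on how far that counter moved in the meantime, relative to
the size of the active list, never on wall-clock time.

/*
 * Toy model of the distance-based activation rule.  inactive_age stands
 * in for the per-lruvec counter that advances on every eviction.
 */
#include <stdio.h>

static unsigned long inactive_age;

static unsigned long evict(void)
{
        return ++inactive_age;          /* the "shadow entry" timestamp */
}

static int refault_is_active(unsigned long eviction, unsigned long active_file)
{
        unsigned long refault_distance = inactive_age - eviction;

        /* re-admit to the active list only if the page would have fit */
        return refault_distance <= active_file;
}

int main(void)
{
        unsigned long active_file = 1000;
        unsigned long shadow = evict();         /* our page is evicted ...  */
        int i;

        for (i = 0; i < 800; i++)               /* ... then 800 other pages */
                evict();

        printf("activate: %d\n", refault_is_active(shadow, active_file));
        return 0;
}
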
On Fri, Apr 5, 2019 at 12:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > In previous implementation, the number of refault pages is used
> > for judging the refault period of each page, which is not precised as
> > eviction of other files will be affect a lot on current cache.
> > We introduce the timestamp into the workingset's entry and refault ratio
> > to measure the file page's activity. It helps to decrease the affection
> > of other files(average refault ratio can reflect the view of whole system
> > 's memory).
>
> I don't understand what exactly you're saying here, can you please
> elaborate?
>
> The reason it's using distances instead of absolute time is because
> the ordering of the LRU is relative and not based on absolute time.
>
> E.g. if a page is accessed every 500ms, it depends on all other pages
> to determine whether this page is at the head or the tail of the LRU.
>
> So when you refault, in order to determine the relative position of
> the refaulted page in the LRU, you have to compare it to how fast that
> LRU is moving. The absolute refault time, or the average time between
> refaults, is not comparable to what's already in memory.

How do you know how long it took to drop those pages? A quick drop of
a large number of pages will wrongly be treated as a slow drop,
instead of reflecting how tight the situation actually was. That is to
say, 100 pages dropped per millisecond and 100 pages dropped per
second have the same impact on the calculated refault distance, which
gives the page cache less protection in the former scenario and can
introduce page thrashing. This is especially true for global reclaim,
where a round of kswapd reclaim woken up by a high-order allocation or
by a large number of single-page allocations affects all pages on the
node, since they are all accounted on the same lru. This commit
mitigates that by comparing the refault time of a single page against
avg_refault_time = delta_lru_reclaimed_pages / avg_refault_ratio
(where refault_ratio = lru->inactive_ages / time).

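Written out as formulas (this is just a transcription of the arithmetic
in the patch's workingset_refault(), with t denoting jiffies since boot;
the notation is mine, the semantics are the patch's):

\[
\text{refault\_ratio} = \frac{\text{inactive\_age}}{t}, \qquad
\text{avg\_refault\_time} = \frac{\text{refault\_distance}}{\text{refault\_ratio}}, \qquad
\text{refault\_time} = t_{\text{refault}} - t_{\text{evicted}}
\]

\[
\text{activate the page} \iff \text{refault\_time} \le \text{avg\_refault\_time}
\]
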
resend it via the right mailling list and rewrite the comments by ZY. On Thu, Apr 4, 2019 at 3:15 PM Michal Hocko <mhocko@kernel.org> wrote: > > [Fixup email for Pavel and add Johannes] > > On Thu 04-04-19 11:30:17, Zhaoyang Huang wrote: > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > In previous implementation, the number of refault pages is used > > for judging the refault period of each page, which is not precised as > > eviction of other files will be affect a lot on current cache. > > We introduce the timestamp into the workingset's entry and refault ratio > > to measure the file page's activity. It helps to decrease the affection > > of other files(average refault ratio can reflect the view of whole system > > 's memory). > > The patch is tested on an Android system, which can be described as > > comparing the launch time of an application between a huge memory > > consumption. The result is launch time decrease 50% and the page fault > > during the test decrease 80%. > > I don't understand what exactly you're saying here, can you please elaborate? The reason it's using distances instead of absolute time is because the ordering of the LRU is relative and not based on absolute time. E.g. if a page is accessed every 500ms, it depends on all other pages to determine whether this page is at the head or the tail of the LRU. So when you refault, in order to determine the relative position of the refaulted page in the LRU, you have to compare it to how fast that LRU is moving. The absolute refault time, or the average time between refaults, is not comparable to what's already in memory. comment by ZY For current implementation, it is hard to deal with the evaluation of refault period under the scenario of huge dropping of file pages within short time, which maybe caused by a high order allocation or continues single page allocation in KSWAPD. On the contrary, such page which having a big refault_distance will be deemed as INACTIVE wrongly, which will be reclaimed earlier than it should be and lead to page thrashing. So we introduce 'avg_refault_time' & 'refault_ratio' to judge if the refault is a accumulated thing or caused by a tight reclaiming. That is to say, a big refault_distance in a long time would also be inactive as the result of comparing it with ideal time(avg_refault_time: avg_refault_time = delta_lru_reclaimed_pages/ avg_refault_retio (refault_ratio = lru->inactive_ages / time). > > Signed-off-by: Zhaoyang Huang <huangzhaoyang@gmail.com> > > --- > > include/linux/mmzone.h | 2 ++ > > mm/workingset.c | 24 +++++++++++++++++------- > > 2 files changed, 19 insertions(+), 7 deletions(-) > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 32699b2..c38ba0a 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -240,6 +240,8 @@ struct lruvec { > > atomic_long_t inactive_age; > > /* Refaults at the time of last reclaim cycle */ > > unsigned long refaults; > > + atomic_long_t refaults_ratio; > > + atomic_long_t prev_fault; > > #ifdef CONFIG_MEMCG > > struct pglist_data *pgdat; > > #endif > > diff --git a/mm/workingset.c b/mm/workingset.c > > index 40ee02c..6361853 100644 > > --- a/mm/workingset.c > > +++ b/mm/workingset.c > > @@ -159,7 +159,7 @@ > > NODES_SHIFT + \ > > MEM_CGROUP_ID_SHIFT) > > #define EVICTION_MASK (~0UL >> EVICTION_SHIFT) > > - > > +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3) > > /* > > * Eviction timestamps need to be able to cover the full range of > > * actionable refaults. 
However, bits are tight in the radix tree > > @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction) > > eviction >>= bucket_order; > > eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; > > eviction = (eviction << NODES_SHIFT) | pgdat->node_id; > > + eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES); > > eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT); > > > > return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY); > > } > > > > static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > > - unsigned long *evictionp) > > + unsigned long *evictionp, unsigned long *prev_jiffp) > > { > > unsigned long entry = (unsigned long)shadow; > > int memcgid, nid; > > + unsigned long prev_jiff; > > > > entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT; > > + entry >>= EVICTION_JIFFIES; > > + prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES; > > nid = entry & ((1UL << NODES_SHIFT) - 1); > > entry >>= NODES_SHIFT; > > memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1); > > @@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > > *memcgidp = memcgid; > > *pgdat = NODE_DATA(nid); > > *evictionp = entry << bucket_order; > > + *prev_jiffp = prev_jiff; > > } > > > > /** > > @@ -242,8 +247,12 @@ bool workingset_refault(void *shadow) > > unsigned long refault; > > struct pglist_data *pgdat; > > int memcgid; > > + unsigned long refault_ratio; > > + unsigned long prev_jiff; > > + unsigned long avg_refault_time; > > + unsigned long refault_time; > > > > - unpack_shadow(shadow, &memcgid, &pgdat, &eviction); > > + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff); > > > > rcu_read_lock(); > > /* > > @@ -288,10 +297,11 @@ bool workingset_refault(void *shadow) > > * list is not a problem. > > */ > > refault_distance = (refault - eviction) & EVICTION_MASK; > > - > > inc_lruvec_state(lruvec, WORKINGSET_REFAULT); > > - > > - if (refault_distance <= active_file) { > > + lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies; > > + refault_time = jiffies - prev_jiff; > > + avg_refault_time = refault_distance / lruvec->refaults_ratio; > > + if (refault_time <= avg_refault_time) { > > inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); > > rcu_read_unlock(); > > return true; > > @@ -521,7 +531,7 @@ static int __init workingset_init(void) > > * some more pages at runtime, so keep working with up to > > * double the initial memory by using totalram_pages as-is. > > */ > > - timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT; > > + timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES; > > max_order = fls_long(totalram_pages - 1); > > if (max_order > timestamp_bits) > > bucket_order = max_order - timestamp_bits; > > -- > > 1.9.1 > > -- > Michal Hocko > SUSE Labs
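Putting the scenario above into numbers makes the submitter's point
visible. This is a standalone calculation with invented figures (HZ =
1000) that mirrors the patch's arithmetic rather than calling into the
kernel: the same 100,000-page eviction yields the same refault distance
whether it took one second or one hundred seconds, while the proposed
time-based test distinguishes the two.

#include <stdio.h>

int main(void)
{
        unsigned long active_file = 50000;          /* pages on the active list         */
        unsigned long refault_distance = 100000;    /* pages evicted while we were out  */
        unsigned long inactive_age = 10000000;      /* lifetime evictions + activations */
        unsigned long uptime = 1000000;             /* jiffies since boot (~17 min)     */
        unsigned long refault_ratio = inactive_age / uptime;               /* = 10     */
        unsigned long avg_refault_time = refault_distance / refault_ratio; /* = 10000  */
        unsigned long refault_times[] = { 1000, 100000 };  /* 1s burst vs 100s trickle  */
        int i;

        for (i = 0; i < 2; i++)
                printf("refault after %6lu jiffies: distance rule %s, time rule %s\n",
                       refault_times[i],
                       refault_distance <= active_file ? "activates" : "rejects",
                       refault_times[i] <= avg_refault_time ? "activates" : "rejects");
        return 0;
}
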
On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> +++ b/mm/workingset.c
> @@ -159,7 +159,7 @@
>          NODES_SHIFT + \
>          MEM_CGROUP_ID_SHIFT)
>  #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
> -
> +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
>  /*
>   * Eviction timestamps need to be able to cover the full range of
>   * actionable refaults. However, bits are tight in the radix tree
> @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
>          eviction >>= bucket_order;
>          eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>          eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> +        eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
>          eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

... this isn't against current, or even 5.0.

>          entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +        entry >>= EVICTION_JIFFIES;
> +        prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;

These two lines are in the wrong order.  So you're getting
(effectively) a random answer in your 'prev_jiff', which means your
testing isn't thorough enough.  I suspect you're only testing cases
you're expecting to improve, and you aren't testing to make sure that
other cases don't regress.

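Since pack_shadow() puts the coarse timestamp into the lowest bits of the
entry, just above the exceptional-entry marker, the mask presumably has to
happen before the shift when unpacking. A standalone round-trip check with
simplified constants (my own variable names, and the timestamp masked to
its field when packing so that only the ordering of the two lines is being
compared) shows the difference:

/*
 * Userspace round-trip check of the shadow packing in the patch.
 * Not kernel code; constants are simplified.
 */
#include <stdio.h>

#define EVICTION_JIFFIES 8      /* BITS_PER_LONG >> 3 on 64-bit */
#define EXCEPTIONAL_SHIFT 2

int main(void)
{
        unsigned long jiffies = 0x12345678UL;
        unsigned long nid = 3;
        unsigned long field = (jiffies >> EVICTION_JIFFIES) &
                              ((1UL << EVICTION_JIFFIES) - 1);
        unsigned long entry = ((nid << EVICTION_JIFFIES) | field)
                              << EXCEPTIONAL_SHIFT;
        unsigned long v, posted, swapped;

        /* posted order: the timestamp bits are shifted away before the mask */
        v = entry >> EXCEPTIONAL_SHIFT;
        v >>= EVICTION_JIFFIES;
        posted = (v & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;

        /* swapped order: mask out the timestamp bits first, then shift */
        v = entry >> EXCEPTIONAL_SHIFT;
        swapped = (v & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
        v >>= EVICTION_JIFFIES;

        printf("expected 0x%lx, posted order 0x%lx, swapped order 0x%lx\n",
               jiffies & 0xff00UL, posted, swapped);
        return 0;
}
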
On Fri, Apr 05, 2019 at 07:23:46AM +0800, Zhaoyang Huang wrote:
> On Fri, Apr 5, 2019 at 12:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > In previous implementation, the number of refault pages is used
> > > for judging the refault period of each page, which is not precised as
> > > eviction of other files will be affect a lot on current cache.
> > > We introduce the timestamp into the workingset's entry and refault ratio
> > > to measure the file page's activity. It helps to decrease the affection
> > > of other files(average refault ratio can reflect the view of whole system
> > > 's memory).
> >
> > I don't understand what exactly you're saying here, can you please
> > elaborate?
> >
> > The reason it's using distances instead of absolute time is because
> > the ordering of the LRU is relative and not based on absolute time.
> >
> > E.g. if a page is accessed every 500ms, it depends on all other pages
> > to determine whether this page is at the head or the tail of the LRU.
> >
> > So when you refault, in order to determine the relative position of
> > the refaulted page in the LRU, you have to compare it to how fast that
> > LRU is moving. The absolute refault time, or the average time between
> > refaults, is not comparable to what's already in memory.
> How do you know how long it took to drop those pages? A quick drop of
> a large number of pages will wrongly be treated as a slow drop,
> instead of reflecting how tight the situation actually was. That is to
> say, 100 pages dropped per millisecond and 100 pages dropped per
> second have the same impact on the calculated refault distance, which
> gives the page cache less protection in the former scenario and can
> introduce page thrashing. This is especially true for global reclaim,
> where a round of kswapd reclaim woken up by a high-order allocation or
> by a large number of single-page allocations affects all pages on the
> node, since they are all accounted on the same lru. This commit
> mitigates that by comparing the refault time of a single page against
> avg_refault_time = delta_lru_reclaimed_pages / avg_refault_ratio
> (where refault_ratio = lru->inactive_ages / time).

When something like a higher-order allocation drops a large number of
file pages, it's *intentional* that the pages that were evicted before
them become less valuable and less likely to be activated on refault.
There is a finite amount of in-memory LRU space and the pages that
have been evicted the most recently have precedence because they have
the highest proven access frequency.

Of course, when a large amount of the cache that was pushed out in
between is not re-used again, and doesn't claim its space in memory,
it would be great if we could then activate the older pages that *are*
re-used again in their stead.

But that would require us being able to look into the future. When an
old page refaults, we don't know if a younger page is still going to
refault with a shorter refault distance or not. If it won't, then we
were right to activate it. If it will refault, then we put something
on the active list whose reuse frequency is too low to be able to fit
into memory, and we thrash the hottest pages in the system.

As Matthew says, you are fairly randomly making refault activations more aggressive (especially with that timestamp unpacking bug), and while that expectedly boosts workload transition / startup, it comes at the cost of disrupting stable states because you can flood a very active in-ram workingset with completely cold cache pages simply because they refault uniformly wrt each other.
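Rearranging the patch's activation test (simple algebra on the formulas
as posted, not something stated explicitly in the thread) makes that
aggressiveness easy to see, because the size of the active list no longer
appears in the condition at all:

\[
\text{refault\_time} \le \frac{\text{refault\_distance}}{\text{inactive\_age}/t}
\iff
\frac{\text{refault\_distance}}{\text{refault\_time}} \ge \frac{\text{inactive\_age}}{t}
\]

That is, a refaulting page is activated whenever the LRU moved faster
during its absence than the long-term average rate since boot, no matter
how its refault distance compares to the active list.
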
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..c38ba0a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -240,6 +240,8 @@ struct lruvec {
         atomic_long_t inactive_age;
         /* Refaults at the time of last reclaim cycle */
         unsigned long refaults;
+        atomic_long_t refaults_ratio;
+        atomic_long_t prev_fault;
 #ifdef CONFIG_MEMCG
         struct pglist_data *pgdat;
 #endif
diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c..6361853 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -159,7 +159,7 @@
         NODES_SHIFT + \
         MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
-
+#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
 /*
  * Eviction timestamps need to be able to cover the full range of
  * actionable refaults. However, bits are tight in the radix tree
@@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
         eviction >>= bucket_order;
         eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
         eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+        eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
         eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
         return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
-                          unsigned long *evictionp)
+                          unsigned long *evictionp, unsigned long *prev_jiffp)
 {
         unsigned long entry = (unsigned long)shadow;
         int memcgid, nid;
+        unsigned long prev_jiff;
 
         entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+        entry >>= EVICTION_JIFFIES;
+        prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
         nid = entry & ((1UL << NODES_SHIFT) - 1);
         entry >>= NODES_SHIFT;
         memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
         *memcgidp = memcgid;
         *pgdat = NODE_DATA(nid);
         *evictionp = entry << bucket_order;
+        *prev_jiffp = prev_jiff;
 }
 
 /**
@@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
         unsigned long refault;
         struct pglist_data *pgdat;
         int memcgid;
+        unsigned long refault_ratio;
+        unsigned long prev_jiff;
+        unsigned long avg_refault_time;
+        unsigned long refault_time;
 
-        unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+        unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
 
         rcu_read_lock();
         /*
@@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
          * list is not a problem.
          */
         refault_distance = (refault - eviction) & EVICTION_MASK;
-
         inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
-
-        if (refault_distance <= active_file) {
+        lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
+        refault_time = jiffies - prev_jiff;
+        avg_refault_time = refault_distance / lruvec->refaults_ratio;
+        if (refault_time <= avg_refault_time) {
                 inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
                 rcu_read_unlock();
                 return true;
@@ -521,7 +531,7 @@ static int __init workingset_init(void)
          * some more pages at runtime, so keep working with up to
          * double the initial memory by using totalram_pages as-is.
          */
-        timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+        timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
         max_order = fls_long(totalram_pages - 1);
         if (max_order > timestamp_bits)
                 bucket_order = max_order - timestamp_bits;

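For a sense of how much timing information the shadow entry can actually
carry, here is a standalone sketch of the encoding as posted (simplified
constants and my own helper names store() and recover(); as I read it,
with EVICTION_JIFFIES = 8 on 64-bit only bits 8..15 of jiffies survive
the round trip, a 256-jiffy granularity that aliases every 65536 jiffies,
roughly 65 seconds at HZ = 1000):

/*
 * Sketch (not kernel code) of the timestamp field the patch adds to the
 * shadow entry: jiffies >> EVICTION_JIFFIES goes in, prev_jiff comes out.
 */
#include <stdio.h>

#define EVICTION_JIFFIES 8

static unsigned long store(unsigned long jiffies)
{
        /* the bits the shadow entry can keep */
        return (jiffies >> EVICTION_JIFFIES) & ((1UL << EVICTION_JIFFIES) - 1);
}

static unsigned long recover(unsigned long field)
{
        /* what unpack_shadow() reconstructs as prev_jiff */
        return field << EVICTION_JIFFIES;
}

int main(void)
{
        /* the first and third samples decode identically: 65536 apart */
        unsigned long samples[] = { 10000, 70000, 10000 + (1UL << 16) };
        int i;

        for (i = 0; i < 3; i++)
                printf("jiffies %7lu -> prev_jiff %7lu\n",
                       samples[i], recover(store(samples[i])));
        return 0;
}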