| Message ID | 20200520232525.798933-6-hannes@cmpxchg.org |
|---|---|
| State | New, archived |
| Series | mm: balance LRU lists based on relative thrashing v2 |
On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> We activate cache refaults with reuse distances in pages smaller than the size of the total cache. This allows new pages with competitive access frequencies to establish themselves, as well as challenge and potentially displace pages on the active list that have gone cold.
>
> However, that assumes that active cache can only replace other active cache in a competition for the hottest memory. This is not a great default assumption. The page cache might be thrashing while there are enough completely cold and unused anonymous pages sitting around that we'd only have to write to swap once to stop all IO from the cache.
>
> Activate cache refaults when their reuse distance in pages is smaller than the total userspace workingset, including anonymous pages.

Hmm... I'm not sure about the correctness of this change.

IIUC, this patch leads to more activations in the file list, and more activations here will challenge the anon list, since the rotation ratio for the file list will be increased.

However, this change breaks the active/inactive concept of the file list. The active/inactive separation is implemented by in-list refault distance. The anon list size has no direct connection with the refault distance of the file list, so using the anon list size to detect the workingset for file pages breaks the concept.

My suspicion started with this counter example.

Environment:
anon: 500 MB (so hot) / 500 MB (so hot)
file: 50 MB (hot) / 50 MB (cold)

Think about the situation where there is periodical access to another file (100 MB) with low frequency (refault distance is 500 MB).

Without your change, this periodical access doesn't cause thrashing for the cached active file pages, since the refault distance of the periodical access is larger than the size of the active file list. However, with your change, it causes thrashing on the file list.

In fact, I'm not sure that I'm thinking correctly. It's very complicated for me. :) Please let me know if there is a missing piece.

Thanks.
On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > We activate cache refaults with reuse distances in pages smaller than the size of the total cache. This allows new pages with competitive access frequencies to establish themselves, as well as challenge and potentially displace pages on the active list that have gone cold.
> >
> > However, that assumes that active cache can only replace other active cache in a competition for the hottest memory. This is not a great default assumption. The page cache might be thrashing while there are enough completely cold and unused anonymous pages sitting around that we'd only have to write to swap once to stop all IO from the cache.
> >
> > Activate cache refaults when their reuse distance in pages is smaller than the total userspace workingset, including anonymous pages.
>
> Hmm... I'm not sure about the correctness of this change.
>
> IIUC, this patch leads to more activations in the file list, and more activations here will challenge the anon list, since the rotation ratio for the file list will be increased.

Yes.

> However, this change breaks the active/inactive concept of the file list. The active/inactive separation is implemented by in-list refault distance. The anon list size has no direct connection with the refault distance of the file list, so using the anon list size to detect the workingset for file pages breaks the concept.

This is intentional, because there IS a connection: they both take up space in RAM, and they both cost IO to bring back once reclaimed.

When file is refaulting, it means we need to make more space for cache. That space can come from stale active file pages. But what if the active cache is all hot, and meanwhile there are cold anon pages that we could swap out once and then serve everything from RAM?

When file is refaulting, we should find the coldest data that is taking up RAM and kick it out. It doesn't matter whether it's file or anon: the goal is to free up RAM with the least amount of IO risk.

Remember that the file/anon split, and the inactive/active split, are there to optimize reclaim. It doesn't mean that these memory pools are independent from each other.

The file list is split in two because of use-once cache. The anon and file lists are split because of different IO patterns, because we may not have swap, etc. But once we are out of use-once cache, have swap space available, and have corrected for the different cost of IO, there needs to be a relative order between all pages in the system to find the optimal candidates to reclaim.

> My suspicion started with this counter example.
>
> Environment:
> anon: 500 MB (so hot) / 500 MB (so hot)
> file: 50 MB (hot) / 50 MB (cold)
>
> Think about the situation where there is periodical access to another file (100 MB) with low frequency (refault distance is 500 MB).
>
> Without your change, this periodical access doesn't cause thrashing for the cached active file pages, since the refault distance of the periodical access is larger than the size of the active file list. However, with your change, it causes thrashing on the file list.

It doesn't cause thrashing. It causes scanning, because that 100M file IS thrashing: with or without my patch, that refault IO is occurring.

What this patch acknowledges is that the 100M file COULD fit fully into memory and not require any IO to serve, IFF 100M of the active file or anon pages were cold and could be reclaimed or swapped out.

In your example, the anon set is hot. We'll scan it slowly (at the rate of IO from the other file) and rotate the pages that are in use - which would be all of them. Likewise for the file - there will be some deactivations, but mark_page_accessed() or the second chance algorithm in page_check_references() for mapped pages will keep the hottest pages active.

In a slightly modified example, 400M of the anon set is hot and 100M cold. Without my patch, we would never look for them and the second file would be IO-bound forever. After my patch, we would scan anon, eventually find the cold pages, swap them out, and then serve the entire workingset from memory.

Does it cause more scanning than before in your scenario? Yes, but we don't even know it's your scenario instead of mine until we actually sample references of all memory. Not scanning is a false stable state. And your scenario could easily change over time. Even if the amount of repeatedly accessed pages stays larger than memory, and will always require IO to serve, the relative access frequencies could change. Some pages could become hotter, others colder. Without scanning, we wouldn't adapt the LRU ordering and would cause more IO than necessary.
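To make the check under discussion concrete, here is a minimal standalone model of the activation decision (a sketch for illustration only: sizes are in pages, and the function name and parameters are made up for this example; the real logic lives in workingset_refault() in mm/workingset.c):

    #include <stdbool.h>

    /*
     * Model of the refault activation decision. Before the patch in
     * this series, only the active file list was counted; the patch
     * adds the anon lists when swap is available, so refaulting cache
     * can challenge the anon set too.
     */
    static bool should_activate(unsigned long refault_distance,
                                unsigned long active_file,
                                unsigned long inactive_anon,
                                unsigned long active_anon,
                                bool have_swap)
    {
            unsigned long workingset_size = active_file;

            /* Cache can only displace anon if we are able to swap. */
            if (have_swap)
                    workingset_size += inactive_anon + active_anon;

            return refault_distance <= workingset_size;
    }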
On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > We activate cache refaults with reuse distances in pages smaller than the size of the total cache. This allows new pages with competitive access frequencies to establish themselves, as well as challenge and potentially displace pages on the active list that have gone cold.
> > >
> > > However, that assumes that active cache can only replace other active cache in a competition for the hottest memory. This is not a great default assumption. The page cache might be thrashing while there are enough completely cold and unused anonymous pages sitting around that we'd only have to write to swap once to stop all IO from the cache.
> > >
> > > Activate cache refaults when their reuse distance in pages is smaller than the total userspace workingset, including anonymous pages.
> >
> > Hmm... I'm not sure about the correctness of this change.
> >
> > IIUC, this patch leads to more activations in the file list, and more activations here will challenge the anon list, since the rotation ratio for the file list will be increased.
>
> Yes.
>
> > However, this change breaks the active/inactive concept of the file list. The active/inactive separation is implemented by in-list refault distance. The anon list size has no direct connection with the refault distance of the file list, so using the anon list size to detect the workingset for file pages breaks the concept.
>
> This is intentional, because there IS a connection: they both take up space in RAM, and they both cost IO to bring back once reclaimed.

I know that. This is the reason I said 'no direct connection'. The anon list size is directly related to the *possible* file list size. But the active/inactive separation in one list is firstly based on the *current* list size rather than the possible list size. Adding the anon list size to detect the workingset means using the possible list size, and I think that's wrong.

> When file is refaulting, it means we need to make more space for cache. That space can come from stale active file pages. But what if the active cache is all hot, and meanwhile there are cold anon pages that we could swap out once and then serve everything from RAM?
>
> When file is refaulting, we should find the coldest data that is taking up RAM and kick it out. It doesn't matter whether it's file or anon: the goal is to free up RAM with the least amount of IO risk.

I understand your purpose and agree with it. We need to find a solution. To achieve your goal, my suggestion is:

- refault distance < active file: do activation and add up IO cost
- refault distance < active file + anon list: add up IO cost

This doesn't break workingset detection on the file list, and it challenges the anon list to the same degree as you did (see the sketch after this message).

> Remember that the file/anon split, and the inactive/active split, are there to optimize reclaim. It doesn't mean that these memory pools are independent from each other.
>
> The file list is split in two because of use-once cache. The anon and file lists are split because of different IO patterns, because we may not have swap, etc. But once we are out of use-once cache, have swap space available, and have corrected for the different cost of IO, there needs to be a relative order between all pages in the system to find the optimal candidates to reclaim.

> > My suspicion started with this counter example.
> >
> > Environment:
> > anon: 500 MB (so hot) / 500 MB (so hot)
> > file: 50 MB (hot) / 50 MB (cold)
> >
> > Think about the situation where there is periodical access to another file (100 MB) with low frequency (refault distance is 500 MB).
> >
> > Without your change, this periodical access doesn't cause thrashing for the cached active file pages, since the refault distance of the periodical access is larger than the size of the active file list. However, with your change, it causes thrashing on the file list.
>
> It doesn't cause thrashing. It causes scanning, because that 100M file IS thrashing: with or without my patch, that refault IO is occurring.

It could cause thrashing with your patch. Without the patch, the current logic tries to find the hottest file pages that fit into the current file list size and protects them successfully. Assume that the access distance of the 50 MB of hot file pages is 60 MB, which is less than the whole file list size but larger than the inactive list size. Without your patch, the 50 MB (hot) pages are not evicted at all. All these hot pages will be protected from the 100 MB of low access frequency pages. The 100 MB of low access frequency pages will be refaulted repeatedly, but that's correct behaviour.

However, with your patch, the 50 MB (hot) file pages are deactivated due to the newly added file pages with low access frequency. And then, since the access distance of the 50 MB (hot) pages is larger than the inactive list size, they could not get a second chance and could finally be evicted.

I think that this is thrashing, since low access frequency pages that don't fit the current file list size push out the high access frequency pages that do fit, and it would happen again and again.

Maybe the logic can be corrected if the patch considers the inactive age of the anon list, but I think that my suggestion above would be enough.

> What this patch acknowledges is that the 100M file COULD fit fully into memory and not require any IO to serve, IFF 100M of the active file or anon pages were cold and could be reclaimed or swapped out.
>
> In your example, the anon set is hot. We'll scan it slowly (at the rate of IO from the other file) and rotate the pages that are in use - which would be all of them. Likewise for the file - there will be some deactivations, but mark_page_accessed() or the second chance algorithm in page_check_references() for mapped pages will keep the hottest pages active.
>
> In a slightly modified example, 400M of the anon set is hot and 100M cold. Without my patch, we would never look for them and the second file would be IO-bound forever. After my patch, we would scan anon, eventually find the cold pages, swap them out, and then serve the entire workingset from memory.

Again, I agree with your goal. What I don't agree with is the implementation to achieve the goal.

Thanks.
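Here is a minimal self-contained sketch of the two-tier check suggested above (hypothetical, not a posted patch: the enum and function are made-up names for illustration, and locking and statistics accounting are elided):

    /*
     * Sketch of the suggestion: activate only within the current file
     * workingset, but still charge IO cost when the refault could fit
     * if anon ceded memory.
     */
    enum refault_action {
            REFAULT_IGNORE,     /* couldn't fit even if anon were reclaimed */
            REFAULT_CHARGE_IO,  /* charge IO cost only: challenge the anon list */
            REFAULT_ACTIVATE,   /* fits the current file workingset: activate */
    };

    static enum refault_action classify_refault(unsigned long refault_distance,
                                                unsigned long active_file,
                                                unsigned long anon)
    {
            if (refault_distance <= active_file)
                    return REFAULT_ACTIVATE;    /* IO cost is charged here too */
            if (refault_distance <= active_file + anon)
                    return REFAULT_CHARGE_IO;
            return REFAULT_IGNORE;
    }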
On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > We activate cache refaults with reuse distances in pages smaller than the size of the total cache. This allows new pages with competitive access frequencies to establish themselves, as well as challenge and potentially displace pages on the active list that have gone cold.
> > > >
> > > > However, that assumes that active cache can only replace other active cache in a competition for the hottest memory. This is not a great default assumption. The page cache might be thrashing while there are enough completely cold and unused anonymous pages sitting around that we'd only have to write to swap once to stop all IO from the cache.
> > > >
> > > > Activate cache refaults when their reuse distance in pages is smaller than the total userspace workingset, including anonymous pages.
> > >
> > > Hmm... I'm not sure about the correctness of this change.
> > >
> > > IIUC, this patch leads to more activations in the file list, and more activations here will challenge the anon list, since the rotation ratio for the file list will be increased.
> >
> > Yes.
> >
> > > However, this change breaks the active/inactive concept of the file list. The active/inactive separation is implemented by in-list refault distance. The anon list size has no direct connection with the refault distance of the file list, so using the anon list size to detect the workingset for file pages breaks the concept.
> >
> > This is intentional, because there IS a connection: they both take up space in RAM, and they both cost IO to bring back once reclaimed.
>
> I know that. This is the reason I said 'no direct connection'. The anon list size is directly related to the *possible* file list size. But the active/inactive separation in one list is firstly based on the *current* list size rather than the possible list size. Adding the anon list size to detect the workingset means using the possible list size, and I think that's wrong.

Hm, so you're saying we should try to grow the page cache, but maintain an inactive/active balance within the cache based on its current size in case growing is not possible.

I think it would be implementable*, but I'm not quite convinced that it's worth distinguishing. We're talking about an overcommitted workingset and thereby an inherently unstable state that may thrash purely based on timing differences anyway. See below.

*It would require another page flag to tell whether a refaulting cache page has challenged the anon set once (transitioning) or repeatedly (thrashing), as we currently use the active state for that. If we would repurpose PG_workingset to tell the first from the second refault, we'd need a new flag to mark a page for memstall accounting.

> > When file is refaulting, it means we need to make more space for cache. That space can come from stale active file pages. But what if the active cache is all hot, and meanwhile there are cold anon pages that we could swap out once and then serve everything from RAM?
> >
> > When file is refaulting, we should find the coldest data that is taking up RAM and kick it out. It doesn't matter whether it's file or anon: the goal is to free up RAM with the least amount of IO risk.
> I understand your purpose and agree with it. We need to find a solution. To achieve your goal, my suggestion is:
>
> - refault distance < active file: do activation and add up IO cost
> - refault distance < active file + anon list: add up IO cost
>
> This doesn't break workingset detection on the file list, and it challenges the anon list to the same degree as you did.

Unfortunately, it doesn't challenge the anon set. We'll stay in cache trimming mode, where the IO cost balance doesn't have any effect.

We would need to

1. activate on distance < active file
2. force reclaim to scan anon when distance < active file + anon
3. record thrashing and IO cost when distance < active file + anon and it's the second refault of the page.

Something like this, where vmscan would scan anon when WORKINGSET_RESTORE or WORKINGSET_EXPAND events are occurring:

diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..41dde6274fff 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -275,9 +275,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 void workingset_refault(struct page *page, void *shadow)
 {
 	struct mem_cgroup *eviction_memcg;
+	unsigned long active_file, anon;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
-	unsigned long workingset_size;
 	struct pglist_data *pgdat;
 	struct mem_cgroup *memcg;
 	unsigned long eviction;
@@ -330,7 +330,7 @@ void workingset_refault(struct page *page, void *shadow)
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
 	/*
-	 * The activation decision for this page is made at the level
+	 * The distance decisions for this page is made at the level
 	 * where the eviction occurred, as that is where the LRU order
 	 * during page reclaim is being determined.
 	 *
@@ -344,32 +344,52 @@ void workingset_refault(struct page *page, void *shadow)
 
 	/*
 	 * Compare the distance to the existing workingset size. We
-	 * don't activate pages that couldn't stay resident even if
-	 * all the memory was available to the page cache. Whether
-	 * cache can compete with anon or not depends on having swap.
+	 * ignore pages that couldn't stay resident even if all the
+	 * memory was available to the page cache. Whether cache can
+	 * compete with anon or not depends on having swap.
 	 */
-	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+	anon = 0;
 	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_ANON);
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_ACTIVE_ANON);
+		anon += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
+		anon += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
 	}
-	if (refault_distance > workingset_size)
-		goto out;
 
-	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
-	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	if (refault_distance > active_file + anon)
+		goto out;
 
-	/* Page was active prior to eviction */
+	/*
+	 * Page has already challenged the workingset before, and it's
+	 * refaulting again: we're not transitioning out old cache,
+	 * we're thrashing and need to grow. Record the IO to tip the
+	 * reclaim balance and mark the page for memstall detection.
+	 */
 	if (workingset) {
-		SetPageWorkingset(page);
+		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
 		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
 		spin_unlock_irq(&page_pgdat(page)->lru_lock);
-		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+
+		SetPageThrashing(page);
+	}
+
+	SetPageWorkingset(page);
+
+	/*
+	 * Activate right away if page can compete with the existing
+	 * active set given the *current* size of the page cache.
+	 *
+	 * Otherwise, the page cache needs to grow to house this
+	 * page. Tell reclaim to rebalance against anonymous memory.
+	 */
+	if (refault_distance <= active_file) {
+		SetPageActive(page);
+		advance_inactive_age(memcg, pgdat);
+		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+	} else {
+		inc_lruvec_state(lruvec, WORKINGSET_EXPAND);
 	}
 out:
 	rcu_read_unlock();

> > Remember that the file/anon split, and the inactive/active split, are there to optimize reclaim. It doesn't mean that these memory pools are independent from each other.
> >
> > The file list is split in two because of use-once cache. The anon and file lists are split because of different IO patterns, because we may not have swap, etc. But once we are out of use-once cache, have swap space available, and have corrected for the different cost of IO, there needs to be a relative order between all pages in the system to find the optimal candidates to reclaim.
>
> > > My suspicion started with this counter example.
> > >
> > > Environment:
> > > anon: 500 MB (so hot) / 500 MB (so hot)
> > > file: 50 MB (hot) / 50 MB (cold)
> > >
> > > Think about the situation where there is periodical access to another file (100 MB) with low frequency (refault distance is 500 MB).
> > >
> > > Without your change, this periodical access doesn't cause thrashing for the cached active file pages, since the refault distance of the periodical access is larger than the size of the active file list. However, with your change, it causes thrashing on the file list.
> >
> > It doesn't cause thrashing. It causes scanning, because that 100M file IS thrashing: with or without my patch, that refault IO is occurring.
>
> It could cause thrashing with your patch. Without the patch, the current logic tries to find the hottest file pages that fit into the current file list size and protects them successfully. Assume that the access distance of the 50 MB of hot file pages is 60 MB, which is less than the whole file list size but larger than the inactive list size. Without your patch, the 50 MB (hot) pages are not evicted at all. All these hot pages will be protected from the 100 MB of low access frequency pages. The 100 MB of low access frequency pages will be refaulted repeatedly, but that's correct behaviour.

Hm, something doesn't quite add up. Why is the 50M hot set evicted with my patch?

With a 60M access distance and a 100M page cache, they might get deactivated, but then activated back to the head of the active list on their next access. They should get scanned but not actually reclaimed.

The only way they could get reclaimed is if their access distance ends up bigger than the file cache. But if that's the case, then the workingset is overcommitted, and none of the pages qualify for reclaim protection. Picking a subset to protect against the rest is arbitrary.
For example, if you started out with 50M of free space and an empty cache and started accessing the 50M and 100M files simultaneously, with the access distances you specified, none of them would get activated. You would have the same behavior with or without my patch.

> However, with your patch, the 50 MB (hot) file pages are deactivated due to the newly added file pages with low access frequency. And then, since the access distance of the 50 MB (hot) pages is larger than the inactive list size, they could not get a second chance and could finally be evicted.

Ah okay, that's a different concern. We DO get two references, but we fail to detect them properly.

First, we don't enforce the small inactive list size when there is refault turnover like this. Deactivation is permanently allowed, which means the lists are scanned in proportion to their size and should converge on a 50/50 balance. The 50M pages should get referenced again in the second half of their in-memory time and get re-activated.

It's somewhat debatable whether that's the right thing to do when access distances are bunched up instead of uniform. Should a page that's accessed twice with a 10M distance, and then not again for another 80M, stay resident in a 60M cache? It's not entirely unreasonable to say a page needs to get accessed in each half of its in-memory time to be considered still active, to ensure a reasonable maximum distance between reference clusters.

But if this is an actual concern, shouldn't we instead improve the accuracy of the LRU ordering? For example, if we're refaulting and have dissolved the inactive/active separation, shouldn't shrink_active_list() and shrink_page_list() actually rotate pages with PG_referenced or pte references?

We ignore those references right now because we rely on second chance to catch them; and respecting them would incur too much CPU overhead when all we try to do is funnel one-off cache through the list. But if we don't trust second chance with this, and we're already having refault IO and CPU is less of a concern, that seems like an obvious change to make.
On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > On Thu, May 21, 2020 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > We activate cache refaults with reuse distances in pages smaller than the size of the total cache. This allows new pages with competitive access frequencies to establish themselves, as well as challenge and potentially displace pages on the active list that have gone cold.
> > > > >
> > > > > However, that assumes that active cache can only replace other active cache in a competition for the hottest memory. This is not a great default assumption. The page cache might be thrashing while there are enough completely cold and unused anonymous pages sitting around that we'd only have to write to swap once to stop all IO from the cache.
> > > > >
> > > > > Activate cache refaults when their reuse distance in pages is smaller than the total userspace workingset, including anonymous pages.
> > > >
> > > > Hmm... I'm not sure about the correctness of this change.
> > > >
> > > > IIUC, this patch leads to more activations in the file list, and more activations here will challenge the anon list, since the rotation ratio for the file list will be increased.
> > >
> > > Yes.
> > >
> > > > However, this change breaks the active/inactive concept of the file list. The active/inactive separation is implemented by in-list refault distance. The anon list size has no direct connection with the refault distance of the file list, so using the anon list size to detect the workingset for file pages breaks the concept.
> > >
> > > This is intentional, because there IS a connection: they both take up space in RAM, and they both cost IO to bring back once reclaimed.
> >
> > I know that. This is the reason I said 'no direct connection'. The anon list size is directly related to the *possible* file list size. But the active/inactive separation in one list is firstly based on the *current* list size rather than the possible list size. Adding the anon list size to detect the workingset means using the possible list size, and I think that's wrong.
>
> Hm, so you're saying we should try to grow the page cache, but maintain an inactive/active balance within the cache based on its current size in case growing is not possible.
>
> I think it would be implementable*, but I'm not quite convinced that it's worth distinguishing. We're talking about an overcommitted workingset and thereby an inherently unstable state that may thrash purely based on timing differences anyway. See below.

I think that this patch wrongly judges the overcommitted workingset. See the new example below in this reply.

> *It would require another page flag to tell whether a refaulting cache page has challenged the anon set once (transitioning) or repeatedly (thrashing), as we currently use the active state for that. If we would repurpose PG_workingset to tell the first from the second refault, we'd need a new flag to mark a page for memstall accounting.

I don't understand why a new flag is needed. Whenever we find that a challenge is needed (dist < active + anon), we need to add up the IO cost.
> > > When file is refaulting, it means we need to make more space for cache. That space can come from stale active file pages. But what if the active cache is all hot, and meanwhile there are cold anon pages that we could swap out once and then serve everything from RAM?
> > >
> > > When file is refaulting, we should find the coldest data that is taking up RAM and kick it out. It doesn't matter whether it's file or anon: the goal is to free up RAM with the least amount of IO risk.
> >
> > I understand your purpose and agree with it. We need to find a solution. To achieve your goal, my suggestion is:
> >
> > - refault distance < active file: do activation and add up IO cost
> > - refault distance < active file + anon list: add up IO cost
> >
> > This doesn't break workingset detection on the file list, and it challenges the anon list to the same degree as you did.
>
> Unfortunately, it doesn't challenge the anon set. We'll stay in cache trimming mode, where the IO cost balance doesn't have any effect.

Ah... I missed that. I think that we can avoid cache trimming mode easily. Like WORKINGSET_REFAULT, we need to snapshot WORKINGSET_EXPAND. And if we find a change in it, we can skip cache trimming mode.

> We would need to
>
> 1. activate on distance < active file
> 2. force reclaim to scan anon when distance < active file + anon
> 3. record thrashing and IO cost when distance < active file + anon and it's the second refault of the page.
>
> Something like this, where vmscan would scan anon when WORKINGSET_RESTORE or WORKINGSET_EXPAND events are occurring:
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index d481ea452eeb..41dde6274fff 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -275,9 +275,9 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
>  void workingset_refault(struct page *page, void *shadow)
>  {
>  	struct mem_cgroup *eviction_memcg;
> +	unsigned long active_file, anon;
>  	struct lruvec *eviction_lruvec;
>  	unsigned long refault_distance;
> -	unsigned long workingset_size;
>  	struct pglist_data *pgdat;
>  	struct mem_cgroup *memcg;
>  	unsigned long eviction;
> @@ -330,7 +330,7 @@ void workingset_refault(struct page *page, void *shadow)
>  	refault_distance = (refault - eviction) & EVICTION_MASK;
>  
>  	/*
> -	 * The activation decision for this page is made at the level
> +	 * The distance decisions for this page is made at the level
>  	 * where the eviction occurred, as that is where the LRU order
>  	 * during page reclaim is being determined.
>  	 *
> @@ -344,32 +344,52 @@ void workingset_refault(struct page *page, void *shadow)
>  
>  	/*
>  	 * Compare the distance to the existing workingset size. We
> -	 * don't activate pages that couldn't stay resident even if
> -	 * all the memory was available to the page cache. Whether
> -	 * cache can compete with anon or not depends on having swap.
> +	 * ignore pages that couldn't stay resident even if all the
> +	 * memory was available to the page cache. Whether cache can
> +	 * compete with anon or not depends on having swap.
>  	 */
> -	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> +	active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> +	anon = 0;
>  	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
> -		workingset_size += lruvec_page_state(eviction_lruvec,
> -						     NR_INACTIVE_ANON);
> -		workingset_size += lruvec_page_state(eviction_lruvec,
> -						     NR_ACTIVE_ANON);
> +		anon += lruvec_page_state(eviction_lruvec, NR_INACTIVE_ANON);
> +		anon += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON);
>  	}
> -	if (refault_distance > workingset_size)
> -		goto out;
>  
> -	SetPageActive(page);
> -	advance_inactive_age(memcg, pgdat);
> -	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +	if (refault_distance > active_file + anon)
> +		goto out;
>  
> -	/* Page was active prior to eviction */
> +	/*
> +	 * Page has already challenged the workingset before, and it's
> +	 * refaulting again: we're not transitioning out old cache,
> +	 * we're thrashing and need to grow. Record the IO to tip the
> +	 * reclaim balance and mark the page for memstall detection.
> +	 */
>  	if (workingset) {
> -		SetPageWorkingset(page);
> +		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +
>  		/* XXX: Move to lru_cache_add() when it supports new vs putback */
>  		spin_lock_irq(&page_pgdat(page)->lru_lock);
>  		lru_note_cost_page(page);
>  		spin_unlock_irq(&page_pgdat(page)->lru_lock);
> -		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
> +
> +		SetPageThrashing(page);
> +	}
> +
> +	SetPageWorkingset(page);
> +
> +	/*
> +	 * Activate right away if page can compete with the existing
> +	 * active set given the *current* size of the page cache.
> +	 *
> +	 * Otherwise, the page cache needs to grow to house this
> +	 * page. Tell reclaim to rebalance against anonymous memory.
> +	 */
> +	if (refault_distance <= active_file) {
> +		SetPageActive(page);
> +		advance_inactive_age(memcg, pgdat);
> +		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> +	} else {
> +		inc_lruvec_state(lruvec, WORKINGSET_EXPAND);
>  	}
>  out:
>  	rcu_read_unlock();

> > > Remember that the file/anon split, and the inactive/active split, are there to optimize reclaim. It doesn't mean that these memory pools are independent from each other.
> > >
> > > The file list is split in two because of use-once cache. The anon and file lists are split because of different IO patterns, because we may not have swap, etc. But once we are out of use-once cache, have swap space available, and have corrected for the different cost of IO, there needs to be a relative order between all pages in the system to find the optimal candidates to reclaim.
> >
> > > > My suspicion started with this counter example.
> > > >
> > > > Environment:
> > > > anon: 500 MB (so hot) / 500 MB (so hot)
> > > > file: 50 MB (hot) / 50 MB (cold)
> > > >
> > > > Think about the situation where there is periodical access to another file (100 MB) with low frequency (refault distance is 500 MB).
> > > >
> > > > Without your change, this periodical access doesn't cause thrashing for the cached active file pages, since the refault distance of the periodical access is larger than the size of the active file list. However, with your change, it causes thrashing on the file list.
> > >
> > > It doesn't cause thrashing. It causes scanning, because that 100M file IS thrashing: with or without my patch, that refault IO is occurring.
> >
> > It could cause thrashing with your patch.
> > Without the patch, the current logic tries to find the hottest file pages that fit into the current file list size and protects them successfully. Assume that the access distance of the 50 MB of hot file pages is 60 MB, which is less than the whole file list size but larger than the inactive list size. Without your patch, the 50 MB (hot) pages are not evicted at all. All these hot pages will be protected from the 100 MB of low access frequency pages. The 100 MB of low access frequency pages will be refaulted repeatedly, but that's correct behaviour.
>
> Hm, something doesn't quite add up. Why is the 50M hot set evicted with my patch?

Thanks for the kind explanation. I read it all, and I found that I was confused before. Please let me correct the example.

Environment:
anon: 500 MB (hot) / 500 MB (hot)
file: 50 MB (so hot) / 50 MB (dummy)

I will call the 50 MB of hot file pages F(H). Let's assume that periodical access to another file (500 MB) is started. That file consists of 5 parts, each one 100 MB. I will call them P1, P2, ..., P5.

The problem will occur with the following access pattern:

F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...

With this access pattern, the access distances of F(H) and Pn are (the arithmetic is spelled out after this message):

F(H) = 150 MB
Pn = 750 MB

Without your patch, F(H) is kept in memory, since deactivation would not happen. However, with your patch, deactivation happens, since Pn's refault distance is less than 'active file + anon'. In the end, F(H) would finally be evicted.

> With a 60M access distance and a 100M page cache, they might get deactivated, but then activated back to the head of the active list on their next access. They should get scanned but not actually reclaimed.

Sorry about the wrong example. I have fixed the example above.

> The only way they could get reclaimed is if their access distance ends up bigger than the file cache. But if that's the case, then the workingset is overcommitted, and none of the pages qualify for reclaim protection. Picking a subset to protect against the rest is arbitrary.

In the fixed example, although the other file (500 MB) is repeatedly accessed, it's not the workingset. If we had a unified list (file + anon), the access distance of Pn would be larger than the whole memory size. Therefore, it's not an overcommitted workingset, and this patch wrongly tries to activate it. As I said before, without considering the inactive_age of the anon list, this calculation cannot be correct.

> For example, if you started out with 50M of free space and an empty cache and started accessing the 50M and 100M files simultaneously, with the access distances you specified, none of them would get activated. You would have the same behavior with or without my patch.

It's not a good example for my case. The purpose of the active/inactive separation is to protect previous hot pages until new evidence is found. If we access the 50M and 100M files simultaneously on 100M of free space, none of them are activated. However, if the 50M is in the active list and then access to the 50M and 100M happens simultaneously, we need to protect the previous 50M. My example starts with 50M in the active list, so it should be protected until new evidence is found. Pn is not new evidence.

> > However, with your patch, the 50 MB (hot) file pages are deactivated due to the newly added file pages with low access frequency. And then, since the access distance of the 50 MB (hot) pages is larger than the inactive list size, they could not get a second chance and could finally be evicted.
>
> Ah okay, that's a different concern. We DO get two references, but we fail to detect them properly.

As I said before, I was confused here. Your following explanation is correct and I understand it. Please focus on my fixed example.

Thanks.
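For reference, here is how the distances in the corrected example work out (simple arithmetic from the access pattern above, using the sizes as given in the example):

    access distance of F(H) = 100 MB (one part of P) + 50 MB (F(H) itself) = 150 MB
    access distance of Pn   = 500 MB (P1..P5) + 5 x 50 MB (interleaved F(H)) = 750 MB

    patched check for Pn: 750 MB < active file + anon = 50 MB + 1000 MB = 1050 MB

So every Pn refault qualifies under the patched condition and keeps challenging F(H), even though P as a whole could never be cached alongside the hot anon set.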
On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
>
> > *It would require another page flag to tell whether a refaulting cache page has challenged the anon set once (transitioning) or repeatedly (thrashing), as we currently use the active state for that. If we would repurpose PG_workingset to tell the first from the second refault, we'd need a new flag to mark a page for memstall accounting.
>
> I don't understand why a new flag is needed. Whenever we find that a challenge is needed (dist < active + anon), we need to add up the IO cost.

It sounds like this was cleared up later on in the email.

> > > It could cause thrashing with your patch. Without the patch, the current logic tries to find the hottest file pages that fit into the current file list size and protects them successfully. Assume that the access distance of the 50 MB of hot file pages is 60 MB, which is less than the whole file list size but larger than the inactive list size. Without your patch, the 50 MB (hot) pages are not evicted at all. All these hot pages will be protected from the 100 MB of low access frequency pages. The 100 MB of low access frequency pages will be refaulted repeatedly, but that's correct behaviour.
> >
> > Hm, something doesn't quite add up. Why is the 50M hot set evicted with my patch?
>
> Thanks for the kind explanation. I read it all, and I found that I was confused before. Please let me correct the example.
>
> Environment:
> anon: 500 MB (hot) / 500 MB (hot)
> file: 50 MB (so hot) / 50 MB (dummy)
>
> I will call the 50 MB of hot file pages F(H). Let's assume that periodical access to another file (500 MB) is started. That file consists of 5 parts, each one 100 MB. I will call them P1, P2, ..., P5.
>
> The problem will occur with the following access pattern:
>
> F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
>
> With this access pattern, the access distances of F(H) and Pn are:
>
> F(H) = 150 MB
> Pn = 750 MB
>
> Without your patch, F(H) is kept in memory, since deactivation would not happen. However, with your patch, deactivation happens, since Pn's refault distance is less than 'active file + anon'. In the end, F(H) would finally be evicted.

Okay, this example makes sense to me.

I do think P needs to challenge the workingset - at first. P could easily fit into memory by itself if anon and active_file were cold, so we need to challenge them to find out that they're hot. As you can see, if anon and F(H) were completely unused, the current behavior would be incorrect.

The current behavior would do the same in a cache-only example:

anon = 1G (unreclaimable)
file = 500M (hot) / 300M (dummy)

P = 400M

F(H) -> P1 -> F(H) -> P2 ...

If F(H) is already active, the first P refaults would have a distance of 100M, thereby challenging F(H). As F(H) reactivates, its size will be reflected in the refault distances, pushing them beyond the size of memory that is available to the cache: 600M refault distance > 500M active cache, or 900M access distance > 800M cache space.

However, I agree with your observation about the anon age below. When we start aging the anon set, we have to reflect that in the refault distances. Once we know that the 1G anon pages are indeed hotter than the pages in P, there is no reason to keep churning the workingset.

> > The only way they could get reclaimed is if their access distance ends up bigger than the file cache. But if that's the case, then the workingset is overcommitted, and none of the pages qualify for reclaim protection. Picking a subset to protect against the rest is arbitrary.
>
> In the fixed example, although the other file (500 MB) is repeatedly accessed, it's not the workingset. If we had a unified list (file + anon), the access distance of Pn would be larger than the whole memory size. Therefore, it's not an overcommitted workingset, and this patch wrongly tries to activate it. As I said before, without considering the inactive_age of the anon list, this calculation cannot be correct.

You're right. If we don't take anon age into account, the activations could be over-eager; however, so would counting IO cost and exerting pressure on anon be, which means my previous patch to split these two wouldn't fix the fundamental problem you're pointing out. We simply have to take anon age into account for the refaults to be comparable.

Once we do that, in your example, we would see activations in the beginning in order to challenge the combined workingset (active_file + anon) - which is legitimate as long as we don't know it's hot. And as the anon pages are scanned and rotated (and the challenged F(H) reactivated), the refault distances increase accordingly to reflect the size of the hot pages sampled, which will correctly put P's distances beyond the size of available memory.

However, your example cannot have a completely silent stable state. As we stop workingset aging, the refault distances will slowly increase again. We will always have a bit of churn, and rightfully so, because the workingset *could* go stale.

That's the same situation in my cache-only example above. Anytime you have a subset of pages that by itself could fit into memory, but can't because of an established workingset, ongoing sampling is necessary.

But the rate definitely needs to reduce as we detect that in-memory pages are indeed hot. Otherwise we cause more churn than is required for an appropriate rate of workingset sampling.

How about the patch below? It looks correct, but I will have to re-run my tests to make sure I / we are not missing anything.

---
From b105c85bf1cebfc4d1057f7228d7484c0ee77127 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 29 May 2020 09:40:00 -0400
Subject: [PATCH] mm: workingset: age nonresident information alongside
 anonymous pages

After ("mm: workingset: let cache workingset challenge anon fix"), we compare refault distances to active_file + anon. But the age of the non-resident information is only driven by the file LRU. As a result, we may overestimate the recency of any incoming refaults and activate them too eagerly, causing unnecessary LRU churn in certain situations.

Make anon aging drive nonresident age as well to address that.
Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  1 +
 mm/vmscan.c            |  3 +++
 mm/workingset.c        | 46 +++++++++++++++++++++++++-----------------
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e959602140b4..39459b8a7bb8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -255,8 +255,8 @@ struct lruvec {
 	 */
 	unsigned long			anon_cost;
 	unsigned long			file_cost;
-	/* Evictions & activations on the inactive file list */
-	atomic_long_t			inactive_age;
+	/* Non-resident age, driven by LRU movement */
+	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 157e5081bf98..216df7bdaf62 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -312,6 +312,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c628f9ab886b..2fee237063e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -904,6 +904,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		__delete_from_swap_cache(page, swap);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 		put_swap_page(page, swap);
+		workingset_eviction(page, target_memcg);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -1884,6 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			list_add(&page->lru, &pages_to_free);
 		} else {
 			nr_moved += nr_pages;
+			if (PageActive(page))
+				workingset_age_nonresident(lruvec, nr_pages);
 		}
 	}
 
diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..50b7937bab32 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -156,8 +156,8 @@
  *
  *		Implementation
  *
- * For each node's file LRU lists, a counter for inactive evictions
- * and activations is maintained (node->inactive_age).
+ * For each node's LRU lists, a counter for inactive evictions and
+ * activations is maintained (node->nonresident_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
  * identify the node) is stored in the now empty page cache
@@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @memcg: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
 {
 	/*
 	 * Reclaiming a cgroup means reclaiming all its children in a
@@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
 	 * the root cgroup's, age as well.
 	 */
 	do {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		atomic_long_inc(&lruvec->inactive_age);
-	} while (memcg && (memcg = parent_mem_cgroup(memcg)));
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 /**
@@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	advance_inactive_age(page_memcg(page), pgdat);
-
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->inactive_age);
+	eviction = atomic_long_read(&lruvec->nonresident_age);
 
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		goto out;
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->inactive_age);
+	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
 	/*
 	 * Calculate the refault distance
 	 *
 	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases. There is a
+	 * across nonresident_age overflows in most cases. There is a
 	 * special case: usually, shadow entries have a short lifetime
 	 * and are either refaulted or reclaimed along with the inode
 	 * before they get too old.  But it is not impossible for the
-	 * inactive_age to lap a shadow entry in the field, which can
-	 * then result in a false small refault distance, leading to a
-	 * false activation should this old entry actually refault
-	 * again. However, earlier kernels used to deactivate
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again. However, earlier kernels used to deactivate
 	 * unconditionally with *every* reclaim invocation for the
 	 * longest time, so the occasional inappropriate activation
 	 * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow)
 		goto out;
 
 	SetPageActive(page);
-	advance_inactive_age(memcg, pgdat);
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 	inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
 
 	/* Page was active prior to eviction */
@@ -382,6 +388,7 @@ void workingset_refault(struct page *page, void *shadow)
 void workingset_activation(struct page *page)
 {
 	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 
 	rcu_read_lock();
 	/*
@@ -394,7 +401,8 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	advance_inactive_age(memcg, page_pgdat(page));
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	workingset_age_nonresident(lruvec, hpage_nr_pages(page));
 out:
 	rcu_read_unlock();
 }
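To illustrate the mechanism the patch implements, here is a toy single-counter model (purely illustrative: one global clock instead of per-lruvec state, and the names are made up for this sketch, not kernel API):

    #include <stdio.h>

    /* One clock, advanced by LRU movement (rotations, activations, evictions). */
    static unsigned long nonresident_age;

    /* After the patch, anon rotations drive this too, not just the file LRU. */
    static void age_nonresident(unsigned long nr_pages)
    {
            nonresident_age += nr_pages;
    }

    /* Eviction snapshots the clock into the page's shadow entry. */
    static unsigned long evict_page(void)
    {
            age_nonresident(1);
            return nonresident_age;
    }

    /* Refault compares the current clock against the snapshot. */
    static unsigned long refault_distance(unsigned long shadow)
    {
            return nonresident_age - shadow;
    }

    int main(void)
    {
            unsigned long shadow = evict_page();    /* page P is reclaimed */

            /*
             * Hot anon pages get rotated: with the patch, this ages the
             * clock, so P's eventual refault looks colder, as intended.
             */
            age_nonresident(500);
            printf("refault distance: %lu\n", refault_distance(shadow));
            return 0;
    }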
On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
>
> > > *It would require another page flag to tell whether a refaulting cache page has challenged the anon set once (transitioning) or repeatedly (thrashing), as we currently use the active state for that. If we would repurpose PG_workingset to tell the first from the second refault, we'd need a new flag to mark a page for memstall accounting.
> >
> > I don't understand why a new flag is needed. Whenever we find that a challenge is needed (dist < active + anon), we need to add up the IO cost.
>
> It sounds like this was cleared up later on in the email.
>
> > > > It could cause thrashing with your patch. Without the patch, the current logic tries to find the hottest file pages that fit into the current file list size and protects them successfully. Assume that the access distance of the 50 MB of hot file pages is 60 MB, which is less than the whole file list size but larger than the inactive list size. Without your patch, the 50 MB (hot) pages are not evicted at all. All these hot pages will be protected from the 100 MB of low access frequency pages. The 100 MB of low access frequency pages will be refaulted repeatedly, but that's correct behaviour.
> > >
> > > Hm, something doesn't quite add up. Why is the 50M hot set evicted with my patch?
> >
> > Thanks for the kind explanation. I read it all, and I found that I was confused before. Please let me correct the example.
> >
> > Environment:
> > anon: 500 MB (hot) / 500 MB (hot)
> > file: 50 MB (so hot) / 50 MB (dummy)
> >
> > I will call the 50 MB of hot file pages F(H). Let's assume that periodical access to another file (500 MB) is started. That file consists of 5 parts, each one 100 MB. I will call them P1, P2, ..., P5.
> >
> > The problem will occur with the following access pattern:
> >
> > F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
> >
> > With this access pattern, the access distances of F(H) and Pn are:
> >
> > F(H) = 150 MB
> > Pn = 750 MB
> >
> > Without your patch, F(H) is kept in memory, since deactivation would not happen. However, with your patch, deactivation happens, since Pn's refault distance is less than 'active file + anon'. In the end, F(H) would finally be evicted.
>
> Okay, this example makes sense to me.
>
> I do think P needs to challenge the workingset - at first. P could easily fit into memory by itself if anon and active_file were cold, so we need to challenge them to find out that they're hot. As you can see, if anon and F(H) were completely unused, the current behavior would be incorrect.
>
> The current behavior would do the same in a cache-only example:
>
> anon = 1G (unreclaimable)
> file = 500M (hot) / 300M (dummy)
>
> P = 400M
>
> F(H) -> P1 -> F(H) -> P2 ...
>
> If F(H) is already active, the first P refaults would have a distance of 100M, thereby challenging F(H). As F(H) reactivates, its size will be reflected in the refault distances, pushing them beyond the size of memory that is available to the cache: 600M refault distance > 500M active cache, or 900M access distance > 800M cache space.

Hmm... It seems that the current behavior (before your patch) for this example has no problem. It causes deactivation but doesn't cause eviction, so there is no workingset thrashing.

> However, I agree with your observation about the anon age below. When we start aging the anon set, we have to reflect that in the refault distances. Once we know that the 1G anon pages are indeed hotter than the pages in P, there is no reason to keep churning the workingset.

Okay.

> > > The only way they could get reclaimed is if their access distance ends up bigger than the file cache. But if that's the case, then the workingset is overcommitted, and none of the pages qualify for reclaim protection. Picking a subset to protect against the rest is arbitrary.
> >
> > In the fixed example, although the other file (500 MB) is repeatedly accessed, it's not the workingset. If we had a unified list (file + anon), the access distance of Pn would be larger than the whole memory size. Therefore, it's not an overcommitted workingset, and this patch wrongly tries to activate it. As I said before, without considering the inactive_age of the anon list, this calculation cannot be correct.
>
> You're right. If we don't take anon age into account, the activations could be over-eager; however, so would counting IO cost and exerting pressure on anon be, which means my previous patch to split these two wouldn't fix the fundamental problem you're pointing out. We simply

Splitting would not fix the fundamental problem (over-eagerness), but it would greatly weaken it. Just counting IO cost doesn't break the active/inactive separation in the file list. It does cause more scanning of the anon list, but I think that's endurable.

> have to take anon age into account for the refaults to be comparable.

Yes, taking anon age into account is also a good candidate to fix the problem.

> Once we do that, in your example, we would see activations in the beginning in order to challenge the combined workingset (active_file + anon) - which is legitimate as long as we don't know it's hot. And as the anon pages are scanned and rotated (and the challenged F(H) reactivated), the refault distances increase accordingly to reflect the size of the hot pages sampled, which will correctly put P's distances beyond the size of available memory.

Okay.

> However, your example cannot have a completely silent stable state. As we stop workingset aging, the refault distances will slowly increase again. We will always have a bit of churn, and rightfully so, because the workingset *could* go stale.
>
> That's the same situation in my cache-only example above. Anytime you have a subset of pages that by itself could fit into memory, but can't because of an established workingset, ongoing sampling is necessary.
>
> But the rate definitely needs to reduce as we detect that in-memory pages are indeed hot. Otherwise we cause more churn than is required for an appropriate rate of workingset sampling.
>
> How about the patch below? It looks correct, but I will have to re-run my tests to make sure I / we are not missing anything.

Much better! It may solve my concern mostly.

But I still think that the modified refault activation equation isn't safe.
The next problem I found is related to the scan ratio limit patch ("limit the range of LRU type balancing") in this series. See the example below. anon: Hot (X M) file: Hot (200 M) / dummy (200 M) P: 1200 M (3 parts, each one 400 M, P1, P2, P3) Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... -> Without this patch, A and F(H) are kept in memory, which looks correct. With this patch and the fix below, the refault equation for Pn would be: Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan ratio (from anon non-resident); anon + active file = X + 200; 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200. Depending on the size of X, Pn's refault result would differ. Pn could be activated with a large enough X, and then F(H) could be evicted. In the ideal case (unified list), for this example, Pn should not be activated for any X. This is a fundamental problem since we have two list types (file/anon) and a scan ratio limit is required. Anyway, we need to take care of this reality, and the safest way is to count the IO cost instead of doing activation in this 'non-resident dist < (active + anon list)' case. Again, for this patch, I'm not confident myself, so please let me know if I'm wrong. Thanks.
On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote: > On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote: > > > On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote: > > > > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote: > > > > The only way they could get reclaimed is if their access distance ends > > > > up bigger than the file cache. But if that's the case, then the > > > > workingset is overcommitted, and none of the pages qualify for reclaim > > > > protection. Picking a subset to protect against the rest is arbitrary. > > > > > > In the fixed example, although the other file (500 MB) is repeatedly accessed, > > > it's not a workingset. If we had a unified list (file + anon), the access distance of > > > Pn would be larger than the whole memory size. Therefore, it's not an overcommitted > > > workingset, and this patch wrongly tries to activate it. As I said before, > > > without considering inactive_age for the anon list, this calculation cannot be > > > correct. > > > > You're right. If we don't take anon age into account, the activations > > could be over-eager; however, so would counting IO cost and exerting > > pressure on anon be, which means my previous patch to split these two > > wouldn't fix the fundamental problem you're pointing out. We simply > > Splitting would not fix the fundamental problem (over-eagerness), but it would > greatly mitigate it. Just counting IO cost doesn't break the > active/inactive separation in the file list. It does cause more scanning on the anon list, > but I think that's tolerable. I think the split is a good idea. The only thing I'm not sure about yet is whether we can get away without an additional page flag if the active flag cannot be reused to denote thrashing. I'll keep at it, maybe I can figure something out. But I think it would be follow-up work. > > have to take anon age into account for the refaults to be comparable. > > Yes, taking anon age into account is also a good candidate to fix the problem. Okay, good. > > However, your example cannot have a completely silent stable state. As > > we stop workingset aging, the refault distances will slowly increase > > again. We will always have a bit of churn, and rightfully so, because > > the workingset *could* go stale. > > > > That's the same situation in my cache-only example above. Anytime you > > have a subset of pages that by itself could fit into memory, but can't > > because of an established workingset, ongoing sampling is necessary. > > > > But the rate definitely needs to reduce as we detect that in-memory > > pages are indeed hot. Otherwise we cause more churn than is required > > for an appropriate rate of workingset sampling. > > > > How about the patch below? It looks correct, but I will have to re-run > > my tests to make sure I / we are not missing anything. > > Much better! It mostly solves my concern. Okay thanks for confirming. I'll send a proper version to Andrew. > But I still think that the modified refault activation equation isn't > safe. The next > problem I found is related to the scan ratio limit patch ("limit the range of > LRU type balancing") in this series.
> > anon: Hot (X M) > file: Hot (200 M) / dummy (200 M) > P: 1200 M (3 parts, each one 400 M, P1, P2, P3) > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... -> > > Without this patch, A and F(H) are kept in memory, which looks > correct. > > With this patch and the fix below, the refault equation for Pn would be: > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan > ratio (from anon non-resident) > anon + active file = X + 200 > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200 That doesn't look quite right to me. The anon part of the refault distance is driven by X, so the left-hand side of this formula contains X as well. 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000 Activations persist as long as anon isn't fully scanned and it isn't established yet that it's fully hot. Meaning, we optimistically assume the refaulting pages can be workingset until we're proven wrong. > Depending on the size of X, Pn's refault result would differ. Pn could > be activated with a large enough X, and then F(H) could be evicted. In the ideal > case (unified list), for this example, Pn should not be activated for any X. Yes. The active/iocost split would allow us to be smarter about it. > This is a fundamental problem since we have two list types (file/anon) and > a scan ratio limit is required. Anyway, we need to take care of this reality, and > the safest way is to count the IO cost instead of doing activation in this > 'non-resident dist < (active + anon list)' case. Agreed here again. > Again, for this patch, I'm not confident myself, so please let me know if I'm > wrong. As far as this patch goes, I think it's important to look at the bigger picture. We need to have convergence first before being able to worry about optimizing. Stable states are optimizations, but false stable states are correctness problems. For the longest time, we scanned active pages unconditionally during page reclaim. This was always safe in the sense that it wouldn't get stuck on a stale workingset, but it incurs unnecessary workingset churn when reclaim is driven by use-once patterns. We optimized the latter too aggressively, and as a result caused situations where we indefinitely fail to cache the hottest data. That's not really a workable trade-off. With the active/iocost split you're suggesting, we can reasonably optimize your example scenario. But we can't do it if the flipside means complete failure to transition between in-memory sets. So I think we should go ahead with this patch (with the anon age recognition fixed, because that's a correctness issue), and follow it up with the stable state optimization to shrink anon first.
On Mon, Jun 01, 2020 at 11:56:17AM -0400, Johannes Weiner wrote: > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote: > > On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > However, your example cannot have a completely silent stable state. As > > > we stop workingset aging, the refault distances will slowly increase > > > again. We will always have a bit of churn, and rightfully so, because > > > the workingset *could* go stale. > > > > > > That's the same situation in my cache-only example above. Anytime you > > > have a subset of pages that by itself could fit into memory, but can't > > > because of an established workingset, ongoing sampling is necessary. > > > > > > But the rate definitely needs to reduce as we detect that in-memory > > > pages are indeed hot. Otherwise we cause more churn than is required > > > for an appropriate rate of workingset sampling. > > > > > > How about the patch below? It looks correct, but I will have to re-run > > > my tests to make sure I / we are not missing anything. > > > > Much better! It mostly solves my concern. > > Okay thanks for confirming. I'll send a proper version to Andrew. I sent the below patch through my battery of tests to make sure it doesn't break anything fundamental, and it didn't. For the convergence test, the overload test, and the kernel build, the difference, if any, is in the noise, which is expected. Andrew, absent any objections, can you please fold the below into the original patch? It's a bit noisy due to the rename of inactive_age now that it's used for anon as well (frankly, it was a misnomer from the start), and switching the interface to take an lruvec instead of a memcg and a pgdat, which works better for the callers (I stole parent_lruvec() from "mm: vmscan: determine anon/file pressure balance at the reclaim root" later in the series, so that isn't new code). The main change is the two bits in __remove_mapping and move_pages_to_lru, so that shadow entries are aged when anon is reclaimed or rotated. --- From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Mon, 1 Jun 2020 14:04:09 -0400 Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix We compare refault distances to active_file + anon. But age of the non-resident information is only driven by the file LRU. As a result, we may overestimate the recency of any incoming refaults and activate them too eagerly, causing unnecessary LRU churn in certain situations. Make anon aging drive nonresident age as well to address that.
Reported-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/memcontrol.h | 13 +++++++++++ include/linux/mmzone.h | 4 ++-- include/linux/swap.h | 1 + mm/vmscan.c | 3 +++ mm/workingset.c | 46 ++++++++++++++++++++++---------------- 5 files changed, 46 insertions(+), 21 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 32a0b4d47540..d982c80da157 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page, mod_lruvec_page_state(page, idx, -1); } +static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) +{ + struct mem_cgroup *memcg; + + memcg = lruvec_memcg(lruvec); + if (!memcg) + return NULL; + memcg = parent_mem_cgroup(memcg); + if (!memcg) + return NULL; + return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); +} + #ifdef CONFIG_CGROUP_WRITEBACK struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c1fbda9ddd1f..7cccbb8bc2d7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -262,8 +262,8 @@ enum lruvec_flags { struct lruvec { struct list_head lists[NR_LRU_LISTS]; struct zone_reclaim_stat reclaim_stat; - /* Evictions & activations on the inactive file list */ - atomic_long_t inactive_age; + /* Non-resident age, driven by LRU movement */ + atomic_long_t nonresident_age; /* Refaults at the time of last reclaim cycle */ unsigned long refaults; /* Various lruvec state flags (enum lruvec_flags) */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 30fd4641890e..f0d2dca28c99 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -312,6 +312,7 @@ struct vma_swap_readahead { }; /* linux/mm/workingset.c */ +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages); void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg); void workingset_refault(struct page *page, void *shadow); void workingset_activation(struct page *page); diff --git a/mm/vmscan.c b/mm/vmscan.c index 43f88b1a4f14..3033f1c951cd 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -898,6 +898,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, __delete_from_swap_cache(page, swap); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); + workingset_eviction(page, target_memcg); } else { void (*freepage)(struct page *); void *shadow = NULL; @@ -1876,6 +1877,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, list_add(&page->lru, &pages_to_free); } else { nr_moved += nr_pages; + if (PageActive(page)) + workingset_age_nonresident(lruvec, nr_pages); } } diff --git a/mm/workingset.c b/mm/workingset.c index e69865739539..98b91e623b85 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -156,8 +156,8 @@ * * Implementation * - * For each node's file LRU lists, a counter for inactive evictions - * and activations is maintained (node->inactive_age). + * For each node's LRU lists, a counter for inactive evictions and + * activations is maintained (node->nonresident_age). 
* * On eviction, a snapshot of this counter (along with some bits to * identify the node) is stored in the now empty page cache @@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, *workingsetp = workingset; } -static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) +/** + * workingset_age_nonresident - age non-resident entries as LRU ages + * @lruvec: the lruvec that was aged + * @nr_pages: the number of pages to count + * + * As in-memory pages are aged, non-resident pages need to be aged as + * well, in order for the refault distances later on to be comparable + * to the in-memory dimensions. This function allows reclaim and LRU + * operations to drive the non-resident aging along in parallel. + */ +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages) { /* * Reclaiming a cgroup means reclaiming all its children in a @@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) * the root cgroup's, age as well. */ do { - struct lruvec *lruvec; - - lruvec = mem_cgroup_lruvec(memcg, pgdat); - atomic_long_inc(&lruvec->inactive_age); - } while (memcg && (memcg = parent_mem_cgroup(memcg))); + atomic_long_add(nr_pages, &lruvec->nonresident_age); + } while ((lruvec = parent_lruvec(lruvec))); } /** @@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) VM_BUG_ON_PAGE(page_count(page), page); VM_BUG_ON_PAGE(!PageLocked(page), page); - advance_inactive_age(page_memcg(page), pgdat); - lruvec = mem_cgroup_lruvec(target_memcg, pgdat); + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); /* XXX: target_memcg can be NULL, go through lruvec */ memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); - eviction = atomic_long_read(&lruvec->inactive_age); + eviction = atomic_long_read(&lruvec->nonresident_age); return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); } @@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow) if (!mem_cgroup_disabled() && !eviction_memcg) goto out; eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); - refault = atomic_long_read(&eviction_lruvec->inactive_age); + refault = atomic_long_read(&eviction_lruvec->nonresident_age); /* * Calculate the refault distance * * The unsigned subtraction here gives an accurate distance - * across inactive_age overflows in most cases. There is a + * across nonresident_age overflows in most cases. There is a * special case: usually, shadow entries have a short lifetime * and are either refaulted or reclaimed along with the inode * before they get too old. But it is not impossible for the - * inactive_age to lap a shadow entry in the field, which can - * then result in a false small refault distance, leading to a - * false activation should this old entry actually refault - * again. However, earlier kernels used to deactivate + * nonresident_age to lap a shadow entry in the field, which + * can then result in a false small refault distance, leading + * to a false activation should this old entry actually + * refault again. However, earlier kernels used to deactivate * unconditionally with *every* reclaim invocation for the * longest time, so the occasional inappropriate activation * leading to pressure on the active list is not a problem.
@@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow) goto out; SetPageActive(page); - advance_inactive_age(memcg, pgdat); + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); /* Page was active prior to eviction */ @@ -378,6 +384,7 @@ void workingset_refault(struct page *page, void *shadow) void workingset_activation(struct page *page) { struct mem_cgroup *memcg; + struct lruvec *lruvec; rcu_read_lock(); /* @@ -390,7 +397,8 @@ void workingset_activation(struct page *page) memcg = page_memcg_rcu(page); if (!mem_cgroup_disabled() && !memcg) goto out; - advance_inactive_age(memcg, page_pgdat(page)); + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); out: rcu_read_unlock(); }
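To make the shadow-entry arithmetic in the patch above concrete, here is a minimal userspace sketch of the mechanism: a snapshot of nonresident_age is stored in the shadow entry at eviction time, and the refault distance is the unsigned difference when the page faults back in, which stays accurate across counter overflow. The struct and function names are illustrative only, not the kernel's.

#include <stdio.h>

/*
 * Toy model of the non-resident aging above: one counter per lruvec,
 * advanced whenever in-memory pages age (activations, and with the
 * patch, anon rotations and evictions too).
 */
struct toy_lruvec {
	unsigned long nonresident_age;
};

/* LRU movement drives the clock forward. */
static void age(struct toy_lruvec *lruvec, unsigned long nr_pages)
{
	lruvec->nonresident_age += nr_pages;
}

/* At eviction, the shadow entry stores the current reading. */
static unsigned long snapshot(const struct toy_lruvec *lruvec)
{
	return lruvec->nonresident_age;
}

int main(void)
{
	struct toy_lruvec lruvec = { 0 };
	unsigned long eviction, refault;

	eviction = snapshot(&lruvec);	/* page evicted, snapshot saved */
	age(&lruvec, 300);		/* 300 pages aged in the meantime */
	refault = snapshot(&lruvec);	/* page faults back in */
	printf("refault distance: %lu pages\n", refault - eviction);
	return 0;
}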
On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote: > > On Sat, May 30, 2020 at 12:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote: > > > > On Fri, May 29, 2020 at 2:02 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote: > > > > > > On Wed, May 27, 2020 at 10:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote: > > > > > The only way they could get reclaimed is if their access distance ends > > > > > up bigger than the file cache. But if that's the case, then the > > > > > workingset is overcommitted, and none of the pages qualify for reclaim > > > > > protection. Picking a subset to protect against the rest is arbitrary. > > > > > > > > In the fixed example, although the other file (500 MB) is repeatedly accessed, > > > > it's not a workingset. If we had a unified list (file + anon), the access distance of > > > > Pn would be larger than the whole memory size. Therefore, it's not an overcommitted > > > > workingset, and this patch wrongly tries to activate it. As I said before, > > > > without considering inactive_age for the anon list, this calculation cannot be > > > > correct. > > > > > > You're right. If we don't take anon age into account, the activations > > > could be over-eager; however, so would counting IO cost and exerting > > > pressure on anon be, which means my previous patch to split these two > > > wouldn't fix the fundamental problem you're pointing out. We simply > > > > Splitting would not fix the fundamental problem (over-eagerness), but it would > > greatly mitigate it. Just counting IO cost doesn't break the > > active/inactive separation in the file list. It does cause more scanning on the anon list, > > but I think that's tolerable. > > I think the split is a good idea. > > The only thing I'm not sure about yet is whether we can get away without an > additional page flag if the active flag cannot be reused to denote > thrashing. I'll keep at it, maybe I can figure something out. > > But I think it would be follow-up work. > > > > have to take anon age into account for the refaults to be comparable. > > > > Yes, taking anon age into account is also a good candidate to fix the problem. > > Okay, good. > > > > However, your example cannot have a completely silent stable state. As > > > we stop workingset aging, the refault distances will slowly increase > > > again. We will always have a bit of churn, and rightfully so, because > > > the workingset *could* go stale. > > > > > > That's the same situation in my cache-only example above. Anytime you > > > have a subset of pages that by itself could fit into memory, but can't > > > because of an established workingset, ongoing sampling is necessary. > > > > > > But the rate definitely needs to reduce as we detect that in-memory > > > pages are indeed hot. Otherwise we cause more churn than is required > > > for an appropriate rate of workingset sampling. > > > > > > How about the patch below? It looks correct, but I will have to re-run > > > my tests to make sure I / we are not missing anything. > > > > Much better! It mostly solves my concern. > > Okay thanks for confirming. I'll send a proper version to Andrew. Okay. > > But I still think that the modified refault activation equation isn't > > safe.
The next > > problem I found is related to the scan ratio limit patch ("limit the range of > > LRU type balancing") in this series. See the example below. > > > > anon: Hot (X M) > > file: Hot (200 M) / dummy (200 M) > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3) > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... -> > > > > Without this patch, A and F(H) are kept in memory, which looks > > correct. > > > > With this patch and the fix below, the refault equation for Pn would be: > > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan > > ratio (from anon non-resident) > > anon + active file = X + 200 > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200 > > That doesn't look quite right to me. The anon part of the refault > distance is driven by X, so the left-hand side of this formula contains X > as well. > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000 As I said before, there is no X on the left-hand side of this formula. To access all Pn and re-access P1, we need 1200M of file list scanning and reclaim. More scanning isn't needed. With your patch "limit the range of LRU type balancing", the scan ratio between the file/anon lists is limited to 0.5 ~ 2.0, so the maximum anon scan would be 1200 M * 2.0, that is, 2400 M, and not bounded by X. That means the file list cannot be stable for some X. > Activations persist as long as anon isn't fully scanned and it isn't > established yet that it's fully hot. Meaning, we optimistically assume > the refaulting pages can be workingset until we're proven wrong. > > > Depending on the size of X, Pn's refault result would differ. Pn could > > be activated with a large enough X, and then F(H) could be evicted. In the ideal > > case (unified list), for this example, Pn should not be activated for any X. > > Yes. The active/iocost split would allow us to be smarter about it. > > > This is a fundamental problem since we have two list types (file/anon) and > > a scan ratio limit is required. Anyway, we need to take care of this reality, and > > the safest way is to count the IO cost instead of doing activation in this > > 'non-resident dist < (active + anon list)' case. > > Agreed here again. > > > Again, for this patch, I'm not confident myself, so please let me know if I'm > > wrong. > > As far as this patch goes, I think it's important to look at the > bigger picture. > > We need to have convergence first before being able to worry about > optimizing. Stable states are optimizations, but false stable states > are correctness problems. > > For the longest time, we scanned active pages unconditionally during > page reclaim. This was always safe in the sense that it wouldn't get > stuck on a stale workingset, but it incurs unnecessary workingset > churn when reclaim is driven by use-once patterns. > > We optimized the latter too aggressively, and as a result caused > situations where we indefinitely fail to cache the hottest > data. That's not really a workable trade-off. > > With the active/iocost split you're suggesting, we can reasonably > optimize your example scenario. But we can't do it if the flipside > means complete failure to transition between in-memory sets. > > So I think we should go ahead with this patch (with the anon age > recognition fixed, because that's a correctness issue), and follow it > up with the stable state optimization to shrink anon first.
If my most recently found example is correct (your confirmation is required), it is also a correctness issue, since the cold pages repeatedly cause eviction of the hot pages. In this case, both (without the patch, with the patch) have some correctness issue, so we need to judge which one is better in terms of overall impact. I don't have a strong opinion about it, so it's up to you to decide the way to go. Thanks.
On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote: > On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote: > > > But I still think that the modified refault activation equation isn't > > > safe. The next > > > problem I found is related to the scan ratio limit patch ("limit the range of > > > LRU type balancing") in this series. See the example below. > > > > > > anon: Hot (X M) > > > file: Hot (200 M) / dummy (200 M) > > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3) > > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... -> > > > > > > Without this patch, A and F(H) are kept in memory, which looks > > > correct. > > > > > > With this patch and the fix below, the refault equation for Pn would be: > > > > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan > > > ratio (from anon non-resident) > > > anon + active file = X + 200 > > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200 > > > > That doesn't look quite right to me. The anon part of the refault > > distance is driven by X, so the left-hand side of this formula contains X > > as well. > > > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000 > > As I said before, there is no X on the left-hand side of this formula. To > access all Pn and > re-access P1, we need 1200M of file list scanning and reclaim. More scanning isn't needed. > With your patch "limit the range of LRU type balancing", the scan ratio > between the file/anon > lists is limited to 0.5 ~ 2.0, so the maximum anon scan would be 1200 M * > 2.0, that is, > 2400 M, and not bounded by X. That means the file list cannot be > stable for some X. Oh, no X on the left because you're talking about the number of pages scanned until the first refaults, which is fixed - so why are we still interpreting the refault distance against a variable anon size X? Well, that's misleading. We compare against anon because part of the cache is already encoded in the refault distance. What we're really checking is the access distance against the total amount of available RAM. Consider this. We want to activate pages where access_distance <= RAM and our measure of access distance is: access_distance = refault_distance + inactive_file So the comparison becomes: refault_distance + inactive_file < RAM which we simplify to: refault_distance < active_file + anon There is a certain threshold for X simply because there is a certain threshold for RAM beyond which we need to start activating. X cannot be arbitrary, it must be X + cache filling up memory - after all, we have page reclaim evicting pages. Again, this isn't new. In the current code, we activate when: refault_distance < active_file which is access_distance <= RAM - anon You can see, whether things are stable or not always depends on the existing workingset size. It's just a proxy for how much total RAM we have potentially available to the refaulting page. > If my most recently found example is correct (your confirmation is required), > it is also a correctness issue, since the cold pages repeatedly > cause eviction of the hot pages. I think your example is correct, but it benefits from the VM arbitrarily making an assumption that has a 50/50 shot of being true. You and I know which pages are hot and which are cold because you designed the example. All the VM sees is this: - We have an established workingset that has previously shown an access distance <= RAM and therefore was activated.
- We now have another set that also appears to have an access distance <= RAM. The only way to know for sure, however, is to sample the established workingset and compare the relative access frequencies. Currently, we just assume the incoming pages are colder. Clearly that's beneficial when it's true. Clearly that can be totally wrong. We must allow a fair comparison between these two sets. For cache, that's already the case - that's why I brought up the cache-only example: if refault distances are 50M and you have 60M of active cache, we activate all refaults and force an even competition between the established workingset and the new pages. Whether we can protect the active file list when anon needs to shrink first and can't (the activate/iocost split) is a different question. But I'm no longer so sure after looking into it further. First, we would need two different refault distances: either we consider anon age and need to compare to active_file + anon, or we don't and compare to active_file only. We cannot mix willy-nilly, because the metrics wouldn't be comparable. We don't have the space to store two different eviction timestamps, nor could we afford to cut the precision in half. Second, there is the additional page flag required to implement it. Third, it's somewhat moot because we still have the same behavior when active_file would need to shrink and can't. There can't be a stable state as long as refault distances <= active_file. > In this case, both (without the patch, with the patch) have some correctness > issue, so we need to judge which one is better in terms of overall impact. > I don't have a strong opinion about it, so it's up to you to decide the way to go. If my patch were simply changing the default assumption on which pages are hot and which are cold, I would agree with you - the pros would be equal to the cons, one way wouldn't be more correct than the other. But that isn't what my patch is doing. What it does is get rid of the assumption, to actually sample and compare the access frequencies when there isn't enough data to make an informed decision. That's a net improvement.
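For concreteness, the activation rule derived in this exchange can be restated as a small standalone C sketch. The function name and the decomposition of RAM into the three list sizes are illustrative assumptions, not the kernel implementation; note also that the kernel's actual test (see the diff at the end of this thread) uses <=, while the prose above writes < loosely.

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the activation rule. All quantities are page counts;
 * "ram" stands for the memory available to userspace pages, i.e.
 * inactive_file + active_file + anon.
 */
static bool should_activate(unsigned long refault_distance,
			    unsigned long inactive_file,
			    unsigned long active_file,
			    unsigned long anon)
{
	unsigned long access_distance = refault_distance + inactive_file;
	unsigned long ram = inactive_file + active_file + anon;

	/*
	 * access_distance <= ram simplifies to
	 * refault_distance <= active_file + anon, which is what the
	 * patched code checks; the old code compared against
	 * active_file alone, i.e. access_distance <= ram - anon.
	 */
	return access_distance <= ram;
}

int main(void)
{
	/* The cache-only example: 50M refault distance, 40M inactive
	 * and 60M active cache, no anon -> the refault activates. */
	printf("activate: %d\n", should_activate(50, 40, 60, 0));
	return 0;
}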
On Wed, Jun 3, 2020 at 1:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote: > > On Tue, Jun 2, 2020 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote: > > > > But I still think that the modified refault activation equation isn't > > > > safe. The next > > > > problem I found is related to the scan ratio limit patch ("limit the range of > > > > LRU type balancing") in this series. See the example below. > > > > > > > > anon: Hot (X M) > > > > file: Hot (200 M) / dummy (200 M) > > > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3) > > > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... -> > > > > > > > > Without this patch, A and F(H) are kept in memory, which looks > > > > correct. > > > > > > > > With this patch and the fix below, the refault equation for Pn would be: > > > > > > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan > > > > ratio (from anon non-resident) > > > > anon + active file = X + 200 > > > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200 > > > > > > That doesn't look quite right to me. The anon part of the refault > > > distance is driven by X, so the left-hand side of this formula contains X > > > as well. > > > > > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations + X * scan ratio < X + 1000 > > > > As I said before, there is no X on the left-hand side of this formula. To > > access all Pn and > > re-access P1, we need 1200M of file list scanning and reclaim. More scanning isn't needed. > > With your patch "limit the range of LRU type balancing", the scan ratio > > between the file/anon > > lists is limited to 0.5 ~ 2.0, so the maximum anon scan would be 1200 M * > > 2.0, that is, > > 2400 M, and not bounded by X. That means the file list cannot be > > stable for some X. > > Oh, no X on the left because you're talking about the number of pages > scanned until the first refaults, which is fixed - so why are we still > interpreting the refault distance against a variable anon size X? Looks like I was confused again. Your formula is correct and mine is wrong. My mistake was that I thought your patch "limit the range of LRU type balancing", which makes the scan *ratio* 2:1, leads to an actual scan *count* ratio between anon/file of 2:1. But now I realize that 2:1 is just the scan ratio, and the actual scan *count* ratio could be far larger with certain list sizes. It would be X * scan ratio in the above example, so my explanation was wrong and you are right. Sorry for the trouble. > Well, that's misleading. We compare against anon because part of the > cache is already encoded in the refault distance. What we're really > checking is the access distance against the total amount of available RAM. > > Consider this. We want to activate pages where > > access_distance <= RAM > > and our measure of access distance is: > > access_distance = refault_distance + inactive_file > > So the comparison becomes: > > refault_distance + inactive_file < RAM > > which we simplify to: > > refault_distance < active_file + anon > > There is a certain threshold for X simply because there is a certain > threshold for RAM beyond which we need to start activating. X cannot > be arbitrary, it must be X + cache filling up memory - after all, we > have page reclaim evicting pages. > > Again, this isn't new.
In the current code, we activate when: > > refault_distance < active_file > > which is > > access_distance <= RAM - anon > > You can see, whether things are stable or not always depends on the > existing workingset size. It's just a proxy for how much total RAM we > have potentially available to the refaulting page. > > > If my most recently found example is correct (your confirmation is required), > > it is also a correctness issue, since the cold pages repeatedly > > cause eviction of the hot pages. > > I think your example is correct, but it benefits from the VM > arbitrarily making an assumption that has a 50/50 shot of being true. > > You and I know which pages are hot and which are cold because you > designed the example. > > All the VM sees is this: > > - We have an established workingset that has previously shown an > access distance <= RAM and therefore was activated. > > - We now have another set that also appears to have an access distance > <= RAM. The only way to know for sure, however, is to sample the > established workingset and compare the relative access frequencies. > > Currently, we just assume the incoming pages are colder. Clearly > that's beneficial when it's true. Clearly that can be totally wrong. > > We must allow a fair comparison between these two sets. > > For cache, that's already the case - that's why I brought up the > cache-only example: if refault distances are 50M and you have 60M of > active cache, we activate all refaults and force an even competition > between the established workingset and the new pages. > > Whether we can protect the active file list when anon needs to shrink first and > can't (the activate/iocost split) is a different question. But I'm > no longer so sure after looking into it further. > > First, we would need two different refault distances: either we > consider anon age and need to compare to active_file + anon, or we > don't and compare to active_file only. We cannot mix willy-nilly, > because the metrics wouldn't be comparable. We don't have the space to > store two different eviction timestamps, nor could we afford to cut > the precision in half. > > Second, there is the additional page flag required to implement it. > > Third, it's somewhat moot because we still have the same behavior when > active_file would need to shrink and can't. There can't be a stable > state as long as refault distances <= active_file. > > > In this case, both (without the patch, with the patch) have some correctness > > issue, so we need to judge which one is better in terms of overall impact. > > I don't have a strong opinion about it, so it's up to you to decide the way to go. > > If my patch were simply changing the default assumption on which pages > are hot and which are cold, I would agree with you - the pros would be > equal to the cons, one way wouldn't be more correct than the other. > > But that isn't what my patch is doing. What it does is get rid of the > assumption, to actually sample and compare the access frequencies when > there isn't enough data to make an informed decision. > > That's a net improvement. Okay. Now I think that this patch is a net improvement. Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Thanks.
On 6/1/20 10:44 PM, Johannes Weiner wrote: > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <hannes@cmpxchg.org> > Date: Mon, 1 Jun 2020 14:04:09 -0400 > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix Looks like the whole series is now in mainline (that was quick!) without this followup? As such it won't be squashed so perhaps the subject should be more "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset: let cache workingset challenge anon"), possibly with Fixes: tag etc... > We compare refault distances to active_file + anon. But age of the > non-resident information is only driven by the file LRU. As a result, > we may overestimate the recency of any incoming refaults and activate > them too eagerly, causing unnecessary LRU churn in certain situations. > > Make anon aging drive nonresident age as well to address that. > > Reported-by: Joonsoo Kim <js1304@gmail.com> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > include/linux/memcontrol.h | 13 +++++++++++ > include/linux/mmzone.h | 4 ++-- > include/linux/swap.h | 1 + > mm/vmscan.c | 3 +++ > mm/workingset.c | 46 ++++++++++++++++++++++---------------- > 5 files changed, 46 insertions(+), 21 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 32a0b4d47540..d982c80da157 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page, > mod_lruvec_page_state(page, idx, -1); > } > > +static inline struct lruvec *parent_lruvec(struct lruvec *lruvec) > +{ > + struct mem_cgroup *memcg; > + > + memcg = lruvec_memcg(lruvec); > + if (!memcg) > + return NULL; > + memcg = parent_mem_cgroup(memcg); > + if (!memcg) > + return NULL; > + return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); > +} > + > #ifdef CONFIG_CGROUP_WRITEBACK > > struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index c1fbda9ddd1f..7cccbb8bc2d7 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -262,8 +262,8 @@ enum lruvec_flags { > struct lruvec { > struct list_head lists[NR_LRU_LISTS]; > struct zone_reclaim_stat reclaim_stat; > - /* Evictions & activations on the inactive file list */ > - atomic_long_t inactive_age; > + /* Non-resident age, driven by LRU movement */ > + atomic_long_t nonresident_age; > /* Refaults at the time of last reclaim cycle */ > unsigned long refaults; > /* Various lruvec state flags (enum lruvec_flags) */ > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 30fd4641890e..f0d2dca28c99 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -312,6 +312,7 @@ struct vma_swap_readahead { > }; > > /* linux/mm/workingset.c */ > +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages); > void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg); > void workingset_refault(struct page *page, void *shadow); > void workingset_activation(struct page *page); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 43f88b1a4f14..3033f1c951cd 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -898,6 +898,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > __delete_from_swap_cache(page, swap); > xa_unlock_irqrestore(&mapping->i_pages, flags); > put_swap_page(page, swap); > + workingset_eviction(page, target_memcg); > } else 
{ > void (*freepage)(struct page *); > void *shadow = NULL; > @@ -1876,6 +1877,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, > list_add(&page->lru, &pages_to_free); > } else { > nr_moved += nr_pages; > + if (PageActive(page)) > + workingset_age_nonresident(lruvec, nr_pages); > } > } > > diff --git a/mm/workingset.c b/mm/workingset.c > index e69865739539..98b91e623b85 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -156,8 +156,8 @@ > * > * Implementation > * > - * For each node's file LRU lists, a counter for inactive evictions > - * and activations is maintained (node->inactive_age). > + * For each node's LRU lists, a counter for inactive evictions and > + * activations is maintained (node->nonresident_age). > * > * On eviction, a snapshot of this counter (along with some bits to > * identify the node) is stored in the now empty page cache > @@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, > *workingsetp = workingset; > } > > -static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) > +/** > + * workingset_age_nonresident - age non-resident entries as LRU ages > + * @lruvec: the lruvec that was aged > + * @nr_pages: the number of pages to count > + * > + * As in-memory pages are aged, non-resident pages need to be aged as > + * well, in order for the refault distances later on to be comparable > + * to the in-memory dimensions. This function allows reclaim and LRU > + * operations to drive the non-resident aging along in parallel. > + */ > +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages) > { > /* > * Reclaiming a cgroup means reclaiming all its children in a > @@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) > * the root cgroup's, age as well. > */ > do { > - struct lruvec *lruvec; > - > - lruvec = mem_cgroup_lruvec(memcg, pgdat); > - atomic_long_inc(&lruvec->inactive_age); > - } while (memcg && (memcg = parent_mem_cgroup(memcg))); > + atomic_long_add(nr_pages, &lruvec->nonresident_age); > + } while ((lruvec = parent_lruvec(lruvec))); > } > > /** > @@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) > VM_BUG_ON_PAGE(page_count(page), page); > VM_BUG_ON_PAGE(!PageLocked(page), page); > > - advance_inactive_age(page_memcg(page), pgdat); > - > lruvec = mem_cgroup_lruvec(target_memcg, pgdat); > + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); > /* XXX: target_memcg can be NULL, go through lruvec */ > memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); > - eviction = atomic_long_read(&lruvec->inactive_age); > + eviction = atomic_long_read(&lruvec->nonresident_age); > return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); > } > > @@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow) > if (!mem_cgroup_disabled() && !eviction_memcg) > goto out; > eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); > - refault = atomic_long_read(&eviction_lruvec->inactive_age); > + refault = atomic_long_read(&eviction_lruvec->nonresident_age); > > /* > * Calculate the refault distance > * > * The unsigned subtraction here gives an accurate distance > - * across inactive_age overflows in most cases. There is a > + * across nonresident_age overflows in most cases. There is a > * special case: usually, shadow entries have a short lifetime > * and are either refaulted or reclaimed along with the inode > * before they get too old.
But it is not impossible for the > - * inactive_age to lap a shadow entry in the field, which can > - * then result in a false small refault distance, leading to a > - * false activation should this old entry actually refault > - * again. However, earlier kernels used to deactivate > + * nonresident_age to lap a shadow entry in the field, which > + * can then result in a false small refault distance, leading > + * to a false activation should this old entry actually > + * refault again. However, earlier kernels used to deactivate > * unconditionally with *every* reclaim invocation for the > * longest time, so the occasional inappropriate activation > * leading to pressure on the active list is not a problem. > @@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow) > goto out; > > SetPageActive(page); > - advance_inactive_age(memcg, pgdat); > + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); > inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); > > /* Page was active prior to eviction */ > @@ -378,6 +384,7 @@ void workingset_refault(struct page *page, void *shadow) > void workingset_activation(struct page *page) > { > struct mem_cgroup *memcg; > + struct lruvec *lruvec; > > rcu_read_lock(); > /* > @@ -390,7 +397,8 @@ void workingset_activation(struct page *page) > memcg = page_memcg_rcu(page); > if (!mem_cgroup_disabled() && !memcg) > goto out; > - advance_inactive_age(memcg, page_pgdat(page)); > + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); > + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); > out: > rcu_read_unlock(); > } >
On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote: > On 6/1/20 10:44 PM, Johannes Weiner wrote: > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001 > > From: Johannes Weiner <hannes@cmpxchg.org> > > Date: Mon, 1 Jun 2020 14:04:09 -0400 > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix > > Looks like the whole series is now in mainline (that was quick!) without this > followup? As such it won't be squashed so perhaps the subject should be more > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset: > let cache workingset challenge anon"), possibly with Fixes: tag etc... Yep, I'll send a stand-alone version of this. It was a bit fat to get squashed last minute anyway, and I don't mind having a bit of extra time to quantify the actual impact of this delta. Thanks Vlastimil!
On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote: > > On 6/1/20 10:44 PM, Johannes Weiner wrote: > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001 > > > From: Johannes Weiner <hannes@cmpxchg.org> > > > Date: Mon, 1 Jun 2020 14:04:09 -0400 > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix > > > > Looks like the whole series is now in mainline (that was quick!) without this > > followup? As such it won't be squashed so perhaps the subject should be more > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset: > > let cache workingset challenge anon"), possibly with Fixes: tag etc... > > Yep, I'll send a stand-alone version of this. It was a bit fat to get > squashed last minute anyway, and I don't mind having a bit of extra > time to quantify the actual impact of this delta. Hello, Johannes. Now a week has passed. I hope this patch can be merged as soon as possible, since my WIP patchset depends on it. How about trying to merge it now? If you don't mind, I could send it on your behalf. Thanks.
On Fri, Jun 12, 2020 at 12:19:58PM +0900, Joonsoo Kim wrote: > On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote: > > > On 6/1/20 10:44 PM, Johannes Weiner wrote: > > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001 > > > > From: Johannes Weiner <hannes@cmpxchg.org> > > > > Date: Mon, 1 Jun 2020 14:04:09 -0400 > > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix > > > > > > Looks like the whole series is now in mainline (that was quick!) without this > > > followup? As such it won't be squashed so perhaps the subject should be more > > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset: > > > let cache workingset challenge anon"), possibly with Fixes: tag etc... > > > > Yep, I'll send a stand-alone version of this. It was a bit fat to get > > squashed last minute anyway, and I don't mind having a bit of extra > > time to quantify the actual impact of this delta. > > Hello, Johannes. > > Now a week has passed. I hope this patch can be merged as soon as possible, > since my WIP patchset depends on it. How about trying to merge > it now? If you don't mind, I could send it on your behalf. I didn't realize you directly depended on it, my apologies. Do you depend on it code-wise, or do you have a test case that shows the difference? Here is the latest version again; you can include it in your patch series until we get it merged. But I think we should be able to merge it as a follow-up fix into 5.8 still. --- From d604062a66bc88ad1a1529c9e16952dc0d7c01c6 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Fri, 29 May 2020 09:40:00 -0400 Subject: [PATCH] mm: workingset: age nonresident information alongside anonymous pages After commit 34e58cac6d8f ("mm: workingset: let cache workingset challenge anon"), we compare refault distances to active_file + anon. But age of the non-resident information is only driven by the file LRU. As a result, we may overestimate the recency of any incoming refaults and activate them too eagerly, causing unnecessary LRU churn in certain situations. Make anon aging drive nonresident age as well to address that.
Reported-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/mmzone.h | 4 ++-- include/linux/swap.h | 1 + mm/vmscan.c | 3 +++ mm/workingset.c | 46 +++++++++++++++++++++++++----------------- 4 files changed, 33 insertions(+), 21 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c4c37fd12104..f6f884970511 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -257,8 +257,8 @@ struct lruvec { */ unsigned long anon_cost; unsigned long file_cost; - /* Evictions & activations on the inactive file list */ - atomic_long_t inactive_age; + /* Non-resident age, driven by LRU movement */ + atomic_long_t nonresident_age; /* Refaults at the time of last reclaim cycle */ unsigned long refaults; /* Various lruvec state flags (enum lruvec_flags) */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 4c5974bb9ba9..5b3216ba39a9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -313,6 +313,7 @@ struct vma_swap_readahead { }; /* linux/mm/workingset.c */ +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages); void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg); void workingset_refault(struct page *page, void *shadow); void workingset_activation(struct page *page); diff --git a/mm/vmscan.c b/mm/vmscan.c index b6d84326bdf2..749d239c62b2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -904,6 +904,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, __delete_from_swap_cache(page, swap); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); + workingset_eviction(page, target_memcg); } else { void (*freepage)(struct page *); void *shadow = NULL; @@ -1884,6 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, list_add(&page->lru, &pages_to_free); } else { nr_moved += nr_pages; + if (PageActive(page)) + workingset_age_nonresident(lruvec, nr_pages); } } diff --git a/mm/workingset.c b/mm/workingset.c index d481ea452eeb..50b7937bab32 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -156,8 +156,8 @@ * * Implementation * - * For each node's file LRU lists, a counter for inactive evictions - * and activations is maintained (node->inactive_age). + * For each node's LRU lists, a counter for inactive evictions and + * activations is maintained (node->nonresident_age). * * On eviction, a snapshot of this counter (along with some bits to * identify the node) is stored in the now empty page cache @@ -213,7 +213,17 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, *workingsetp = workingset; } -static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) +/** + * workingset_age_nonresident - age non-resident entries as LRU ages + * @lruvec: the lruvec that was aged + * @nr_pages: the number of pages to count + * + * As in-memory pages are aged, non-resident pages need to be aged as + * well, in order for the refault distances later on to be comparable + * to the in-memory dimensions. This function allows reclaim and LRU + * operations to drive the non-resident aging along in parallel. + */ +void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages) { /* * Reclaiming a cgroup means reclaiming all its children in a @@ -227,11 +237,8 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) * the root cgroup's, age as well.
*/ do { - struct lruvec *lruvec; - - lruvec = mem_cgroup_lruvec(memcg, pgdat); - atomic_long_inc(&lruvec->inactive_age); - } while (memcg && (memcg = parent_mem_cgroup(memcg))); + atomic_long_add(nr_pages, &lruvec->nonresident_age); + } while ((lruvec = parent_lruvec(lruvec))); } /** @@ -254,12 +261,11 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) VM_BUG_ON_PAGE(page_count(page), page); VM_BUG_ON_PAGE(!PageLocked(page), page); - advance_inactive_age(page_memcg(page), pgdat); - lruvec = mem_cgroup_lruvec(target_memcg, pgdat); + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); /* XXX: target_memcg can be NULL, go through lruvec */ memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); - eviction = atomic_long_read(&lruvec->inactive_age); + eviction = atomic_long_read(&lruvec->nonresident_age); return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); } @@ -309,20 +315,20 @@ void workingset_refault(struct page *page, void *shadow) if (!mem_cgroup_disabled() && !eviction_memcg) goto out; eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); - refault = atomic_long_read(&eviction_lruvec->inactive_age); + refault = atomic_long_read(&eviction_lruvec->nonresident_age); /* * Calculate the refault distance * * The unsigned subtraction here gives an accurate distance - * across inactive_age overflows in most cases. There is a + * across nonresident_age overflows in most cases. There is a * special case: usually, shadow entries have a short lifetime * and are either refaulted or reclaimed along with the inode * before they get too old. But it is not impossible for the - * inactive_age to lap a shadow entry in the field, which can - * then result in a false small refault distance, leading to a - * false activation should this old entry actually refault - * again. However, earlier kernels used to deactivate + * nonresident_age to lap a shadow entry in the field, which + * can then result in a false small refault distance, leading + * to a false activation should this old entry actually + * refault again. However, earlier kernels used to deactivate * unconditionally with *every* reclaim invocation for the * longest time, so the occasional inappropriate activation * leading to pressure on the active list is not a problem. @@ -359,7 +365,7 @@ void workingset_refault(struct page *page, void *shadow) goto out; SetPageActive(page); - advance_inactive_age(memcg, pgdat); + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); /* Page was active prior to eviction */ @@ -382,6 +388,7 @@ void workingset_refault(struct page *page, void *shadow) void workingset_activation(struct page *page) { struct mem_cgroup *memcg; + struct lruvec *lruvec; rcu_read_lock(); /* @@ -394,7 +401,8 @@ void workingset_activation(struct page *page) memcg = page_memcg_rcu(page); if (!mem_cgroup_disabled() && !memcg) goto out; - advance_inactive_age(memcg, page_pgdat(page)); + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); + workingset_age_nonresident(lruvec, hpage_nr_pages(page)); out: rcu_read_unlock(); }
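As a companion to the patch above, here is a toy userspace model of the hierarchical aging loop in workingset_age_nonresident(): aging a cgroup's lruvec also ages all of its ancestors, the same way the kernel walks up via parent_lruvec(). The struct and names are illustrative only.

#include <stdio.h>

/* Simplified lruvec with just the parent link and the age counter. */
struct toy_lruvec {
	struct toy_lruvec *parent;
	unsigned long nonresident_age;
};

/* Mirror of the do/while walk in the patch: age self, then ancestors. */
static void age_nonresident(struct toy_lruvec *lruvec, unsigned long nr_pages)
{
	do {
		lruvec->nonresident_age += nr_pages;
	} while ((lruvec = lruvec->parent));
}

int main(void)
{
	struct toy_lruvec root = { .parent = NULL };
	struct toy_lruvec child = { .parent = &root };

	age_nonresident(&child, 128);	/* reclaim activity in the child */
	printf("child age: %lu, root age: %lu\n",
	       child.nonresident_age, root.nonresident_age);
	return 0;
}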
On Mon, Jun 15, 2020 at 10:41 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Jun 12, 2020 at 12:19:58PM +0900, Joonsoo Kim wrote: > > On Fri, Jun 5, 2020 at 12:06 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote: > > > > On 6/1/20 10:44 PM, Johannes Weiner wrote: > > > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001 > > > > > From: Johannes Weiner <hannes@cmpxchg.org> > > > > > Date: Mon, 1 Jun 2020 14:04:09 -0400 > > > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix > > > > > > > > Looks like the whole series is now in mainline (that was quick!) without this > > > > followup? As such it won't be squashed so perhaps the subject should be more > > > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: workingset: > > > > let cache workingset challenge anon"), possibly with Fixes: tag etc... > > > > > > Yep, I'll send a stand-alone version of this. It was a bit fat to get > > > squashed last minute anyway, and I don't mind having a bit of extra > > > time to quantify the actual impact of this delta. > > > > Hello, Johannes. > > > > Now a week has passed. I hope this patch can be merged as soon as possible, > > since my WIP patchset depends on it. How about trying to merge > > it now? If you don't mind, I could send it on your behalf. > > I didn't realize you directly depended on it, my apologies. No problem. > Do you depend on it code-wise, or do you have a test case that shows > the difference? Code-wise. My patchset also requires activation counting for anon pages, and a conflict could occur in this case. > Here is the latest version again; you can include it in your patch > series until we get it merged. But I think we should be able to merge > it as a follow-up fix into 5.8 still. Yes. I will send it separately for 5.8. And I will send some minor fixes, too. I'd appreciate it if you could review them. Thanks.
diff --git a/mm/workingset.c b/mm/workingset.c index 474186b76ced..e69865739539 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -277,8 +277,8 @@ void workingset_refault(struct page *page, void *shadow) struct mem_cgroup *eviction_memcg; struct lruvec *eviction_lruvec; unsigned long refault_distance; + unsigned long workingset_size; struct pglist_data *pgdat; - unsigned long active_file; struct mem_cgroup *memcg; unsigned long eviction; struct lruvec *lruvec; @@ -310,7 +310,6 @@ void workingset_refault(struct page *page, void *shadow) goto out; eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); refault = atomic_long_read(&eviction_lruvec->inactive_age); - active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); /* * Calculate the refault distance @@ -345,10 +344,18 @@ void workingset_refault(struct page *page, void *shadow) /* * Compare the distance to the existing workingset size. We - * don't act on pages that couldn't stay resident even if all - * the memory was available to the page cache. + * don't activate pages that couldn't stay resident even if + * all the memory was available to the page cache. Whether + * cache can compete with anon or not depends on having swap. */ - if (refault_distance > active_file) + workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); + if (mem_cgroup_get_nr_swap_pages(memcg) > 0) { + workingset_size += lruvec_page_state(eviction_lruvec, + NR_INACTIVE_ANON); + workingset_size += lruvec_page_state(eviction_lruvec, + NR_ACTIVE_ANON); + } + if (refault_distance > workingset_size) goto out; SetPageActive(page);
We activate cache refaults with reuse distances in pages smaller than the size of the total cache. This allows new pages with competitive access frequencies to establish themselves, as well as challenge and potentially displace pages on the active list that have gone cold. However, that assumes that active cache can only replace other active cache in a competition for the hottest memory. This is not a great default assumption. The page cache might be thrashing while there are enough completely cold and unused anonymous pages sitting around that we'd only have to write to swap once to stop all IO from the cache. Activate cache refaults when their reuse distance in pages is smaller than the total userspace workingset, including anonymous pages. Reclaim can still decide how to balance pressure between the two LRUs depending on the IO situation. Rotational drives will prefer avoiding random IO from swap and go harder after cache. But fundamentally, hot cache should be able to compete with anon pages for a place in RAM. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/workingset.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-)
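For illustration, the check that the diff above introduces can be condensed into standalone C. The sizes in the example are assumptions, and the helper name is hypothetical; the real code reads the lruvec counters and mem_cgroup_get_nr_swap_pages() instead of taking parameters.

#include <stdbool.h>
#include <stdio.h>

/*
 * Condensed restatement of the workingset_refault() comparison from
 * the diff above. Sizes are in MB for readability.
 */
static bool activates(unsigned long refault_distance,
		      unsigned long active_file,
		      unsigned long anon, bool have_swap)
{
	unsigned long workingset_size = active_file;

	/* Cache can only compete with anon if there is swap space. */
	if (have_swap)
		workingset_size += anon;

	return refault_distance <= workingset_size;
}

int main(void)
{
	/* 500M of active cache, 1G of anon: a refault with a 1200M
	 * distance is activated with swap (1200 <= 1500) but
	 * dismissed without it (1200 > 500). */
	printf("with swap:    %d\n", activates(1200, 500, 1000, true));
	printf("without swap: %d\n", activates(1200, 500, 1000, false));
	return 0;
}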