Message ID | 20201105131012.82457-1-laoar.shao@gmail.com
---|---
State | New, archived
Series | mm: account lazily freed anon pages in NR_FILE_PAGES
On Thu 05-11-20 21:10:12, Yafang Shao wrote: > The memory utilization (Used / Total) is used to monitor the memory > pressure by us. If it is too high, it means the system may be OOM sooner > or later when swap is off, then we will make adjustment on this system. > > However, this method is broken since MADV_FREE is introduced, because > these lazily free anonymous can be reclaimed under memory pressure while > they are still accounted in NR_ANON_MAPPED. > > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into > LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved > from anon lru list into file lru list. That means > (Inactive(file) + Active(file)) may be much larger than Cached in > /proc/meminfo. That makes our users confused. > > So we'd better account the lazily freed anonoymous pages in > NR_FILE_PAGES as well. Can you simply subtract lazyfree pages in the userspace? I am afraid your patch just makes the situation even more muddy. NR_ANON_MAPPED is really meant to tell how many anonymous pages are mapped. And MADV_FREE pages are mapped until they are freed. NR_*_FILE are reflecting size of LRU lists and NR_FILE_PAGES reflects the number of page cache pages but madvfree pages are not a page cache. They are aged together with file pages but they are not the same thing. Same like shmem pages are page cache that is living on anon LRUs. Confusing? Tricky? Yes, likely. But I do not think we want to bend those counters even further. > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Michal Hocko <mhocko@suse.com> > --- > mm/memcontrol.c | 11 +++++++++-- > mm/rmap.c | 26 ++++++++++++++++++-------- > mm/swap.c | 2 ++ > mm/vmscan.c | 2 ++ > 4 files changed, 31 insertions(+), 10 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 3dcbf24d2227..217a6f10fa8d 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5659,8 +5659,15 @@ static int mem_cgroup_move_account(struct page *page, > > if (PageAnon(page)) { > if (page_mapped(page)) { > - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > + if (!PageSwapBacked(page) && !PageSwapCache(page) && > + !PageUnevictable(page)) { > + __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); > + __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); > + } else { > + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > + } > + > if (PageTransHuge(page)) { > __mod_lruvec_state(from_vec, NR_ANON_THPS, > -nr_pages); > diff --git a/mm/rmap.c b/mm/rmap.c > index 1b84945d655c..690ca7ff2392 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page) > if (unlikely(PageMlocked(page))) > clear_page_mlock(page); > > - if (nr) > - __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > + if (nr) { > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > + !PageSwapCache(page) && !PageUnevictable(page)) > + __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr); > + else > + __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > + } > } > > /** > @@ -1341,12 +1346,17 @@ void page_remove_rmap(struct page *page, bool compound) > if (!atomic_add_negative(-1, &page->_mapcount)) > goto out; > > - /* > - * We use the irq-unsafe __{inc|mod}_zone_page_stat because > - * these counters are not modified in interrupt context, and > - 
* pte lock(a spinlock) is held, which implies preemption disabled. > - */ > - __dec_lruvec_page_state(page, NR_ANON_MAPPED); > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > + !PageSwapCache(page) && !PageUnevictable(page)) { > + __dec_lruvec_page_state(page, NR_FILE_PAGES); > + } else { > + /* > + * We use the irq-unsafe __{inc|mod}_zone_page_stat because > + * these counters are not modified in interrupt context, and > + * pte lock(a spinlock) is held, which implies preemption disabled. > + */ > + __dec_lruvec_page_state(page, NR_ANON_MAPPED); > + } > > if (unlikely(PageMlocked(page))) > clear_page_mlock(page); > diff --git a/mm/swap.c b/mm/swap.c > index 47a47681c86b..340c5276a0f3 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -601,6 +601,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > > del_page_from_lru_list(page, lruvec, > LRU_INACTIVE_ANON + active); > + __mod_lruvec_state(lruvec, NR_ANON_MAPPED, -nr_pages); > ClearPageActive(page); > ClearPageReferenced(page); > /* > @@ -610,6 +611,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > */ > ClearPageSwapBacked(page); > add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); > + __mod_lruvec_state(lruvec, NR_FILE_PAGES, nr_pages); > > __count_vm_events(PGLAZYFREE, nr_pages); > __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 1b8f0e059767..4821124c70f7 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1428,6 +1428,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, > goto keep_locked; > } > > + mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages); > + mod_lruvec_page_state(page, NR_FILE_PAGES, -nr_pages); > count_vm_event(PGLAZYFREED); > count_memcg_page_event(page, PGLAZYFREED); > } else if (!mapping || !__remove_mapping(mapping, page, true, > -- > 2.18.4 >
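To make the objection above concrete: the "Cached" line of /proc/meminfo is derived from NR_FILE_PAGES, so anything folded into that counter is presented to users as page cache. The small userspace check below is not part of the thread; the derivation is paraphrased from memory of fs/proc/meminfo.c (Cached is roughly NR_FILE_PAGES minus the swap cache and block-device buffers) and the arithmetic should be treated as approximate, since the two files are sampled at slightly different times.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the numeric value following "key", or -1 if not found.
 * /proc/meminfo values are in kB, /proc/vmstat values are in pages. */
static long long read_field(const char *file, const char *key)
{
	char line[256];
	long long val = -1;
	FILE *f = fopen(file, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, key, strlen(key)) == 0) {
			sscanf(line + strlen(key), " %lld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long long page_kb = sysconf(_SC_PAGESIZE) / 1024;
	long long nr_file = read_field("/proc/vmstat", "nr_file_pages");
	long long cached  = read_field("/proc/meminfo", "Cached:");
	long long swapc   = read_field("/proc/meminfo", "SwapCached:");
	long long buffers = read_field("/proc/meminfo", "Buffers:");

	if (nr_file < 0 || cached < 0 || swapc < 0 || buffers < 0) {
		fprintf(stderr, "failed to read /proc fields\n");
		return 1;
	}

	/* NR_FILE_PAGES is (roughly) Cached + SwapCached + Buffers, so a
	 * page added to NR_FILE_PAGES shows up to users as page cache. */
	printf("NR_FILE_PAGES:                 %lld kB\n", nr_file * page_kb);
	printf("Cached + SwapCached + Buffers: %lld kB\n",
	       cached + swapc + buffers);
	return 0;
}
```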
On Thu, Nov 5, 2020 at 9:35 PM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 05-11-20 21:10:12, Yafang Shao wrote: > > The memory utilization (Used / Total) is used to monitor the memory > > pressure by us. If it is too high, it means the system may be OOM sooner > > or later when swap is off, then we will make adjustment on this system. > > > > However, this method is broken since MADV_FREE is introduced, because > > these lazily free anonymous can be reclaimed under memory pressure while > > they are still accounted in NR_ANON_MAPPED. > > > > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into > > LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved > > from anon lru list into file lru list. That means > > (Inactive(file) + Active(file)) may be much larger than Cached in > > /proc/meminfo. That makes our users confused. > > > > So we'd better account the lazily freed anonoymous pages in > > NR_FILE_PAGES as well. > > Can you simply subtract lazyfree pages in the userspace? Could you pls. tell me how to subtract lazyfree pages in the userspace? Pls. note that we can't use (pglazyfree - pglazyfreed) because pglazyfreed is only counted in the regular reclaim path while the process exit path is not counted, that means we have to introduce another counter like LazyPage.... > I am afraid your > patch just makes the situation even more muddy. NR_ANON_MAPPED is really > meant to tell how many anonymous pages are mapped. And MADV_FREE pages > are mapped until they are freed. NR_*_FILE are reflecting size of LRU > lists and NR_FILE_PAGES reflects the number of page cache pages but > madvfree pages are not a page cache. They are aged together with file > pages but they are not the same thing. Same like shmem pages are page > cache that is living on anon LRUs. > > Confusing? Tricky? Yes, likely. But I do not think we want to bend those > counters even further. 
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > > Cc: Minchan Kim <minchan@kernel.org> > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > Cc: Michal Hocko <mhocko@suse.com> > > --- > > mm/memcontrol.c | 11 +++++++++-- > > mm/rmap.c | 26 ++++++++++++++++++-------- > > mm/swap.c | 2 ++ > > mm/vmscan.c | 2 ++ > > 4 files changed, 31 insertions(+), 10 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 3dcbf24d2227..217a6f10fa8d 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -5659,8 +5659,15 @@ static int mem_cgroup_move_account(struct page *page, > > > > if (PageAnon(page)) { > > if (page_mapped(page)) { > > - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > > - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > > + if (!PageSwapBacked(page) && !PageSwapCache(page) && > > + !PageUnevictable(page)) { > > + __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); > > + __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); > > + } else { > > + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > > + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > > + } > > + > > if (PageTransHuge(page)) { > > __mod_lruvec_state(from_vec, NR_ANON_THPS, > > -nr_pages); > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 1b84945d655c..690ca7ff2392 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page) > > if (unlikely(PageMlocked(page))) > > clear_page_mlock(page); > > > > - if (nr) > > - __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > > + if (nr) { > > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > > + !PageSwapCache(page) && !PageUnevictable(page)) > > + __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr); > > + else > > + __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > > + } > > } > > > > /** > > @@ -1341,12 +1346,17 @@ void page_remove_rmap(struct page *page, bool compound) > > if (!atomic_add_negative(-1, &page->_mapcount)) > > goto out; > > > > - /* > > - * We use the irq-unsafe __{inc|mod}_zone_page_stat because > > - * these counters are not modified in interrupt context, and > > - * pte lock(a spinlock) is held, which implies preemption disabled. > > - */ > > - __dec_lruvec_page_state(page, NR_ANON_MAPPED); > > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > > + !PageSwapCache(page) && !PageUnevictable(page)) { > > + __dec_lruvec_page_state(page, NR_FILE_PAGES); > > + } else { > > + /* > > + * We use the irq-unsafe __{inc|mod}_zone_page_stat because > > + * these counters are not modified in interrupt context, and > > + * pte lock(a spinlock) is held, which implies preemption disabled. 
> > + */ > > + __dec_lruvec_page_state(page, NR_ANON_MAPPED); > > + } > > > > if (unlikely(PageMlocked(page))) > > clear_page_mlock(page); > > diff --git a/mm/swap.c b/mm/swap.c > > index 47a47681c86b..340c5276a0f3 100644 > > --- a/mm/swap.c > > +++ b/mm/swap.c > > @@ -601,6 +601,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > > > > del_page_from_lru_list(page, lruvec, > > LRU_INACTIVE_ANON + active); > > + __mod_lruvec_state(lruvec, NR_ANON_MAPPED, -nr_pages); > > ClearPageActive(page); > > ClearPageReferenced(page); > > /* > > @@ -610,6 +611,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > > */ > > ClearPageSwapBacked(page); > > add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); > > + __mod_lruvec_state(lruvec, NR_FILE_PAGES, nr_pages); > > > > __count_vm_events(PGLAZYFREE, nr_pages); > > __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 1b8f0e059767..4821124c70f7 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1428,6 +1428,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, > > goto keep_locked; > > } > > > > + mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages); > > + mod_lruvec_page_state(page, NR_FILE_PAGES, -nr_pages); > > count_vm_event(PGLAZYFREED); > > count_memcg_page_event(page, PGLAZYFREED); > > } else if (!mapping || !__remove_mapping(mapping, page, true, > > -- > > 2.18.4 > > > > -- > Michal Hocko > SUSE Labs
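Yafang's point about the event counters can be seen directly from /proc/vmstat. The sketch below is illustrative only (not from the thread): it reads pglazyfree and pglazyfreed and shows why their difference is not a usable "currently lazyfree" figure.

```c
#include <stdio.h>
#include <string.h>

/* Read one counter from /proc/vmstat. These are cumulative event
 * counts, not a snapshot of pages currently in the lazyfree state. */
static unsigned long long vmstat_counter(const char *key)
{
	char name[64];
	unsigned long long val, found = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (strcmp(name, key) == 0) {
			found = val;
			break;
		}
	}
	fclose(f);
	return found;
}

int main(void)
{
	unsigned long long marked    = vmstat_counter("pglazyfree");
	unsigned long long reclaimed = vmstat_counter("pglazyfreed");

	/*
	 * pglazyfree - pglazyfreed is NOT "pages currently lazyfree":
	 *  - pglazyfreed is only bumped when reclaim actually discards a
	 *    lazyfree page; pages torn down on unmap or process exit never
	 *    show up there;
	 *  - pages the task reuses (re-dirties) silently leave the lazyfree
	 *    state without touching either counter.
	 * So the difference only ever grows and can drift arbitrarily far
	 * from the real number of discardable pages.
	 */
	printf("pglazyfree:              %llu\n", marked);
	printf("pglazyfreed:             %llu\n", reclaimed);
	printf("difference (unreliable): %llu\n", marked - reclaimed);
	return 0;
}
```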
On 11/5/20 2:10 PM, Yafang Shao wrote: > The memory utilization (Used / Total) is used to monitor the memory > pressure by us. If it is too high, it means the system may be OOM sooner > or later when swap is off, then we will make adjustment on this system. Hmm I would say that any system looking just at memory utilization (Used / Total) and not looking at file lru size is flawed. There's a reason MemAvailable exists, and does count file lru sizes. > However, this method is broken since MADV_FREE is introduced, because > these lazily free anonymous can be reclaimed under memory pressure while > they are still accounted in NR_ANON_MAPPED. > > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into > LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved > from anon lru list into file lru list. That means > (Inactive(file) + Active(file)) may be much larger than Cached in > /proc/meminfo. That makes our users confused. Yeah the counters are tricky for multiple reasons as Michal said... > So we'd better account the lazily freed anonoymous pages in > NR_FILE_PAGES as well. > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Michal Hocko <mhocko@suse.com> > --- > mm/memcontrol.c | 11 +++++++++-- > mm/rmap.c | 26 ++++++++++++++++++-------- > mm/swap.c | 2 ++ > mm/vmscan.c | 2 ++ > 4 files changed, 31 insertions(+), 10 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 3dcbf24d2227..217a6f10fa8d 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5659,8 +5659,15 @@ static int mem_cgroup_move_account(struct page *page, > > if (PageAnon(page)) { > if (page_mapped(page)) { > - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > + if (!PageSwapBacked(page) && !PageSwapCache(page) && > + !PageUnevictable(page)) { > + __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); > + __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); > + } else { > + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > + } > + > if (PageTransHuge(page)) { > __mod_lruvec_state(from_vec, NR_ANON_THPS, > -nr_pages); > diff --git a/mm/rmap.c b/mm/rmap.c > index 1b84945d655c..690ca7ff2392 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page) > if (unlikely(PageMlocked(page))) > clear_page_mlock(page); > > - if (nr) > - __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > + if (nr) { > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > + !PageSwapCache(page) && !PageUnevictable(page)) > + __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr); > + else > + __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > + } > } > > /** > @@ -1341,12 +1346,17 @@ void page_remove_rmap(struct page *page, bool compound) > if (!atomic_add_negative(-1, &page->_mapcount)) > goto out; > > - /* > - * We use the irq-unsafe __{inc|mod}_zone_page_stat because > - * these counters are not modified in interrupt context, and > - * pte lock(a spinlock) is held, which implies preemption disabled. 
> - */ > - __dec_lruvec_page_state(page, NR_ANON_MAPPED); > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > + !PageSwapCache(page) && !PageUnevictable(page)) { > + __dec_lruvec_page_state(page, NR_FILE_PAGES); > + } else { > + /* > + * We use the irq-unsafe __{inc|mod}_zone_page_stat because > + * these counters are not modified in interrupt context, and > + * pte lock(a spinlock) is held, which implies preemption disabled. > + */ > + __dec_lruvec_page_state(page, NR_ANON_MAPPED); > + } > > if (unlikely(PageMlocked(page))) > clear_page_mlock(page); > diff --git a/mm/swap.c b/mm/swap.c > index 47a47681c86b..340c5276a0f3 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -601,6 +601,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > > del_page_from_lru_list(page, lruvec, > LRU_INACTIVE_ANON + active); > + __mod_lruvec_state(lruvec, NR_ANON_MAPPED, -nr_pages); > ClearPageActive(page); > ClearPageReferenced(page); > /* > @@ -610,6 +611,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > */ > ClearPageSwapBacked(page); > add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); > + __mod_lruvec_state(lruvec, NR_FILE_PAGES, nr_pages); > > __count_vm_events(PGLAZYFREE, nr_pages); > __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 1b8f0e059767..4821124c70f7 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1428,6 +1428,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, > goto keep_locked; > } > > + mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages); > + mod_lruvec_page_state(page, NR_FILE_PAGES, -nr_pages); > count_vm_event(PGLAZYFREED); > count_memcg_page_event(page, PGLAZYFREED); > } else if (!mapping || !__remove_mapping(mapping, page, true, >
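Vlastimil's suggestion amounts to basing the pressure signal on MemAvailable rather than Used/Total. A minimal sketch of such a check follows; it is not from the thread, only an illustration of the idea.

```c
#include <stdio.h>
#include <string.h>

/* Read a kB value from /proc/meminfo. */
static long long meminfo_kb(const char *key)
{
	char line[256];
	long long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, key, strlen(key)) == 0) {
			sscanf(line + strlen(key), " %lld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long long total = meminfo_kb("MemTotal:");
	long long avail = meminfo_kb("MemAvailable:");

	/*
	 * MemAvailable is the kernel's own estimate of how much memory can
	 * be handed out without swapping. It already credits most of the
	 * reclaimable file LRU, which is where MADV_FREE pages live, so a
	 * utilization figure based on it needs no lazyfree-specific
	 * correction, unlike Used/Total.
	 */
	if (total > 0 && avail >= 0)
		printf("estimated utilization: %.1f%%\n",
		       100.0 * (double)(total - avail) / (double)total);
	return 0;
}
```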
On Thu 05-11-20 22:16:10, Yafang Shao wrote:
> On Thu, Nov 5, 2020 at 9:35 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 05-11-20 21:10:12, Yafang Shao wrote:
> > > The memory utilization (Used / Total) is used to monitor the memory
> > > pressure by us. If it is too high, it means the system may be OOM sooner
> > > or later when swap is off, then we will make adjustment on this system.
> > >
> > > However, this method is broken since MADV_FREE is introduced, because
> > > these lazily free anonymous can be reclaimed under memory pressure while
> > > they are still accounted in NR_ANON_MAPPED.
> > >
> > > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into
> > > LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved
> > > from anon lru list into file lru list. That means
> > > (Inactive(file) + Active(file)) may be much larger than Cached in
> > > /proc/meminfo. That makes our users confused.
> > >
> > > So we'd better account the lazily freed anonoymous pages in
> > > NR_FILE_PAGES as well.
> >
> > Can you simply subtract lazyfree pages in the userspace?
>
> Could you pls. tell me how to subtract lazyfree pages in the userspace?
> Pls. note that we can't use (pglazyfree - pglazyfreed) because
> pglazyfreed is only counted in the regular reclaim path while the
> process exit path is not counted, that means we have to introduce
> another counter like LazyPage....

OK, I see your concern. I thought that we do update counters on a
regular unmap. I do not see any reason why we shouldn't. It is indeed
bad that we cannot tell the current number of lazy free pages by any
means. Was this deliberate, Minchan?
On Thu, Nov 05, 2020 at 09:10:12PM +0800, Yafang Shao wrote:
> The memory utilization (Used / Total) is used to monitor the memory
> pressure by us. If it is too high, it means the system may be OOM sooner
> or later when swap is off, then we will make adjustment on this system.
>
> However, this method is broken since MADV_FREE is introduced, because
> these lazily free anonymous can be reclaimed under memory pressure while
> they are still accounted in NR_ANON_MAPPED.
>
> Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into
> LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved
> from anon lru list into file lru list. That means
> (Inactive(file) + Active(file)) may be much larger than Cached in
> /proc/meminfo. That makes our users confused.
>
> So we'd better account the lazily freed anonoymous pages in
> NR_FILE_PAGES as well.

What about the share of pages that have been reused? After all, the
idea behind deferred reclaim is cheap reuse of already allocated and
faulted in pages.

Anywhere between 0% and 100% of MADV_FREEd pages may be dirty and need
swap-out to reclaim. That means even after this patch, your formula
would still have an error margin of 100%.

The tradeoff with saving the reuse fault and relying on the MMU is
that the kernel simply *cannot do* lazy free accounting. Userspace
needs to do it. E.g. if a malloc implementation or similar uses
MADV_FREE, it has to keep track of what is and isn't used and make
those stats available.

If that's not practical, I don't see an alternative to trapping minor
faults upon page reuse, eating the additional TLB flush, and doing the
accounting properly inside the kernel.

> @@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
>         if (unlikely(PageMlocked(page)))
>                 clear_page_mlock(page);
>
> -       if (nr)
> -               __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
> +       if (nr) {
> +               if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) &&
> +                   !PageSwapCache(page) && !PageUnevictable(page))
> +                       __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
> +               else
> +                       __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);

I don't think this would work. The page can be temporarily off-LRU for
compaction, migration, reclaim etc. and then you'd misaccount it here.
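Johannes's userspace-accounting idea might look roughly like the sketch below. The wrapper names (lazy_release(), lazy_reuse(), lazyfree_outstanding()) are invented for illustration and do not correspond to any real allocator API; only madvise(MADV_FREE) itself is real, and it requires Linux 4.5 or later.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical allocator bookkeeping: bytes currently handed back to
 * the kernel with MADV_FREE. */
static atomic_size_t lazyfree_bytes;

/* Called by the allocator when a free span is retired lazily. */
static void lazy_release(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_FREE) == 0)
		atomic_fetch_add(&lazyfree_bytes, len);
	/* On failure (or on kernels without MADV_FREE) the span simply
	 * stays accounted as in-use; nothing is lost. */
}

/* Called by the allocator just before it writes to the span again.
 * Reusing MADV_FREEd memory needs no syscall; the first write cancels
 * the lazy free, so only this userspace bookkeeping changes. */
static void lazy_reuse(size_t len)
{
	atomic_fetch_sub(&lazyfree_bytes, len);
}

/* Exported stat, e.g. via a malloc_info()-style interface, so that a
 * monitor can subtract it from "Used" in userspace. */
size_t lazyfree_outstanding(void)
{
	return atomic_load(&lazyfree_bytes);
}

int main(void)
{
	size_t len = 1 << 20;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	p[0] = 1;			/* fault the first page in */
	lazy_release(p, len);		/* retire the span lazily */
	printf("lazyfree: %zu bytes\n", lazyfree_outstanding());

	lazy_reuse(len);		/* reuse: just write to it again */
	p[0] = 2;
	printf("lazyfree: %zu bytes\n", lazyfree_outstanding());
	return 0;
}
```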
On Thu 05-11-20 16:22:52, Michal Hocko wrote:
> On Thu 05-11-20 22:16:10, Yafang Shao wrote:
> > On Thu, Nov 5, 2020 at 9:35 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 05-11-20 21:10:12, Yafang Shao wrote:
> > > > The memory utilization (Used / Total) is used to monitor the memory
> > > > pressure by us. If it is too high, it means the system may be OOM sooner
> > > > or later when swap is off, then we will make adjustment on this system.
> > > >
> > > > However, this method is broken since MADV_FREE is introduced, because
> > > > these lazily free anonymous can be reclaimed under memory pressure while
> > > > they are still accounted in NR_ANON_MAPPED.
> > > >
> > > > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into
> > > > LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved
> > > > from anon lru list into file lru list. That means
> > > > (Inactive(file) + Active(file)) may be much larger than Cached in
> > > > /proc/meminfo. That makes our users confused.
> > > >
> > > > So we'd better account the lazily freed anonoymous pages in
> > > > NR_FILE_PAGES as well.
> > >
> > > Can you simply subtract lazyfree pages in the userspace?
> >
> > Could you pls. tell me how to subtract lazyfree pages in the userspace?
> > Pls. note that we can't use (pglazyfree - pglazyfreed) because
> > pglazyfreed is only counted in the regular reclaim path while the
> > process exit path is not counted, that means we have to introduce
> > another counter like LazyPage....
>
> OK, I see your concern. I thought that we do update counters on a
> regular unmap. I do not see any reason why we shouldn't. It is indeed
> bad that we cannot tell the current number of lazy free pages by any
> means. Was this deliberate, Minchan?

http://lkml.kernel.org/r/20201105162219.GG744831@cmpxchg.org has explained
this. I completely forgot about the fact that those pages can be reused and
lose their madvise status, and we only learn about that when checking the
page. Thanks, Johannes, for the clarification.
On Thu, Nov 5, 2020 at 11:18 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 11/5/20 2:10 PM, Yafang Shao wrote: > > The memory utilization (Used / Total) is used to monitor the memory > > pressure by us. If it is too high, it means the system may be OOM sooner > > or later when swap is off, then we will make adjustment on this system. > > Hmm I would say that any system looking just at memory utilization (Used / > Total) and not looking at file lru size is flawed. > There's a reason MemAvailable exists, and does count file lru sizes. > Right, the file lru size is counted in MemAvailable. MemAvailable and Used are two different metrics used by us. Both of them are useful, but the Used is not reliable anymore... > > However, this method is broken since MADV_FREE is introduced, because > > these lazily free anonymous can be reclaimed under memory pressure while > > they are still accounted in NR_ANON_MAPPED. > > > > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into > > LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved > > from anon lru list into file lru list. That means > > (Inactive(file) + Active(file)) may be much larger than Cached in > > /proc/meminfo. That makes our users confused. > > Yeah the counters are tricky for multiple reasons as Michal said... > > > So we'd better account the lazily freed anonoymous pages in > > NR_FILE_PAGES as well. > > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > > Cc: Minchan Kim <minchan@kernel.org> > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > Cc: Michal Hocko <mhocko@suse.com> > > --- > > mm/memcontrol.c | 11 +++++++++-- > > mm/rmap.c | 26 ++++++++++++++++++-------- > > mm/swap.c | 2 ++ > > mm/vmscan.c | 2 ++ > > 4 files changed, 31 insertions(+), 10 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 3dcbf24d2227..217a6f10fa8d 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -5659,8 +5659,15 @@ static int mem_cgroup_move_account(struct page *page, > > > > if (PageAnon(page)) { > > if (page_mapped(page)) { > > - __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > > - __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > > + if (!PageSwapBacked(page) && !PageSwapCache(page) && > > + !PageUnevictable(page)) { > > + __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); > > + __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); > > + } else { > > + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); > > + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); > > + } > > + > > if (PageTransHuge(page)) { > > __mod_lruvec_state(from_vec, NR_ANON_THPS, > > -nr_pages); > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 1b84945d655c..690ca7ff2392 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page) > > if (unlikely(PageMlocked(page))) > > clear_page_mlock(page); > > > > - if (nr) > > - __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > > + if (nr) { > > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > > + !PageSwapCache(page) && !PageUnevictable(page)) > > + __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr); > > + else > > + __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); > > + } > > } > > > > /** > > @@ -1341,12 +1346,17 @@ void page_remove_rmap(struct page *page, bool compound) > > if (!atomic_add_negative(-1, &page->_mapcount)) > > goto out; > > > > - /* > > - * We use the irq-unsafe __{inc|mod}_zone_page_stat because > > - * these counters 
are not modified in interrupt context, and > > - * pte lock(a spinlock) is held, which implies preemption disabled. > > - */ > > - __dec_lruvec_page_state(page, NR_ANON_MAPPED); > > + if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) && > > + !PageSwapCache(page) && !PageUnevictable(page)) { > > + __dec_lruvec_page_state(page, NR_FILE_PAGES); > > + } else { > > + /* > > + * We use the irq-unsafe __{inc|mod}_zone_page_stat because > > + * these counters are not modified in interrupt context, and > > + * pte lock(a spinlock) is held, which implies preemption disabled. > > + */ > > + __dec_lruvec_page_state(page, NR_ANON_MAPPED); > > + } > > > > if (unlikely(PageMlocked(page))) > > clear_page_mlock(page); > > diff --git a/mm/swap.c b/mm/swap.c > > index 47a47681c86b..340c5276a0f3 100644 > > --- a/mm/swap.c > > +++ b/mm/swap.c > > @@ -601,6 +601,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > > > > del_page_from_lru_list(page, lruvec, > > LRU_INACTIVE_ANON + active); > > + __mod_lruvec_state(lruvec, NR_ANON_MAPPED, -nr_pages); > > ClearPageActive(page); > > ClearPageReferenced(page); > > /* > > @@ -610,6 +611,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, > > */ > > ClearPageSwapBacked(page); > > add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); > > + __mod_lruvec_state(lruvec, NR_FILE_PAGES, nr_pages); > > > > __count_vm_events(PGLAZYFREE, nr_pages); > > __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 1b8f0e059767..4821124c70f7 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1428,6 +1428,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, > > goto keep_locked; > > } > > > > + mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages); > > + mod_lruvec_page_state(page, NR_FILE_PAGES, -nr_pages); > > count_vm_event(PGLAZYFREED); > > count_memcg_page_event(page, PGLAZYFREED); > > } else if (!mapping || !__remove_mapping(mapping, page, true, > > >
On Fri, Nov 6, 2020 at 12:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Nov 05, 2020 at 09:10:12PM +0800, Yafang Shao wrote:
> > The memory utilization (Used / Total) is used to monitor the memory
> > pressure by us. If it is too high, it means the system may be OOM sooner
> > or later when swap is off, then we will make adjustment on this system.
> >
> > However, this method is broken since MADV_FREE is introduced, because
> > these lazily free anonymous can be reclaimed under memory pressure while
> > they are still accounted in NR_ANON_MAPPED.
> >
> > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into
> > LRU_INACTIVE_FILE list"), these lazily free anonymous pages are moved
> > from anon lru list into file lru list. That means
> > (Inactive(file) + Active(file)) may be much larger than Cached in
> > /proc/meminfo. That makes our users confused.
> >
> > So we'd better account the lazily freed anonoymous pages in
> > NR_FILE_PAGES as well.
>
> What about the share of pages that have been reused? After all, the
> idea behind deferred reclaim is cheap reuse of already allocated and
> faulted in pages.
>

I missed the reuse case. Thanks for the explanation.

> Anywhere between 0% and 100% of MADV_FREEd pages may be dirty and need
> swap-out to reclaim. That means even after this patch, your formula
> would still have an error margin of 100%.
>
> The tradeoff with saving the reuse fault and relying on the MMU is
> that the kernel simply *cannot do* lazy free accounting. Userspace
> needs to do it. E.g. if a malloc implementation or similar uses
> MADV_FREE, it has to keep track of what is and isn't used and make
> those stats available.
>
> If that's not practical,

That is not practical. The process which uses MADV_FREE can keep track
of it, but other processes, such as monitoring tools, have no easy way
to do so. We shouldn't push that burden onto userspace.

> I don't see an alternative to trapping minor
> faults upon page reuse, eating the additional TLB flush, and doing the
> accounting properly inside the kernel.
>

I will try to analyze the details and see whether there is some way to
track it in the kernel.

> > @@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
> >         if (unlikely(PageMlocked(page)))
> >                 clear_page_mlock(page);
> >
> > -       if (nr)
> > -               __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
> > +       if (nr) {
> > +               if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) &&
> > +                   !PageSwapCache(page) && !PageUnevictable(page))
> > +                       __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
> > +               else
> > +                       __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
>
> I don't think this would work. The page can be temporarily off-LRU for
> compaction, migration, reclaim etc. and then you'd misaccount it here.

Right, thanks for the clarification.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3dcbf24d2227..217a6f10fa8d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5659,8 +5659,15 @@ static int mem_cgroup_move_account(struct page *page,
 
         if (PageAnon(page)) {
                 if (page_mapped(page)) {
-                        __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
-                        __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+                        if (!PageSwapBacked(page) && !PageSwapCache(page) &&
+                            !PageUnevictable(page)) {
+                                __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
+                                __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
+                        } else {
+                                __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+                                __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+                        }
+
                         if (PageTransHuge(page)) {
                                 __mod_lruvec_state(from_vec, NR_ANON_THPS,
                                                    -nr_pages);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1b84945d655c..690ca7ff2392 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
         if (unlikely(PageMlocked(page)))
                 clear_page_mlock(page);
 
-        if (nr)
-                __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
+        if (nr) {
+                if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) &&
+                    !PageSwapCache(page) && !PageUnevictable(page))
+                        __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
+                else
+                        __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
+        }
 }
 
 /**
@@ -1341,12 +1346,17 @@ void page_remove_rmap(struct page *page, bool compound)
         if (!atomic_add_negative(-1, &page->_mapcount))
                 goto out;
 
-        /*
-         * We use the irq-unsafe __{inc|mod}_zone_page_stat because
-         * these counters are not modified in interrupt context, and
-         * pte lock(a spinlock) is held, which implies preemption disabled.
-         */
-        __dec_lruvec_page_state(page, NR_ANON_MAPPED);
+        if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) &&
+            !PageSwapCache(page) && !PageUnevictable(page)) {
+                __dec_lruvec_page_state(page, NR_FILE_PAGES);
+        } else {
+                /*
+                 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
+                 * these counters are not modified in interrupt context, and
+                 * pte lock(a spinlock) is held, which implies preemption disabled.
+                 */
+                __dec_lruvec_page_state(page, NR_ANON_MAPPED);
+        }
 
         if (unlikely(PageMlocked(page)))
                 clear_page_mlock(page);
diff --git a/mm/swap.c b/mm/swap.c
index 47a47681c86b..340c5276a0f3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -601,6 +601,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 
                 del_page_from_lru_list(page, lruvec,
                                        LRU_INACTIVE_ANON + active);
+                __mod_lruvec_state(lruvec, NR_ANON_MAPPED, -nr_pages);
                 ClearPageActive(page);
                 ClearPageReferenced(page);
                 /*
@@ -610,6 +611,7 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
                  */
                 ClearPageSwapBacked(page);
                 add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
+                __mod_lruvec_state(lruvec, NR_FILE_PAGES, nr_pages);
 
                 __count_vm_events(PGLAZYFREE, nr_pages);
                 __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1b8f0e059767..4821124c70f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1428,6 +1428,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
                                 goto keep_locked;
                         }
 
+                        mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages);
+                        mod_lruvec_page_state(page, NR_FILE_PAGES, -nr_pages);
                         count_vm_event(PGLAZYFREED);
                         count_memcg_page_event(page, PGLAZYFREED);
                 } else if (!mapping || !__remove_mapping(mapping, page, true,
We use the memory utilization (Used / Total) to monitor memory pressure.
If it is too high, the system may hit OOM sooner or later when swap is
off, and we then make adjustments on that system.

However, this method has been broken since MADV_FREE was introduced,
because lazily freed anonymous pages can be reclaimed under memory
pressure while they are still accounted in NR_ANON_MAPPED.

Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into
LRU_INACTIVE_FILE list"), these lazily freed anonymous pages are moved
from the anon LRU list onto the file LRU list. That means
(Inactive(file) + Active(file)) may be much larger than Cached in
/proc/meminfo, which confuses our users.

So we'd better account the lazily freed anonymous pages in
NR_FILE_PAGES as well.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 11 +++++++++--
 mm/rmap.c       | 26 ++++++++++++++++++--------
 mm/swap.c       |  2 ++
 mm/vmscan.c     |  2 ++
 4 files changed, 31 insertions(+), 10 deletions(-)