
[4/5] mm: workingset: move the stats flush into workingset_test_recent()

Message ID 20230921081057.3440885-5-yosryahmed@google.com (mailing list archive)
State New
Series mm: memcg: subtree stats flushing and thresholds

Commit Message

Yosry Ahmed Sept. 21, 2023, 8:10 a.m. UTC
The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg. In preparation for more scoped
flushing, and for passing the eviction memcg to the flush call, move the
flush call into workingset_test_recent(), where we have a pointer to the
eviction memcg.

The flush call is sleepable and cannot be made in an RCU read section.
Hence, move the RCU read section into workingset_test_recent() as well,
and minimize it: instead of holding the RCU read lock throughout
workingset_test_recent(), only hold it briefly to get a ref on the
eviction memcg. This allows us to make the flush call after we get the
eviction memcg.

As for workingset_refault(), nothing else there appears to be protected
by RCU. The memcg of the faulted folio (which is not necessarily the
same as the eviction memcg) is protected by the folio lock, which is
held at all callsites. Add a VM_BUG_ON_FOLIO() to make sure this doesn't
change from under us.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/workingset.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)
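
The ordering the commit message describes can be condensed into a short
sketch. This is a reading aid under simplifying assumptions, not the patch
itself: pin_eviction_memcg() is a hypothetical helper name, error handling is
reduced to the NULL case, and the surrounding workingset logic is elided (see
the full diff at the bottom of this page).

#include <linux/memcontrol.h>
#include <linux/rcupdate.h>

/*
 * Resolve the memcg ID under the RCU read lock and pin the result with
 * css_get() before dropping the lock, so the sleepable stats flush can
 * run outside the RCU read section.
 */
static struct mem_cgroup *pin_eviction_memcg(int memcgid)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = mem_cgroup_from_id(memcgid);
	if (!memcg) {
		rcu_read_unlock();	/* drop the lock on the bail-out path too */
		return NULL;
	}
	css_get(&memcg->css);	/* pin; the caller pairs this with mem_cgroup_put() */
	rcu_read_unlock();

	/* Sleeping is now allowed, e.g. in the ratelimited stats flush */
	mem_cgroup_flush_stats_ratelimited();
	return memcg;
}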

Comments

kernel test robot Sept. 21, 2023, 11:05 a.m. UTC | #1
Hi Yosry,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309211829.Efuqg8NE-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/workingset.c: In function 'workingset_test_recent':
>> mm/workingset.c:461:32: error: invalid use of undefined type 'struct mem_cgroup'
     461 |         css_get(&eviction_memcg->css);
         |                                ^~


vim +461 mm/workingset.c

   405	
   406	/**
   407	 * workingset_test_recent - tests if the shadow entry is for a folio that was
   408	 * recently evicted. Also fills in @workingset with the value unpacked from
   409	 * shadow.
   410	 * @shadow: the shadow entry to be tested.
   411	 * @file: whether the corresponding folio is from the file lru.
   412	 * @workingset: where the workingset value unpacked from shadow should
   413	 * be stored.
   414	 *
   415	 * Return: true if the shadow is for a recently evicted folio; false otherwise.
   416	 */
   417	bool workingset_test_recent(void *shadow, bool file, bool *workingset)
   418	{
   419		struct mem_cgroup *eviction_memcg;
   420		struct lruvec *eviction_lruvec;
   421		unsigned long refault_distance;
   422		unsigned long workingset_size;
   423		unsigned long refault;
   424		int memcgid;
   425		struct pglist_data *pgdat;
   426		unsigned long eviction;
   427	
   428		rcu_read_lock();
   429	
   430		if (lru_gen_enabled()) {
   431			bool recent = lru_gen_test_recent(shadow, file,
   432							  &eviction_lruvec, &eviction,
   433							  workingset);
   434			rcu_read_unlock();
   435			return recent;
   436		}
   437	
   438		unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
   439		eviction <<= bucket_order;
   440	
   441		/*
   442		 * Look up the memcg associated with the stored ID. It might
   443		 * have been deleted since the folio's eviction.
   444		 *
   445		 * Note that in rare events the ID could have been recycled
   446		 * for a new cgroup that refaults a shared folio. This is
   447		 * impossible to tell from the available data. However, this
   448		 * should be a rare and limited disturbance, and activations
   449		 * are always speculative anyway. Ultimately, it's the aging
   450		 * algorithm's job to shake out the minimum access frequency
   451		 * for the active cache.
   452		 *
   453		 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
   454		 * would be better if the root_mem_cgroup existed in all
   455		 * configurations instead.
   456		 */
   457		eviction_memcg = mem_cgroup_from_id(memcgid);
   458		if (!mem_cgroup_disabled() && !eviction_memcg)
   459			return false;
   460	
 > 461		css_get(&eviction_memcg->css);
   462		rcu_read_unlock();
   463	
   464		/* Flush stats (and potentially sleep) outside the RCU read section */
   465		mem_cgroup_flush_stats_ratelimited();
   466	
   467		eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
   468		refault = atomic_long_read(&eviction_lruvec->nonresident_age);
   469	
   470		/*
   471		 * Calculate the refault distance
   472		 *
   473		 * The unsigned subtraction here gives an accurate distance
   474		 * across nonresident_age overflows in most cases. There is a
   475		 * special case: usually, shadow entries have a short lifetime
   476		 * and are either refaulted or reclaimed along with the inode
   477		 * before they get too old.  But it is not impossible for the
   478		 * nonresident_age to lap a shadow entry in the field, which
   479		 * can then result in a false small refault distance, leading
   480		 * to a false activation should this old entry actually
   481		 * refault again.  However, earlier kernels used to deactivate
   482		 * unconditionally with *every* reclaim invocation for the
   483		 * longest time, so the occasional inappropriate activation
   484		 * leading to pressure on the active list is not a problem.
   485		 */
   486		refault_distance = (refault - eviction) & EVICTION_MASK;
   487	
   488		/*
   489		 * Compare the distance to the existing workingset size. We
   490		 * don't activate pages that couldn't stay resident even if
   491		 * all the memory was available to the workingset. Whether
   492		 * workingset competition needs to consider anon or not depends
   493		 * on having free swap space.
   494		 */
   495		workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
   496		if (!file) {
   497			workingset_size += lruvec_page_state(eviction_lruvec,
   498							     NR_INACTIVE_FILE);
   499		}
   500		if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
   501			workingset_size += lruvec_page_state(eviction_lruvec,
   502							     NR_ACTIVE_ANON);
   503			if (file) {
   504				workingset_size += lruvec_page_state(eviction_lruvec,
   505							     NR_INACTIVE_ANON);
   506			}
   507		}
   508	
   509		mem_cgroup_put(eviction_memcg);
   510		return refault_distance <= workingset_size;
   511	}
   512
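
A standalone illustration of the unsigned-wraparound arithmetic behind
refault_distance in the listing above, as ordinary userspace C rather than
kernel code (the kernel additionally masks with EVICTION_MASK, since the
eviction timestamp is packed into fewer than BITS_PER_LONG bits):

#include <stdio.h>

int main(void)
{
	/* nonresident_age shortly before overflowing at eviction time... */
	unsigned long eviction = ~0UL - 5;
	/* ...and already wrapped past zero by the time of the refault */
	unsigned long refault = 10;

	/* modular arithmetic yields 16, the true number of increments */
	unsigned long distance = refault - eviction;

	printf("refault distance = %lu\n", distance);	/* prints 16 */
	return 0;
}
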
Yosry Ahmed Sept. 21, 2023, 9 p.m. UTC | #2
On Thu, Sep 21, 2023 at 4:06 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Yosry,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
> patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
> config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/config)
> compiler: powerpc-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202309211829.Efuqg8NE-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
>    mm/workingset.c: In function 'workingset_test_recent':
> >> mm/workingset.c:461:32: error: invalid use of undefined type 'struct mem_cgroup'
>      461 |         css_get(&eviction_memcg->css);
>          |                                ^~
>

Ah yes, I cannot dereference the memcg pointer here. I think we want
mem_cgroup_get_from_id() (a getter version of mem_cgroup_from_id()) or
mem_cgroup_get() (a counterpart to mem_cgroup_put()) to have stubs when
!CONFIG_MEMCG. I will make this change in the next version, but I'll
wait for feedback on this version first.
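
For illustration, a minimal sketch of what the suggested mem_cgroup_get()
could look like, modeled on the existing mem_cgroup_put() stubs in
include/linux/memcontrol.h; the name and placement are suggestions only, not
what was actually posted or merged:

#ifdef CONFIG_MEMCG
static inline void mem_cgroup_get(struct mem_cgroup *memcg)
{
	if (memcg)
		css_get(&memcg->css);
}
#else
/*
 * With !CONFIG_MEMCG, struct mem_cgroup is an incomplete type, so the
 * stub must not dereference it; that is exactly the build error above.
 */
static inline void mem_cgroup_get(struct mem_cgroup *memcg)
{
}
#endif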

kernel test robot Sept. 21, 2023, 11:29 p.m. UTC | #3
Hi Yosry,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20230922/202309220704.NF0GCRP2-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230922/202309220704.NF0GCRP2-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309220704.NF0GCRP2-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/workingset.c: In function 'workingset_test_recent':
>> mm/workingset.c:461:25: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
     css_get(&eviction_memcg->css);
                            ^~



Patch

diff --git a/mm/workingset.c b/mm/workingset.c
index b192e44a0e7c..79b338996088 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -425,8 +425,15 @@  bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	struct pglist_data *pgdat;
 	unsigned long eviction;
 
-	if (lru_gen_enabled())
-		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
+	rcu_read_lock();
+
+	if (lru_gen_enabled()) {
+		bool recent = lru_gen_test_recent(shadow, file,
+						  &eviction_lruvec, &eviction,
+						  workingset);
+		rcu_read_unlock();
+		return recent;
+	}
 
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
 	eviction <<= bucket_order;
@@ -451,6 +458,12 @@  bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
 
+	css_get(&eviction_memcg->css);
+	rcu_read_unlock();
+
+	/* Flush stats (and potentially sleep) outside the RCU read section */
+	mem_cgroup_flush_stats_ratelimited();
+
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
@@ -493,6 +506,7 @@  bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		}
 	}
 
+	mem_cgroup_put(eviction_memcg);
 	return refault_distance <= workingset_size;
 }
 
@@ -519,19 +533,16 @@  void workingset_refault(struct folio *folio, void *shadow)
 		return;
 	}
 
-	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
-
-	rcu_read_lock();
-
 	/*
 	 * The activation decision for this folio is made at the level
 	 * where the eviction occurred, as that is where the LRU order
 	 * during folio reclaim is being determined.
 	 *
 	 * However, the cgroup that will own the folio is the one that
-	 * is actually experiencing the refault event.
+	 * is actually experiencing the refault event. Make sure the folio is
+	 * locked to guarantee folio_memcg() stability throughout.
 	 */
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	nr = folio_nr_pages(folio);
 	memcg = folio_memcg(folio);
 	pgdat = folio_pgdat(folio);
@@ -540,7 +551,7 @@  void workingset_refault(struct folio *folio, void *shadow)
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
 	if (!workingset_test_recent(shadow, file, &workingset))
-		goto out;
+		return;
 
 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
@@ -556,8 +567,6 @@  void workingset_refault(struct folio *folio, void *shadow)
 		lru_note_cost_refault(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
 	}
-out:
-	rcu_read_unlock();
 }
 
 /**