| Message ID | 20230921081057.3440885-5-yosryahmed@google.com (mailing list archive) |
|---|---|
| State | New |
| Series | mm: memcg: subtree stats flushing and thresholds |
Hi Yosry,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309211829.Efuqg8NE-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/workingset.c: In function 'workingset_test_recent':
>> mm/workingset.c:461:32: error: invalid use of undefined type 'struct mem_cgroup'
     461 |         css_get(&eviction_memcg->css);
         |                                ^~


vim +461 mm/workingset.c

   405	
   406	/**
   407	 * workingset_test_recent - tests if the shadow entry is for a folio that was
   408	 * recently evicted. Also fills in @workingset with the value unpacked from
   409	 * shadow.
   410	 * @shadow: the shadow entry to be tested.
   411	 * @file: whether the corresponding folio is from the file lru.
   412	 * @workingset: where the workingset value unpacked from shadow should
   413	 *		be stored.
   414	 *
   415	 * Return: true if the shadow is for a recently evicted folio; false otherwise.
   416	 */
   417	bool workingset_test_recent(void *shadow, bool file, bool *workingset)
   418	{
   419		struct mem_cgroup *eviction_memcg;
   420		struct lruvec *eviction_lruvec;
   421		unsigned long refault_distance;
   422		unsigned long workingset_size;
   423		unsigned long refault;
   424		int memcgid;
   425		struct pglist_data *pgdat;
   426		unsigned long eviction;
   427	
   428		rcu_read_lock();
   429	
   430		if (lru_gen_enabled()) {
   431			bool recent = lru_gen_test_recent(shadow, file,
   432					&eviction_lruvec, &eviction,
   433					workingset);
   434			rcu_read_unlock();
   435			return recent;
   436		}
   437	
   438		unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
   439		eviction <<= bucket_order;
   440	
   441		/*
   442		 * Look up the memcg associated with the stored ID. It might
   443		 * have been deleted since the folio's eviction.
   444		 *
   445		 * Note that in rare events the ID could have been recycled
   446		 * for a new cgroup that refaults a shared folio. This is
   447		 * impossible to tell from the available data. However, this
   448		 * should be a rare and limited disturbance, and activations
   449		 * are always speculative anyway. Ultimately, it's the aging
   450		 * algorithm's job to shake out the minimum access frequency
   451		 * for the active cache.
   452		 *
   453		 * XXX: On !CONFIG_MEMCG, this will always return NULL; it
   454		 * would be better if the root_mem_cgroup existed in all
   455		 * configurations instead.
   456		 */
   457		eviction_memcg = mem_cgroup_from_id(memcgid);
   458		if (!mem_cgroup_disabled() && !eviction_memcg)
   459			return false;
   460	
 > 461		css_get(&eviction_memcg->css);
   462		rcu_read_unlock();
   463	
   464		/* Flush stats (and potentially sleep) outside the RCU read section */
   465		mem_cgroup_flush_stats_ratelimited();
   466	
   467		eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
   468		refault = atomic_long_read(&eviction_lruvec->nonresident_age);
   469	
   470		/*
   471		 * Calculate the refault distance
   472		 *
   473		 * The unsigned subtraction here gives an accurate distance
   474		 * across nonresident_age overflows in most cases. There is a
   475		 * special case: usually, shadow entries have a short lifetime
   476		 * and are either refaulted or reclaimed along with the inode
   477		 * before they get too old. But it is not impossible for the
   478		 * nonresident_age to lap a shadow entry in the field, which
   479		 * can then result in a false small refault distance, leading
   480		 * to a false activation should this old entry actually
   481		 * refault again. However, earlier kernels used to deactivate
   482		 * unconditionally with *every* reclaim invocation for the
   483		 * longest time, so the occasional inappropriate activation
   484		 * leading to pressure on the active list is not a problem.
   485		 */
   486		refault_distance = (refault - eviction) & EVICTION_MASK;
   487	
   488		/*
   489		 * Compare the distance to the existing workingset size. We
   490		 * don't activate pages that couldn't stay resident even if
   491		 * all the memory was available to the workingset. Whether
   492		 * workingset competition needs to consider anon or not depends
   493		 * on having free swap space.
   494		 */
   495		workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
   496		if (!file) {
   497			workingset_size += lruvec_page_state(eviction_lruvec,
   498							NR_INACTIVE_FILE);
   499		}
   500		if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
   501			workingset_size += lruvec_page_state(eviction_lruvec,
   502							NR_ACTIVE_ANON);
   503			if (file) {
   504				workingset_size += lruvec_page_state(eviction_lruvec,
   505							NR_INACTIVE_ANON);
   506			}
   507		}
   508	
   509		mem_cgroup_put(eviction_memcg);
   510		return refault_distance <= workingset_size;
   511	}
   512	
On Thu, Sep 21, 2023 at 4:06 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Yosry,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
> patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
> config: powerpc-allnoconfig (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/config)
> compiler: powerpc-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230921/202309211829.Efuqg8NE-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202309211829.Efuqg8NE-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
>    mm/workingset.c: In function 'workingset_test_recent':
> >> mm/workingset.c:461:32: error: invalid use of undefined type 'struct mem_cgroup'
>      461 |         css_get(&eviction_memcg->css);
>          |                                ^~
>

Ah yes, I cannot dereference the memcg pointer here. I think we want
mem_cgroup_get_from_id (a getter version of mem_cgroup_from_id) or
mem_cgroup_get (similar to mem_cgroup_put) to have stubs when
!CONFIG_MEMCG. I will do this change in the next version, but I'll wait
for feedback on this version first.
Hi Yosry,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20230921-161246
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230921081057.3440885-5-yosryahmed%40google.com
patch subject: [PATCH 4/5] mm: workingset: move the stats flush into workingset_test_recent()
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20230922/202309220704.NF0GCRP2-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230922/202309220704.NF0GCRP2-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309220704.NF0GCRP2-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/workingset.c: In function 'workingset_test_recent':
>> mm/workingset.c:461:25: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
      css_get(&eviction_memcg->css);
                             ^~
diff --git a/mm/workingset.c b/mm/workingset.c
index b192e44a0e7c..79b338996088 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -425,8 +425,15 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	struct pglist_data *pgdat;
 	unsigned long eviction;
 
-	if (lru_gen_enabled())
-		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
+	rcu_read_lock();
+
+	if (lru_gen_enabled()) {
+		bool recent = lru_gen_test_recent(shadow, file,
+						  &eviction_lruvec, &eviction,
+						  workingset);
+		rcu_read_unlock();
+		return recent;
+	}
 
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
 	eviction <<= bucket_order;
@@ -451,6 +458,12 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
 
+	css_get(&eviction_memcg->css);
+	rcu_read_unlock();
+
+	/* Flush stats (and potentially sleep) outside the RCU read section */
+	mem_cgroup_flush_stats_ratelimited();
+
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
@@ -493,6 +506,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		}
 	}
 
+	mem_cgroup_put(eviction_memcg);
 	return refault_distance <= workingset_size;
 }
 
@@ -519,19 +533,16 @@ void workingset_refault(struct folio *folio, void *shadow)
 		return;
 	}
 
-	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
-
-	rcu_read_lock();
-
 	/*
 	 * The activation decision for this folio is made at the level
 	 * where the eviction occurred, as that is where the LRU order
 	 * during folio reclaim is being determined.
 	 *
 	 * However, the cgroup that will own the folio is the one that
-	 * is actually experiencing the refault event.
+	 * is actually experiencing the refault event. Make sure the folio is
+	 * locked to guarantee folio_memcg() stability throughout.
 	 */
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	nr = folio_nr_pages(folio);
 	memcg = folio_memcg(folio);
 	pgdat = folio_pgdat(folio);
@@ -540,7 +551,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
 	if (!workingset_test_recent(shadow, file, &workingset))
-		goto out;
+		return;
 
 	folio_set_active(folio);
 	workingset_age_nonresident(lruvec, nr);
@@ -556,8 +567,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 		lru_note_cost_refault(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
 	}
-out:
-	rcu_read_unlock();
 }
 
 /**
The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg. In preparation for more scoped
flushing and for passing the eviction memcg to the flush call, move the
call into workingset_test_recent(), where we have a pointer to the
eviction memcg.

The flush call is sleepable and cannot be made in an RCU read section.
Hence, minimize the RCU read section by also moving it into
workingset_test_recent(). Furthermore, instead of holding the RCU read
lock throughout workingset_test_recent(), only hold it briefly to get a
ref on the eviction memcg. This allows us to make the flush call after
we get the eviction memcg.

As for workingset_refault(), nothing else there appears to be protected
by RCU. The memcg of the faulted folio (which is not necessarily the
same as the eviction memcg) is protected by the folio lock, which is
held from all callsites. Add a VM_BUG_ON() to make sure this doesn't
change from under us.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/workingset.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)