memcg: async flush memcg stats from perf sensitive codepaths

Message ID 20220226002412.113819-1-shakeelb@google.com (mailing list archive)
State New
Series memcg: async flush memcg stats from perf sensitive codepaths

Commit Message

Shakeel Butt Feb. 26, 2022, 12:24 a.m. UTC
Daniel Dao has reported [1] a regression on workloads that may trigger
a lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than the batch value within a short
amount of time. Since the rstat flush can happen in performance critical
codepaths like page faults, such workloads can suffer greatly.

The easiest fix for now is to have the performance critical codepaths
trigger the rstat flush asynchronously. This patch converts the refault
codepath to use the async rstat flush. In addition, this patch has
preemptively converted mem_cgroup_wb_stats and shrink_node to also use
the async rstat flush as they may see similar performance regressions.

Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Reported-by: Daniel Dao <dqminh@cloudflare.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 10 +++++++++-
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  2 +-
 4 files changed, 12 insertions(+), 3 deletions(-)

Comments

Andrew Morton Feb. 26, 2022, 12:58 a.m. UTC | #1
On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <shakeelb@google.com> wrote:

> Daniel Dao has reported [1] a regression on workloads that may trigger
> a lot of refaults (anon and file). The underlying issue is that flushing
> rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> MEMCG_BATCH) stat updates, it seems like there are workloads which
> genuinely do stat updates larger than the batch value within a short
> amount of time. Since the rstat flush can happen in performance critical
> codepaths like page faults, such workloads can suffer greatly.
> 
> The easiest fix for now is to have the performance critical codepaths
> trigger the rstat flush asynchronously. This patch converts the refault
> codepath to use the async rstat flush. In addition, this patch has
> preemptively converted mem_cgroup_wb_stats and shrink_node to also use
> the async rstat flush as they may see similar performance regressions.

Gee we do this trick a lot and gee I don't like it :(

a) if we're doing too much work then we're doing too much work. 
   Punting that work over to a different CPU or thread doesn't alter
   that - it in fact adds more work.

b) there's an assumption here that the flusher is able to keep up
   with the producer.  What happens if that isn't the case?  Do we
   simply wind up the deferred items until the system goes oom?

   What happens if there's a producer running on each CPU?  Can the
   flushers keep up?

   Pathologically, what happens if the producer is running
   task_is_realtime() on a single-CPU system?  Or if there's a
   task_is_realtime() producer running on every CPU?  The flusher never
   gets to run and we're dead?


An obvious fix is to limit the permissible amount of windup (to what?)
and at some point, do the flushing synchronously anyway.

Or we just don't do any of this at all and put up with the cost of the
current code.  I mean, this "fix" is kind of fake anyway, isn't it? 
Pushing the 4-10ms delay onto a different CPU will just disrupt
something else which wanted to run on that CPU.  The overall effect is
to hide the impact from one particular testcase, but is the benefit
really a real one?
Andrew Morton Feb. 26, 2022, 1:20 a.m. UTC | #2
On Fri, 25 Feb 2022 16:58:42 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <shakeelb@google.com> wrote:
> 
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> > 
> > The easiest fix for now is to have the performance critical codepaths
> > trigger the rstat flush asynchronously. This patch converts the refault
> > codepath to use the async rstat flush. In addition, this patch has
> > preemptively converted mem_cgroup_wb_stats and shrink_node to also use
> > the async rstat flush as they may see similar performance regressions.
> 
> Gee we do this trick a lot and gee I don't like it :(
> 
> a) if we're doing too much work then we're doing too much work. 
>    Punting that work over to a different CPU or thread doesn't alter
>    that - it in fact adds more work.
> 
> b) there's an assumption here that the flusher is able to keep up
>    with the producer.  What happens if that isn't the case?  Do we
>    simply wind up the deferred items until the system goes oom?
> 
>    What happens if there's a producer running on each CPU?  Can the
>    flushers keep up?
> 
>    Pathologically, what happens if the producer is running
>    task_is_realtime() on a single-CPU system?  Or if there's a
>    task_is_realtime() producer running on every CPU?  The flusher never
>    gets to run and we're dead?

Not some theoretical thing, btw.  See how __read_swap_cache_async()
just got its sins exposed by real-time tasks:
https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
Shakeel Butt Feb. 26, 2022, 1:42 a.m. UTC | #3
On Fri, Feb 25, 2022 at 4:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <shakeelb@google.com> wrote:
>
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is to have the performance critical codepaths
> > trigger the rstat flush asynchronously. This patch converts the refault
> > codepath to use the async rstat flush. In addition, this patch has
> > preemptively converted mem_cgroup_wb_stats and shrink_node to also use
> > the async rstat flush as they may see similar performance regressions.
>
> Gee we do this trick a lot and gee I don't like it :(
>
> a) if we're doing too much work then we're doing too much work.
>    Punting that work over to a different CPU or thread doesn't alter
>    that - it in fact adds more work.
>

Please note that we already have the async worker running every 2
seconds. Normally no consumer would need to flush the stats but if
there were too many stat updates from producers in a short amount of
time then one of the consumers will have to pay the price of the
flush.
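
For reference, that periodic flush is the existing delayed work in
mm/memcontrol.c, roughly as sketched below (paraphrased from memory of
the current code; the exact period and re-arming details may differ):

  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
          __mem_cgroup_flush_stats();
          /* re-arm so the stats get folded at least every ~2 seconds */
          queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL * HZ);
  }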

We have two types of consumers, i.e. performance critical (e.g.
refault) and non-performance critical (e.g. memory.stat or numa_stat
readers). I think we can let the performance critical consumers skip
the synchronous flush and let the async worker do the work instead,
for performance reasons.

> b) there's an assumption here that the flusher is able to keep up
>    with the producer.  What happens if that isn't the case?  Do we
>    simply wind up the deferred items until the system goes oom?
>

Without a consumer, nothing bad can happen even if the flusher is slow
(or has too much work) or too many stats are being updated by many
producers.

With a couple of consumers, in the current kernel, one of them may
have to pay the cost of a sync flush. With this patch, we will have two
types of consumers: first, those that are ok paying the price of a sync
flush to get accurate stats, and second, those that are ok with
out-of-sync stats bounded by 2 seconds (assuming the flusher runs every
2 seconds).

BTW there is no concern about the system going OOM due to reading
slightly older stats.

>    What happens if there's a producer running on each CPU?  Can the
>    flushers keep up?
>
>    Pathologically, what happens if the producer is running
>    task_is_realtime() on a single-CPU system?  Or if there's a
>    task_is_realtime() producer running on every CPU?  The flusher never
>    gets to run and we're dead?
>

I think it would have to be a mix of (stat) producers and (stat)
consumers hogging the CPUs forever, and no, we will not be dead. At
worst the consumers might make some wrong decisions due to consuming
older stats.

One can argue that since one consumer is reclaim code, some reclaim
heuristic can get messed up due to older stats. Yes, that can happen.

>
> An obvious fix is to limit the permissible amount of windup (to what?)
> and at some point, do the flushing synchronously anyway.
>

That is what we are currently doing. The limit being nr_cpus * MEMCG_BATCH.
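
Roughly, the coupling between the update side and the flush side looks
like the sketch below (simplified from memory; the constant is actually
called MEMCG_CHARGE_BATCH in the code and the helper names may differ
slightly):

  /* producer side: every stat update bumps a per-cpu counter */
  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  {
          unsigned int x;

          cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

          x = __this_cpu_add_return(stats_updates, abs(val));
          if (x > MEMCG_CHARGE_BATCH) {
                  atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
                  __this_cpu_write(stats_updates, 0);
          }
  }

  /* consumer side: only pay for a flush once enough updates piled up */
  void mem_cgroup_flush_stats(void)
  {
          if (atomic_read(&stats_flush_threshold) > num_online_cpus())
                  __mem_cgroup_flush_stats();
  }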

> Or we just don't do any of this at all and put up with the cost of the
> current code.  I mean, this "fix" is kind of fake anyway, isn't it?
> Pushing the 4-10ms delay onto a different CPU will just disrupt
> something else which wanted to run on that CPU.  The overall effect is
> to hide the impact from one particular testcase, but is the benefit
> really a real one?
>

Yes, the right fix would be to optimize the flushing code (but that
would require more work/time). However I still think letting
performance critical code paths skip the sync flush would be good
in general. So, if the current patch is not to your liking, we can
remove mem_cgroup_flush_stats() from workingset_refault().
kernel test robot Feb. 26, 2022, 2:32 a.m. UTC | #4
Hi Shakeel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on linus/master v5.17-rc5 next-20220224]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
base:   https://github.com/hnaz/linux-mm master
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220226/202202261045.FAsMZlyD-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/5dffeb24975bc4cbe99af650d833eb0183a4882f
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
        git checkout 5dffeb24975bc4cbe99af650d833eb0183a4882f
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/vmscan.c: In function 'shrink_node':
>> mm/vmscan.c:3191:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async'; did you mean 'mem_cgroup_flush_stats'? [-Werror=implicit-function-declaration]
    3191 |  mem_cgroup_flush_stats_async();
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
         |  mem_cgroup_flush_stats
   cc1: some warnings being treated as errors
--
   mm/workingset.c: In function 'workingset_refault':
>> mm/workingset.c:358:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async'; did you mean 'mem_cgroup_flush_stats'? [-Werror=implicit-function-declaration]
     358 |  mem_cgroup_flush_stats_async();
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
         |  mem_cgroup_flush_stats
   cc1: some warnings being treated as errors


vim +3191 mm/vmscan.c

  3175	
  3176	static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
  3177	{
  3178		struct reclaim_state *reclaim_state = current->reclaim_state;
  3179		unsigned long nr_reclaimed, nr_scanned;
  3180		struct lruvec *target_lruvec;
  3181		bool reclaimable = false;
  3182		unsigned long file;
  3183	
  3184		target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
  3185	
  3186	again:
  3187		/*
  3188		 * Flush the memory cgroup stats, so that we read accurate per-memcg
  3189		 * lruvec stats for heuristics.
  3190		 */
> 3191		mem_cgroup_flush_stats_async();
  3192	
  3193		memset(&sc->nr, 0, sizeof(sc->nr));
  3194	
  3195		nr_reclaimed = sc->nr_reclaimed;
  3196		nr_scanned = sc->nr_scanned;
  3197	
  3198		/*
  3199		 * Determine the scan balance between anon and file LRUs.
  3200		 */
  3201		spin_lock_irq(&target_lruvec->lru_lock);
  3202		sc->anon_cost = target_lruvec->anon_cost;
  3203		sc->file_cost = target_lruvec->file_cost;
  3204		spin_unlock_irq(&target_lruvec->lru_lock);
  3205	
  3206		/*
  3207		 * Target desirable inactive:active list ratios for the anon
  3208		 * and file LRU lists.
  3209		 */
  3210		if (!sc->force_deactivate) {
  3211			unsigned long refaults;
  3212	
  3213			refaults = lruvec_page_state(target_lruvec,
  3214					WORKINGSET_ACTIVATE_ANON);
  3215			if (refaults != target_lruvec->refaults[0] ||
  3216				inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
  3217				sc->may_deactivate |= DEACTIVATE_ANON;
  3218			else
  3219				sc->may_deactivate &= ~DEACTIVATE_ANON;
  3220	
  3221			/*
  3222			 * When refaults are being observed, it means a new
  3223			 * workingset is being established. Deactivate to get
  3224			 * rid of any stale active pages quickly.
  3225			 */
  3226			refaults = lruvec_page_state(target_lruvec,
  3227					WORKINGSET_ACTIVATE_FILE);
  3228			if (refaults != target_lruvec->refaults[1] ||
  3229			    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
  3230				sc->may_deactivate |= DEACTIVATE_FILE;
  3231			else
  3232				sc->may_deactivate &= ~DEACTIVATE_FILE;
  3233		} else
  3234			sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
  3235	
  3236		/*
  3237		 * If we have plenty of inactive file pages that aren't
  3238		 * thrashing, try to reclaim those first before touching
  3239		 * anonymous pages.
  3240		 */
  3241		file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
  3242		if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
  3243			sc->cache_trim_mode = 1;
  3244		else
  3245			sc->cache_trim_mode = 0;
  3246	
  3247		/*
  3248		 * Prevent the reclaimer from falling into the cache trap: as
  3249		 * cache pages start out inactive, every cache fault will tip
  3250		 * the scan balance towards the file LRU.  And as the file LRU
  3251		 * shrinks, so does the window for rotation from references.
  3252		 * This means we have a runaway feedback loop where a tiny
  3253		 * thrashing file LRU becomes infinitely more attractive than
  3254		 * anon pages.  Try to detect this based on file LRU size.
  3255		 */
  3256		if (!cgroup_reclaim(sc)) {
  3257			unsigned long total_high_wmark = 0;
  3258			unsigned long free, anon;
  3259			int z;
  3260	
  3261			free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
  3262			file = node_page_state(pgdat, NR_ACTIVE_FILE) +
  3263				   node_page_state(pgdat, NR_INACTIVE_FILE);
  3264	
  3265			for (z = 0; z < MAX_NR_ZONES; z++) {
  3266				struct zone *zone = &pgdat->node_zones[z];
  3267				if (!managed_zone(zone))
  3268					continue;
  3269	
  3270				total_high_wmark += high_wmark_pages(zone);
  3271			}
  3272	
  3273			/*
  3274			 * Consider anon: if that's low too, this isn't a
  3275			 * runaway file reclaim problem, but rather just
  3276			 * extreme pressure. Reclaim as per usual then.
  3277			 */
  3278			anon = node_page_state(pgdat, NR_INACTIVE_ANON);
  3279	
  3280			sc->file_is_tiny =
  3281				file + free <= total_high_wmark &&
  3282				!(sc->may_deactivate & DEACTIVATE_ANON) &&
  3283				anon >> sc->priority;
  3284		}
  3285	
  3286		shrink_node_memcgs(pgdat, sc);
  3287	
  3288		if (reclaim_state) {
  3289			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
  3290			reclaim_state->reclaimed_slab = 0;
  3291		}
  3292	
  3293		/* Record the subtree's reclaim efficiency */
  3294		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
  3295			   sc->nr_scanned - nr_scanned,
  3296			   sc->nr_reclaimed - nr_reclaimed);
  3297	
  3298		if (sc->nr_reclaimed - nr_reclaimed)
  3299			reclaimable = true;
  3300	
  3301		if (current_is_kswapd()) {
  3302			/*
  3303			 * If reclaim is isolating dirty pages under writeback,
  3304			 * it implies that the long-lived page allocation rate
  3305			 * is exceeding the page laundering rate. Either the
  3306			 * global limits are not being effective at throttling
  3307			 * processes due to the page distribution throughout
  3308			 * zones or there is heavy usage of a slow backing
  3309			 * device. The only option is to throttle from reclaim
  3310			 * context which is not ideal as there is no guarantee
  3311			 * the dirtying process is throttled in the same way
  3312			 * balance_dirty_pages() manages.
  3313			 *
  3314			 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
  3315			 * count the number of pages under pages flagged for
  3316			 * immediate reclaim and stall if any are encountered
  3317			 * in the nr_immediate check below.
  3318			 */
  3319			if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
  3320				set_bit(PGDAT_WRITEBACK, &pgdat->flags);
  3321	
  3322			/* Allow kswapd to start writing pages during reclaim.*/
  3323			if (sc->nr.unqueued_dirty == sc->nr.file_taken)
  3324				set_bit(PGDAT_DIRTY, &pgdat->flags);
  3325	
  3326			/*
  3327			 * If kswapd scans pages marked for immediate
  3328			 * reclaim and under writeback (nr_immediate), it
  3329			 * implies that pages are cycling through the LRU
  3330			 * faster than they are written so forcibly stall
  3331			 * until some pages complete writeback.
  3332			 */
  3333			if (sc->nr.immediate)
  3334				reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
  3335		}
  3336	
  3337		/*
  3338		 * Tag a node/memcg as congested if all the dirty pages were marked
  3339		 * for writeback and immediate reclaim (counted in nr.congested).
  3340		 *
  3341		 * Legacy memcg will stall in page writeback so avoid forcibly
  3342		 * stalling in reclaim_throttle().
  3343		 */
  3344		if ((current_is_kswapd() ||
  3345		     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
  3346		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
  3347			set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
  3348	
  3349		/*
  3350		 * Stall direct reclaim for IO completions if the lruvec is
  3351		 * node is congested. Allow kswapd to continue until it
  3352		 * starts encountering unqueued dirty pages or cycling through
  3353		 * the LRU too quickly.
  3354		 */
  3355		if (!current_is_kswapd() && current_may_throttle() &&
  3356		    !sc->hibernation_mode &&
  3357		    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
  3358			reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
  3359	
  3360		if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
  3361					    sc))
  3362			goto again;
  3363	
  3364		/*
  3365		 * Kswapd gives up on balancing particular nodes after too
  3366		 * many failures to reclaim anything from them and goes to
  3367		 * sleep. On reclaim progress, reset the failure counter. A
  3368		 * successful direct reclaim run will revive a dormant kswapd.
  3369		 */
  3370		if (reclaimable)
  3371			pgdat->kswapd_failures = 0;
  3372	}
  3373	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot Feb. 26, 2022, 12:43 p.m. UTC | #5
Hi Shakeel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on linus/master v5.17-rc5 next-20220225]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
base:   https://github.com/hnaz/linux-mm master
config: hexagon-randconfig-r023-20220226 (https://download.01.org/0day-ci/archive/20220226/202202262037.x6q4f1Yk-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project d271fc04d5b97b12e6b797c6067d3c96a8d7470e)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/5dffeb24975bc4cbe99af650d833eb0183a4882f
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444
        git checkout 5dffeb24975bc4cbe99af650d833eb0183a4882f
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> mm/vmscan.c:3191:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async' [-Werror,-Wimplicit-function-declaration]
           mem_cgroup_flush_stats_async();
           ^
   mm/vmscan.c:3191:2: note: did you mean 'mem_cgroup_flush_stats'?
   include/linux/memcontrol.h:1441:20: note: 'mem_cgroup_flush_stats' declared here
   static inline void mem_cgroup_flush_stats(void)
                      ^
   1 error generated.
--
>> mm/workingset.c:358:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async' [-Werror,-Wimplicit-function-declaration]
           mem_cgroup_flush_stats_async();
           ^
   mm/workingset.c:358:2: note: did you mean 'mem_cgroup_flush_stats'?
   include/linux/memcontrol.h:1441:20: note: 'mem_cgroup_flush_stats' declared here
   static inline void mem_cgroup_flush_stats(void)
                      ^
   1 error generated.


vim +/mem_cgroup_flush_stats_async +3191 mm/vmscan.c

  3175	
  3176	static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
  3177	{
  3178		struct reclaim_state *reclaim_state = current->reclaim_state;
  3179		unsigned long nr_reclaimed, nr_scanned;
  3180		struct lruvec *target_lruvec;
  3181		bool reclaimable = false;
  3182		unsigned long file;
  3183	
  3184		target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
  3185	
  3186	again:
  3187		/*
  3188		 * Flush the memory cgroup stats, so that we read accurate per-memcg
  3189		 * lruvec stats for heuristics.
  3190		 */
> 3191		mem_cgroup_flush_stats_async();
  3192	
  3193		memset(&sc->nr, 0, sizeof(sc->nr));
  3194	
  3195		nr_reclaimed = sc->nr_reclaimed;
  3196		nr_scanned = sc->nr_scanned;
  3197	
  3198		/*
  3199		 * Determine the scan balance between anon and file LRUs.
  3200		 */
  3201		spin_lock_irq(&target_lruvec->lru_lock);
  3202		sc->anon_cost = target_lruvec->anon_cost;
  3203		sc->file_cost = target_lruvec->file_cost;
  3204		spin_unlock_irq(&target_lruvec->lru_lock);
  3205	
  3206		/*
  3207		 * Target desirable inactive:active list ratios for the anon
  3208		 * and file LRU lists.
  3209		 */
  3210		if (!sc->force_deactivate) {
  3211			unsigned long refaults;
  3212	
  3213			refaults = lruvec_page_state(target_lruvec,
  3214					WORKINGSET_ACTIVATE_ANON);
  3215			if (refaults != target_lruvec->refaults[0] ||
  3216				inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
  3217				sc->may_deactivate |= DEACTIVATE_ANON;
  3218			else
  3219				sc->may_deactivate &= ~DEACTIVATE_ANON;
  3220	
  3221			/*
  3222			 * When refaults are being observed, it means a new
  3223			 * workingset is being established. Deactivate to get
  3224			 * rid of any stale active pages quickly.
  3225			 */
  3226			refaults = lruvec_page_state(target_lruvec,
  3227					WORKINGSET_ACTIVATE_FILE);
  3228			if (refaults != target_lruvec->refaults[1] ||
  3229			    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
  3230				sc->may_deactivate |= DEACTIVATE_FILE;
  3231			else
  3232				sc->may_deactivate &= ~DEACTIVATE_FILE;
  3233		} else
  3234			sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
  3235	
  3236		/*
  3237		 * If we have plenty of inactive file pages that aren't
  3238		 * thrashing, try to reclaim those first before touching
  3239		 * anonymous pages.
  3240		 */
  3241		file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
  3242		if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
  3243			sc->cache_trim_mode = 1;
  3244		else
  3245			sc->cache_trim_mode = 0;
  3246	
  3247		/*
  3248		 * Prevent the reclaimer from falling into the cache trap: as
  3249		 * cache pages start out inactive, every cache fault will tip
  3250		 * the scan balance towards the file LRU.  And as the file LRU
  3251		 * shrinks, so does the window for rotation from references.
  3252		 * This means we have a runaway feedback loop where a tiny
  3253		 * thrashing file LRU becomes infinitely more attractive than
  3254		 * anon pages.  Try to detect this based on file LRU size.
  3255		 */
  3256		if (!cgroup_reclaim(sc)) {
  3257			unsigned long total_high_wmark = 0;
  3258			unsigned long free, anon;
  3259			int z;
  3260	
  3261			free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
  3262			file = node_page_state(pgdat, NR_ACTIVE_FILE) +
  3263				   node_page_state(pgdat, NR_INACTIVE_FILE);
  3264	
  3265			for (z = 0; z < MAX_NR_ZONES; z++) {
  3266				struct zone *zone = &pgdat->node_zones[z];
  3267				if (!managed_zone(zone))
  3268					continue;
  3269	
  3270				total_high_wmark += high_wmark_pages(zone);
  3271			}
  3272	
  3273			/*
  3274			 * Consider anon: if that's low too, this isn't a
  3275			 * runaway file reclaim problem, but rather just
  3276			 * extreme pressure. Reclaim as per usual then.
  3277			 */
  3278			anon = node_page_state(pgdat, NR_INACTIVE_ANON);
  3279	
  3280			sc->file_is_tiny =
  3281				file + free <= total_high_wmark &&
  3282				!(sc->may_deactivate & DEACTIVATE_ANON) &&
  3283				anon >> sc->priority;
  3284		}
  3285	
  3286		shrink_node_memcgs(pgdat, sc);
  3287	
  3288		if (reclaim_state) {
  3289			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
  3290			reclaim_state->reclaimed_slab = 0;
  3291		}
  3292	
  3293		/* Record the subtree's reclaim efficiency */
  3294		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
  3295			   sc->nr_scanned - nr_scanned,
  3296			   sc->nr_reclaimed - nr_reclaimed);
  3297	
  3298		if (sc->nr_reclaimed - nr_reclaimed)
  3299			reclaimable = true;
  3300	
  3301		if (current_is_kswapd()) {
  3302			/*
  3303			 * If reclaim is isolating dirty pages under writeback,
  3304			 * it implies that the long-lived page allocation rate
  3305			 * is exceeding the page laundering rate. Either the
  3306			 * global limits are not being effective at throttling
  3307			 * processes due to the page distribution throughout
  3308			 * zones or there is heavy usage of a slow backing
  3309			 * device. The only option is to throttle from reclaim
  3310			 * context which is not ideal as there is no guarantee
  3311			 * the dirtying process is throttled in the same way
  3312			 * balance_dirty_pages() manages.
  3313			 *
  3314			 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
  3315			 * count the number of pages under pages flagged for
  3316			 * immediate reclaim and stall if any are encountered
  3317			 * in the nr_immediate check below.
  3318			 */
  3319			if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
  3320				set_bit(PGDAT_WRITEBACK, &pgdat->flags);
  3321	
  3322			/* Allow kswapd to start writing pages during reclaim.*/
  3323			if (sc->nr.unqueued_dirty == sc->nr.file_taken)
  3324				set_bit(PGDAT_DIRTY, &pgdat->flags);
  3325	
  3326			/*
  3327			 * If kswapd scans pages marked for immediate
  3328			 * reclaim and under writeback (nr_immediate), it
  3329			 * implies that pages are cycling through the LRU
  3330			 * faster than they are written so forcibly stall
  3331			 * until some pages complete writeback.
  3332			 */
  3333			if (sc->nr.immediate)
  3334				reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
  3335		}
  3336	
  3337		/*
  3338		 * Tag a node/memcg as congested if all the dirty pages were marked
  3339		 * for writeback and immediate reclaim (counted in nr.congested).
  3340		 *
  3341		 * Legacy memcg will stall in page writeback so avoid forcibly
  3342		 * stalling in reclaim_throttle().
  3343		 */
  3344		if ((current_is_kswapd() ||
  3345		     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
  3346		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
  3347			set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
  3348	
  3349		/*
  3350		 * Stall direct reclaim for IO completions if the lruvec is
  3351		 * node is congested. Allow kswapd to continue until it
  3352		 * starts encountering unqueued dirty pages or cycling through
  3353		 * the LRU too quickly.
  3354		 */
  3355		if (!current_is_kswapd() && current_may_throttle() &&
  3356		    !sc->hibernation_mode &&
  3357		    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
  3358			reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
  3359	
  3360		if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
  3361					    sc))
  3362			goto again;
  3363	
  3364		/*
  3365		 * Kswapd gives up on balancing particular nodes after too
  3366		 * many failures to reclaim anything from them and goes to
  3367		 * sleep. On reclaim progress, reset the failure counter. A
  3368		 * successful direct reclaim run will revive a dormant kswapd.
  3369		 */
  3370		if (reclaimable)
  3371			pgdat->kswapd_failures = 0;
  3372	}
  3373	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Michal Koutný Feb. 28, 2022, 6:46 p.m. UTC | #6
On Fri, Feb 25, 2022 at 05:42:57PM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> Yes, the right fix would be to optimize the flushing code (but that
> would require more work/time). However I still think letting
> performance critical code paths skip the sync flush would be good
> in general. So, if the current patch is not to your liking, we can
> remove mem_cgroup_flush_stats() from workingset_refault().

What about flushing just the subtree of the memcg where the refault
happens?
It doesn't reduce the overall work and there's still the full-tree
cgroup_rstat_lock, but it should make the chunks of work smaller and
the durations more regular.


Michal
Shakeel Butt Feb. 28, 2022, 10:46 p.m. UTC | #7
On Mon, Feb 28, 2022 at 10:46 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Feb 25, 2022 at 05:42:57PM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> > Yes, the right fix would be to optimize the flushing code (but that
> > would require more work/time). However I still think letting
> > performance critical code paths skip the sync flush would be good
> > in general. So, if the current patch is not to your liking, we can
> > remove mem_cgroup_flush_stats() from workingset_refault().
>
> What about flushing just the subtree of the memcg where the refault
> happens?
> It doesn't reduce the overall work and there's still the full-tree
> cgroup_rstat_lock, but it should make the chunks of work smaller and
> the durations more regular.
>

We can try that and I will send a patch to Ivan and Daniel to try on
their workload to see the real impact of targeted memcg flushing.
However I am not very optimistic about it.
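
A minimal sketch of what such a targeted flush could look like (purely
illustrative, not a posted patch; the function name is made up here and
how to account a partial flush against the global threshold is left
open):

  void mem_cgroup_flush_stats_subtree(struct mem_cgroup *memcg)
  {
          if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
                  return;

          /* flush only this memcg's rstat subtree instead of the root */
          cgroup_rstat_flush_irqsafe(memcg->css.cgroup);

          /*
           * Open question: a partial flush does not fold updates queued
           * under other subtrees, so simply resetting the global
           * stats_flush_threshold here would be wrong.
           */
  }

The refault path would then pass in the memcg of the faulting lruvec,
e.g. something like mem_cgroup_flush_stats_subtree(lruvec_memcg(lruvec))
in workingset_refault().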
Michal Hocko March 1, 2022, 9:05 a.m. UTC | #8
On Fri 25-02-22 16:24:12, Shakeel Butt wrote:
> Daniel Dao has reported [1] a regression on workloads that may trigger
> a lot of refaults (anon and file). The underlying issue is that flushing
> rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> MEMCG_BATCH) stat updates, it seems like there are workloads which
> genuinely do stat updates larger than the batch value within a short
> amount of time. Since the rstat flush can happen in performance critical
> codepaths like page faults, such workloads can suffer greatly.
> 
> The easiest fix for now is to have the performance critical codepaths
> trigger the rstat flush asynchronously. This patch converts the refault
> codepath to use the async rstat flush. In addition, this patch has
> preemptively converted mem_cgroup_wb_stats and shrink_node to also use
> the async rstat flush as they may see similar performance regressions.

Why do we need to trigger flushing from those paths in the first place?
Later in the thread you say there is regular flushing done every 2
seconds. What would happen if these paths didn't flush at all? Also
please note that the WQ context can be overwhelmed by other work, so
these flushes can happen much, much later.

So in other words, why does async work (which can happen at any time
without any control) make more sense than no flushing?
Shakeel Butt March 1, 2022, 5:21 p.m. UTC | #9
On Tue, Mar 1, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 25-02-22 16:24:12, Shakeel Butt wrote:
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is to have the performance critical codepaths
> > trigger the rstat flush asynchronously. This patch converts the refault
> > codepath to use the async rstat flush. In addition, this patch has
> > preemptively converted mem_cgroup_wb_stats and shrink_node to also use
> > the async rstat flush as they may see similar performance regressions.
>
> Why do we need to trigger flushing from those paths in the first place?
> Later in the thread you say there is regular flushing done every 2
> seconds. What would happen if these paths didn't flush at all? Also
> please note that the WQ context can be overwhelmed by other work, so
> these flushes can happen much, much later.
>
> So in other words, why does async work (which can happen at any time
> without any control) make more sense than no flushing?
> --

Without flushing, the worst that can happen in the refault path is
false (or missed) activations of the refaulted page. For the reclaim
code, some heuristics (like deactivating the active LRU or cache-trim)
may act on old information.

However I don't think these are too concerning, as the kernel can
already miss or do false activations on refault. For the reclaim code,
the kernel does force deactivation if it has skipped it in the initial
iterations, so there is not much to worry about.

Now, coming to your question, yes, we can remove the flushing from
these performance critical codepaths as the stats will be at most 2
seconds old due to the periodic flush. For the worst case scenario
where that periodic flush (WQ) does not get CPU time, I think it is
reasonable to do a sync flush if the periodic flush has not happened
for, let's say, 10 seconds.
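
Something like the following hypothetical sketch (not part of this
patch; stats_flush_deadline is an invented name, its boot-time
initialization is elided, and the 10 second bound is arbitrary):

  static unsigned long stats_flush_deadline;  /* jiffies; set by the worker */

  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
          __mem_cgroup_flush_stats();
          WRITE_ONCE(stats_flush_deadline, jiffies + 10 * HZ);
          queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL * HZ);
  }

  void mem_cgroup_flush_stats_async(void)
  {
          /* the periodic worker looks starved; pay the sync cost after all */
          if (time_after(jiffies, READ_ONCE(stats_flush_deadline))) {
                  mem_cgroup_flush_stats();
                  return;
          }

          if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
                  atomic_set(&stats_flush_threshold, 0);
                  mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
          }
  }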
Michal Koutný March 1, 2022, 5:57 p.m. UTC | #10
Making decisions based on information that is up to 2 s old.

On Tue, Mar 01, 2022 at 09:21:12AM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> Without flushing the worst that can happen in the refault path is
> false (or missed) activations of the refaulted page.

Yeah, this may under- or overestimate the workingset size (when it's
changing); the result is likely only less efficient reclaim.

> For reclaim code, some heuristics (like deactivating active LRU or
> cache-trim) may act on old information.

Here, I'd be more careful about whether such a delay could introduce
some unstable behavior (permanent oscillation in the worst case).

> Now, coming to your question, yes, we can remove the flushing from
> these performance critical codepaths as the stats will be at most 2
> seconds old due to the periodic flush.

Another aspect is that people will notice and report such a narrowly
located performance regression more easily than reduced/less predictable
reclaim behavior. (IMO the former is better; OTOH, it can also be
interpreted to mean that no one notices (or is able to notice).)

Michal

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ef4b445392a9..bfdd48be60ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -998,6 +998,7 @@  static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_async(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c695608c521c..4338e8d779b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -690,6 +690,14 @@  void mem_cgroup_flush_stats(void)
 		__mem_cgroup_flush_stats();
 }
 
+void mem_cgroup_flush_stats_async(void)
+{
+	if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
+		atomic_set(&stats_flush_threshold, 0);
+		mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
+	}
+}
+
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	__mem_cgroup_flush_stats();
@@ -4522,7 +4530,7 @@  void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_async();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6f77e3e6d59..b6c6b165c1ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3188,7 +3188,7 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_async();
 
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
diff --git a/mm/workingset.c b/mm/workingset.c
index b717eae4e0dd..a4f2b1aa5bcc 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -355,7 +355,7 @@  void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_async();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if