Message ID: 20220226002412.113819-1-shakeelb@google.com
State:      New
Series:     memcg: async flush memcg stats from perf sensitive codepaths
On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <shakeelb@google.com> wrote:

> Daniel Dao has reported [1] a regression on workloads that may trigger
> a lot of refaults (anon and file). The underlying issue is that flushing
> rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> MEMCG_BATCH) stat updates, it seems like there are workloads which
> genuinely do stat updates larger than the batch value within a short
> amount of time. Since the rstat flush can happen in performance critical
> codepaths like page faults, such workloads can suffer greatly.
>
> The easiest fix for now is for performance critical codepaths to trigger
> the rstat flush asynchronously. This patch converts the refault codepath
> to use async rstat flush. In addition, this patch has preemptively
> converted mem_cgroup_wb_stats and shrink_node to also use the async
> rstat flush as they may also see similar performance regressions.

Gee we do this trick a lot and gee I don't like it :(

a) if we're doing too much work then we're doing too much work.
   Punting that work over to a different CPU or thread doesn't alter
   that - it in fact adds more work.

b) there's an assumption here that the flusher is able to keep up
   with the producer. What happens if that isn't the case? Do we
   simply wind up the deferred items until the system goes oom?

   What happens if there's a producer running on each CPU? Can the
   flushers keep up?

   Pathologically, what happens if the producer is running
   task_is_realtime() on a single-CPU system? Or if there's a
   task_is_realtime() producer running on every CPU? The flusher never
   gets to run and we're dead?

An obvious fix is to limit the permissible amount of windup (to what?)
and at some point do the flushing synchronously anyway.

Or we just don't do any of this at all and put up with the cost of the
current code. I mean, this "fix" is kind of fake anyway, isn't it?
Pushing the 4-10ms delay onto a different CPU will just disrupt
something else which wanted to run on that CPU. The overall effect is
to hide the impact from one particular testcase, but is the benefit
really a real one?
On Fri, 25 Feb 2022 16:58:42 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <shakeelb@google.com> wrote:
>
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is for performance critical codepaths to trigger
> > the rstat flush asynchronously. This patch converts the refault codepath
> > to use async rstat flush. In addition, this patch has preemptively
> > converted mem_cgroup_wb_stats and shrink_node to also use the async
> > rstat flush as they may also see similar performance regressions.
>
> Gee we do this trick a lot and gee I don't like it :(
>
> a) if we're doing too much work then we're doing too much work.
>    Punting that work over to a different CPU or thread doesn't alter
>    that - it in fact adds more work.
>
> b) there's an assumption here that the flusher is able to keep up
>    with the producer. What happens if that isn't the case? Do we
>    simply wind up the deferred items until the system goes oom?
>
>    What happens if there's a producer running on each CPU? Can the
>    flushers keep up?
>
>    Pathologically, what happens if the producer is running
>    task_is_realtime() on a single-CPU system? Or if there's a
>    task_is_realtime() producer running on every CPU? The flusher never
>    gets to run and we're dead?

Not some theoretical thing, btw. See how __read_swap_cache_async() just
got its sins exposed by real-time tasks:

https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
On Fri, Feb 25, 2022 at 4:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 25 Feb 2022 16:24:12 -0800 Shakeel Butt <shakeelb@google.com> wrote:
>
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is for performance critical codepaths to trigger
> > the rstat flush asynchronously. This patch converts the refault codepath
> > to use async rstat flush. In addition, this patch has preemptively
> > converted mem_cgroup_wb_stats and shrink_node to also use the async
> > rstat flush as they may also see similar performance regressions.
>
> Gee we do this trick a lot and gee I don't like it :(
>
> a) if we're doing too much work then we're doing too much work.
>    Punting that work over to a different CPU or thread doesn't alter
>    that - it in fact adds more work.
>

Please note that we already have the async worker running every 2
seconds. Normally no consumer would need to flush the stats, but if
there are too many stat updates from producers in a short amount of
time then one of the consumers will have to pay the price of the flush.

We have two types of consumers, i.e. performance critical ones (e.g.
refault) and non-performance critical ones (e.g. memory.stat or
memory.numa_stat readers). I think we can let the performance critical
consumers skip the synchronous flushing and let the async worker do the
work, for performance reasons.

> b) there's an assumption here that the flusher is able to keep up
>    with the producer. What happens if that isn't the case? Do we
>    simply wind up the deferred items until the system goes oom?
>

Without a consumer, nothing bad can happen even if the flusher is slow
(or has too much work) or too many stats are being updated by many
producers. With a couple of consumers, in the current kernel, one of
them may have to pay the cost of a sync flush.

With this patch we will have two types of consumers: first, those who
are ok to pay the price of a sync flush to get accurate stats, and
second, those who are ok with out-of-sync stats bounded by 2 seconds
(yes, assuming the flusher runs every 2 seconds). BTW there is no
concern of the system going into oom due to reading slightly older
stats.

> What happens if there's a producer running on each CPU? Can the
> flushers keep up?
>
> Pathologically, what happens if the producer is running
> task_is_realtime() on a single-CPU system? Or if there's a
> task_is_realtime() producer running on every CPU? The flusher never
> gets to run and we're dead?
>

I think it would have to be a mix of (stat) producers and (stat)
consumers hogging the CPU forever, and no, we will not be dead. At
worst the consumers might be making some wrong decisions due to
consuming older stats. One can argue that since one consumer is the
reclaim code, some reclaim heuristics can get messed up due to older
stats. Yes, that can happen.

> An obvious fix is to limit the permissible amount of windup (to what?)
> and at some point, do the flushing synchronously anyway.
>

That is what we are currently doing. The limit is nr_cpus * MEMCG_BATCH.
> Or we just don't do any of this at all and put up with the cost of the
> current code. I mean, this "fix" is kind of fake anyway, isn't it?
> Pushing the 4-10ms delay onto a different CPU will just disrupt
> something else which wanted to run on that CPU. The overall effect is
> to hide the impact from one particular testcase, but is the benefit
> really a real one?
>

Yes, the right fix would be to optimize the flushing code (but that
would require more work/time). However, I still think letting
performance critical code paths skip the sync flush would be good in
general. So, if the current patch is not to your liking, we can remove
mem_cgroup_flush_stats() from workingset_refault().
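[Editorial note: for context, below is a simplified sketch of the existing
batching Shakeel refers to above, as it looks in this era of
mm/memcontrol.c. It is illustrative rather than a verbatim copy of the
upstream code; in particular, the changelog's MEMCG_BATCH corresponds to
MEMCG_CHARGE_BATCH in the tree, and the per-cpu bookkeeping is abbreviated.]

/*
 * Simplified sketch of the existing ratelimiting: producers account their
 * stat updates per-cpu, and a consumer only pays for a full rstat flush
 * once roughly nr_cpus * MEMCG_CHARGE_BATCH updates are pending.
 */
static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_threshold = ATOMIC_INIT(0);

static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
	unsigned int x;

	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

	x = __this_cpu_add_return(stats_updates, abs(val));
	if (x > MEMCG_CHARGE_BATCH) {
		atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
		__this_cpu_write(stats_updates, 0);
	}
}

void mem_cgroup_flush_stats(void)
{
	/* Synchronous flush, but only once enough updates have piled up. */
	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
		__mem_cgroup_flush_stats();	/* does the cgroup_rstat_flush() */
}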
Hi Shakeel, Thank you for the patch! Yet something to improve: [auto build test ERROR on hnaz-mm/master] [also build test ERROR on linus/master v5.17-rc5 next-20220224] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444 base: https://github.com/hnaz/linux-mm master config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220226/202202261045.FAsMZlyD-lkp@intel.com/config) compiler: gcc-9 (Debian 9.3.0-22) 9.3.0 reproduce (this is a W=1 build): # https://github.com/0day-ci/linux/commit/5dffeb24975bc4cbe99af650d833eb0183a4882f git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444 git checkout 5dffeb24975bc4cbe99af650d833eb0183a4882f # save the config file to linux build tree mkdir build_dir make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): mm/vmscan.c: In function 'shrink_node': >> mm/vmscan.c:3191:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async'; did you mean 'mem_cgroup_flush_stats'? [-Werror=implicit-function-declaration] 3191 | mem_cgroup_flush_stats_async(); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ | mem_cgroup_flush_stats cc1: some warnings being treated as errors -- mm/workingset.c: In function 'workingset_refault': >> mm/workingset.c:358:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async'; did you mean 'mem_cgroup_flush_stats'? [-Werror=implicit-function-declaration] 358 | mem_cgroup_flush_stats_async(); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ | mem_cgroup_flush_stats cc1: some warnings being treated as errors vim +3191 mm/vmscan.c 3175 3176 static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) 3177 { 3178 struct reclaim_state *reclaim_state = current->reclaim_state; 3179 unsigned long nr_reclaimed, nr_scanned; 3180 struct lruvec *target_lruvec; 3181 bool reclaimable = false; 3182 unsigned long file; 3183 3184 target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); 3185 3186 again: 3187 /* 3188 * Flush the memory cgroup stats, so that we read accurate per-memcg 3189 * lruvec stats for heuristics. 3190 */ > 3191 mem_cgroup_flush_stats_async(); 3192 3193 memset(&sc->nr, 0, sizeof(sc->nr)); 3194 3195 nr_reclaimed = sc->nr_reclaimed; 3196 nr_scanned = sc->nr_scanned; 3197 3198 /* 3199 * Determine the scan balance between anon and file LRUs. 3200 */ 3201 spin_lock_irq(&target_lruvec->lru_lock); 3202 sc->anon_cost = target_lruvec->anon_cost; 3203 sc->file_cost = target_lruvec->file_cost; 3204 spin_unlock_irq(&target_lruvec->lru_lock); 3205 3206 /* 3207 * Target desirable inactive:active list ratios for the anon 3208 * and file LRU lists. 
3209 */ 3210 if (!sc->force_deactivate) { 3211 unsigned long refaults; 3212 3213 refaults = lruvec_page_state(target_lruvec, 3214 WORKINGSET_ACTIVATE_ANON); 3215 if (refaults != target_lruvec->refaults[0] || 3216 inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) 3217 sc->may_deactivate |= DEACTIVATE_ANON; 3218 else 3219 sc->may_deactivate &= ~DEACTIVATE_ANON; 3220 3221 /* 3222 * When refaults are being observed, it means a new 3223 * workingset is being established. Deactivate to get 3224 * rid of any stale active pages quickly. 3225 */ 3226 refaults = lruvec_page_state(target_lruvec, 3227 WORKINGSET_ACTIVATE_FILE); 3228 if (refaults != target_lruvec->refaults[1] || 3229 inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) 3230 sc->may_deactivate |= DEACTIVATE_FILE; 3231 else 3232 sc->may_deactivate &= ~DEACTIVATE_FILE; 3233 } else 3234 sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; 3235 3236 /* 3237 * If we have plenty of inactive file pages that aren't 3238 * thrashing, try to reclaim those first before touching 3239 * anonymous pages. 3240 */ 3241 file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); 3242 if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) 3243 sc->cache_trim_mode = 1; 3244 else 3245 sc->cache_trim_mode = 0; 3246 3247 /* 3248 * Prevent the reclaimer from falling into the cache trap: as 3249 * cache pages start out inactive, every cache fault will tip 3250 * the scan balance towards the file LRU. And as the file LRU 3251 * shrinks, so does the window for rotation from references. 3252 * This means we have a runaway feedback loop where a tiny 3253 * thrashing file LRU becomes infinitely more attractive than 3254 * anon pages. Try to detect this based on file LRU size. 3255 */ 3256 if (!cgroup_reclaim(sc)) { 3257 unsigned long total_high_wmark = 0; 3258 unsigned long free, anon; 3259 int z; 3260 3261 free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); 3262 file = node_page_state(pgdat, NR_ACTIVE_FILE) + 3263 node_page_state(pgdat, NR_INACTIVE_FILE); 3264 3265 for (z = 0; z < MAX_NR_ZONES; z++) { 3266 struct zone *zone = &pgdat->node_zones[z]; 3267 if (!managed_zone(zone)) 3268 continue; 3269 3270 total_high_wmark += high_wmark_pages(zone); 3271 } 3272 3273 /* 3274 * Consider anon: if that's low too, this isn't a 3275 * runaway file reclaim problem, but rather just 3276 * extreme pressure. Reclaim as per usual then. 3277 */ 3278 anon = node_page_state(pgdat, NR_INACTIVE_ANON); 3279 3280 sc->file_is_tiny = 3281 file + free <= total_high_wmark && 3282 !(sc->may_deactivate & DEACTIVATE_ANON) && 3283 anon >> sc->priority; 3284 } 3285 3286 shrink_node_memcgs(pgdat, sc); 3287 3288 if (reclaim_state) { 3289 sc->nr_reclaimed += reclaim_state->reclaimed_slab; 3290 reclaim_state->reclaimed_slab = 0; 3291 } 3292 3293 /* Record the subtree's reclaim efficiency */ 3294 vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, 3295 sc->nr_scanned - nr_scanned, 3296 sc->nr_reclaimed - nr_reclaimed); 3297 3298 if (sc->nr_reclaimed - nr_reclaimed) 3299 reclaimable = true; 3300 3301 if (current_is_kswapd()) { 3302 /* 3303 * If reclaim is isolating dirty pages under writeback, 3304 * it implies that the long-lived page allocation rate 3305 * is exceeding the page laundering rate. Either the 3306 * global limits are not being effective at throttling 3307 * processes due to the page distribution throughout 3308 * zones or there is heavy usage of a slow backing 3309 * device. 
The only option is to throttle from reclaim 3310 * context which is not ideal as there is no guarantee 3311 * the dirtying process is throttled in the same way 3312 * balance_dirty_pages() manages. 3313 * 3314 * Once a node is flagged PGDAT_WRITEBACK, kswapd will 3315 * count the number of pages under pages flagged for 3316 * immediate reclaim and stall if any are encountered 3317 * in the nr_immediate check below. 3318 */ 3319 if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) 3320 set_bit(PGDAT_WRITEBACK, &pgdat->flags); 3321 3322 /* Allow kswapd to start writing pages during reclaim.*/ 3323 if (sc->nr.unqueued_dirty == sc->nr.file_taken) 3324 set_bit(PGDAT_DIRTY, &pgdat->flags); 3325 3326 /* 3327 * If kswapd scans pages marked for immediate 3328 * reclaim and under writeback (nr_immediate), it 3329 * implies that pages are cycling through the LRU 3330 * faster than they are written so forcibly stall 3331 * until some pages complete writeback. 3332 */ 3333 if (sc->nr.immediate) 3334 reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); 3335 } 3336 3337 /* 3338 * Tag a node/memcg as congested if all the dirty pages were marked 3339 * for writeback and immediate reclaim (counted in nr.congested). 3340 * 3341 * Legacy memcg will stall in page writeback so avoid forcibly 3342 * stalling in reclaim_throttle(). 3343 */ 3344 if ((current_is_kswapd() || 3345 (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) && 3346 sc->nr.dirty && sc->nr.dirty == sc->nr.congested) 3347 set_bit(LRUVEC_CONGESTED, &target_lruvec->flags); 3348 3349 /* 3350 * Stall direct reclaim for IO completions if the lruvec is 3351 * node is congested. Allow kswapd to continue until it 3352 * starts encountering unqueued dirty pages or cycling through 3353 * the LRU too quickly. 3354 */ 3355 if (!current_is_kswapd() && current_may_throttle() && 3356 !sc->hibernation_mode && 3357 test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) 3358 reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); 3359 3360 if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, 3361 sc)) 3362 goto again; 3363 3364 /* 3365 * Kswapd gives up on balancing particular nodes after too 3366 * many failures to reclaim anything from them and goes to 3367 * sleep. On reclaim progress, reset the failure counter. A 3368 * successful direct reclaim run will revive a dormant kswapd. 3369 */ 3370 if (reclaimable) 3371 pgdat->kswapd_failures = 0; 3372 } 3373 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi Shakeel, Thank you for the patch! Yet something to improve: [auto build test ERROR on hnaz-mm/master] [also build test ERROR on linus/master v5.17-rc5 next-20220225] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444 base: https://github.com/hnaz/linux-mm master config: hexagon-randconfig-r023-20220226 (https://download.01.org/0day-ci/archive/20220226/202202262037.x6q4f1Yk-lkp@intel.com/config) compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project d271fc04d5b97b12e6b797c6067d3c96a8d7470e) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/5dffeb24975bc4cbe99af650d833eb0183a4882f git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Shakeel-Butt/memcg-async-flush-memcg-stats-from-perf-sensitive-codepaths/20220226-082444 git checkout 5dffeb24975bc4cbe99af650d833eb0183a4882f # save the config file to linux build tree mkdir build_dir COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> mm/vmscan.c:3191:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async' [-Werror,-Wimplicit-function-declaration] mem_cgroup_flush_stats_async(); ^ mm/vmscan.c:3191:2: note: did you mean 'mem_cgroup_flush_stats'? include/linux/memcontrol.h:1441:20: note: 'mem_cgroup_flush_stats' declared here static inline void mem_cgroup_flush_stats(void) ^ 1 error generated. -- >> mm/workingset.c:358:2: error: implicit declaration of function 'mem_cgroup_flush_stats_async' [-Werror,-Wimplicit-function-declaration] mem_cgroup_flush_stats_async(); ^ mm/workingset.c:358:2: note: did you mean 'mem_cgroup_flush_stats'? include/linux/memcontrol.h:1441:20: note: 'mem_cgroup_flush_stats' declared here static inline void mem_cgroup_flush_stats(void) ^ 1 error generated. vim +/mem_cgroup_flush_stats_async +3191 mm/vmscan.c 3175 3176 static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) 3177 { 3178 struct reclaim_state *reclaim_state = current->reclaim_state; 3179 unsigned long nr_reclaimed, nr_scanned; 3180 struct lruvec *target_lruvec; 3181 bool reclaimable = false; 3182 unsigned long file; 3183 3184 target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); 3185 3186 again: 3187 /* 3188 * Flush the memory cgroup stats, so that we read accurate per-memcg 3189 * lruvec stats for heuristics. 3190 */ > 3191 mem_cgroup_flush_stats_async(); 3192 3193 memset(&sc->nr, 0, sizeof(sc->nr)); 3194 3195 nr_reclaimed = sc->nr_reclaimed; 3196 nr_scanned = sc->nr_scanned; 3197 3198 /* 3199 * Determine the scan balance between anon and file LRUs. 3200 */ 3201 spin_lock_irq(&target_lruvec->lru_lock); 3202 sc->anon_cost = target_lruvec->anon_cost; 3203 sc->file_cost = target_lruvec->file_cost; 3204 spin_unlock_irq(&target_lruvec->lru_lock); 3205 3206 /* 3207 * Target desirable inactive:active list ratios for the anon 3208 * and file LRU lists. 
3209 */ 3210 if (!sc->force_deactivate) { 3211 unsigned long refaults; 3212 3213 refaults = lruvec_page_state(target_lruvec, 3214 WORKINGSET_ACTIVATE_ANON); 3215 if (refaults != target_lruvec->refaults[0] || 3216 inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) 3217 sc->may_deactivate |= DEACTIVATE_ANON; 3218 else 3219 sc->may_deactivate &= ~DEACTIVATE_ANON; 3220 3221 /* 3222 * When refaults are being observed, it means a new 3223 * workingset is being established. Deactivate to get 3224 * rid of any stale active pages quickly. 3225 */ 3226 refaults = lruvec_page_state(target_lruvec, 3227 WORKINGSET_ACTIVATE_FILE); 3228 if (refaults != target_lruvec->refaults[1] || 3229 inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) 3230 sc->may_deactivate |= DEACTIVATE_FILE; 3231 else 3232 sc->may_deactivate &= ~DEACTIVATE_FILE; 3233 } else 3234 sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; 3235 3236 /* 3237 * If we have plenty of inactive file pages that aren't 3238 * thrashing, try to reclaim those first before touching 3239 * anonymous pages. 3240 */ 3241 file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); 3242 if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) 3243 sc->cache_trim_mode = 1; 3244 else 3245 sc->cache_trim_mode = 0; 3246 3247 /* 3248 * Prevent the reclaimer from falling into the cache trap: as 3249 * cache pages start out inactive, every cache fault will tip 3250 * the scan balance towards the file LRU. And as the file LRU 3251 * shrinks, so does the window for rotation from references. 3252 * This means we have a runaway feedback loop where a tiny 3253 * thrashing file LRU becomes infinitely more attractive than 3254 * anon pages. Try to detect this based on file LRU size. 3255 */ 3256 if (!cgroup_reclaim(sc)) { 3257 unsigned long total_high_wmark = 0; 3258 unsigned long free, anon; 3259 int z; 3260 3261 free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); 3262 file = node_page_state(pgdat, NR_ACTIVE_FILE) + 3263 node_page_state(pgdat, NR_INACTIVE_FILE); 3264 3265 for (z = 0; z < MAX_NR_ZONES; z++) { 3266 struct zone *zone = &pgdat->node_zones[z]; 3267 if (!managed_zone(zone)) 3268 continue; 3269 3270 total_high_wmark += high_wmark_pages(zone); 3271 } 3272 3273 /* 3274 * Consider anon: if that's low too, this isn't a 3275 * runaway file reclaim problem, but rather just 3276 * extreme pressure. Reclaim as per usual then. 3277 */ 3278 anon = node_page_state(pgdat, NR_INACTIVE_ANON); 3279 3280 sc->file_is_tiny = 3281 file + free <= total_high_wmark && 3282 !(sc->may_deactivate & DEACTIVATE_ANON) && 3283 anon >> sc->priority; 3284 } 3285 3286 shrink_node_memcgs(pgdat, sc); 3287 3288 if (reclaim_state) { 3289 sc->nr_reclaimed += reclaim_state->reclaimed_slab; 3290 reclaim_state->reclaimed_slab = 0; 3291 } 3292 3293 /* Record the subtree's reclaim efficiency */ 3294 vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, 3295 sc->nr_scanned - nr_scanned, 3296 sc->nr_reclaimed - nr_reclaimed); 3297 3298 if (sc->nr_reclaimed - nr_reclaimed) 3299 reclaimable = true; 3300 3301 if (current_is_kswapd()) { 3302 /* 3303 * If reclaim is isolating dirty pages under writeback, 3304 * it implies that the long-lived page allocation rate 3305 * is exceeding the page laundering rate. Either the 3306 * global limits are not being effective at throttling 3307 * processes due to the page distribution throughout 3308 * zones or there is heavy usage of a slow backing 3309 * device. 
The only option is to throttle from reclaim 3310 * context which is not ideal as there is no guarantee 3311 * the dirtying process is throttled in the same way 3312 * balance_dirty_pages() manages. 3313 * 3314 * Once a node is flagged PGDAT_WRITEBACK, kswapd will 3315 * count the number of pages under pages flagged for 3316 * immediate reclaim and stall if any are encountered 3317 * in the nr_immediate check below. 3318 */ 3319 if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) 3320 set_bit(PGDAT_WRITEBACK, &pgdat->flags); 3321 3322 /* Allow kswapd to start writing pages during reclaim.*/ 3323 if (sc->nr.unqueued_dirty == sc->nr.file_taken) 3324 set_bit(PGDAT_DIRTY, &pgdat->flags); 3325 3326 /* 3327 * If kswapd scans pages marked for immediate 3328 * reclaim and under writeback (nr_immediate), it 3329 * implies that pages are cycling through the LRU 3330 * faster than they are written so forcibly stall 3331 * until some pages complete writeback. 3332 */ 3333 if (sc->nr.immediate) 3334 reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); 3335 } 3336 3337 /* 3338 * Tag a node/memcg as congested if all the dirty pages were marked 3339 * for writeback and immediate reclaim (counted in nr.congested). 3340 * 3341 * Legacy memcg will stall in page writeback so avoid forcibly 3342 * stalling in reclaim_throttle(). 3343 */ 3344 if ((current_is_kswapd() || 3345 (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) && 3346 sc->nr.dirty && sc->nr.dirty == sc->nr.congested) 3347 set_bit(LRUVEC_CONGESTED, &target_lruvec->flags); 3348 3349 /* 3350 * Stall direct reclaim for IO completions if the lruvec is 3351 * node is congested. Allow kswapd to continue until it 3352 * starts encountering unqueued dirty pages or cycling through 3353 * the LRU too quickly. 3354 */ 3355 if (!current_is_kswapd() && current_may_throttle() && 3356 !sc->hibernation_mode && 3357 test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) 3358 reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); 3359 3360 if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, 3361 sc)) 3362 goto again; 3363 3364 /* 3365 * Kswapd gives up on balancing particular nodes after too 3366 * many failures to reclaim anything from them and goes to 3367 * sleep. On reclaim progress, reset the failure counter. A 3368 * successful direct reclaim run will revive a dormant kswapd. 3369 */ 3370 if (reclaimable) 3371 pgdat->kswapd_failures = 0; 3372 } 3373 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
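[Editorial note: both robot reports above point at the same gap. The patch
only declares mem_cgroup_flush_stats_async() in the CONFIG_MEMCG section of
include/linux/memcontrol.h, while the failing configs (um-i386_defconfig and
the hexagon randconfig) build without memcg and only see the static inline
mem_cgroup_flush_stats() stub at memcontrol.h:1441. A likely fix, sketched
here as an assumption rather than taken from a posted v2, is a matching
no-op stub next to the existing one:]

/* In the !CONFIG_MEMCG branch of include/linux/memcontrol.h: */
static inline void mem_cgroup_flush_stats_async(void)
{
}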
On Fri, Feb 25, 2022 at 05:42:57PM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> Yes, the right fix would be to optimize the flushing code (but that
> would require more work/time). However, I still think letting
> performance critical code paths skip the sync flush would be good in
> general. So, if the current patch is not to your liking, we can remove
> mem_cgroup_flush_stats() from workingset_refault().

What about flushing just the subtree of the memcg where the refault
happens?

It doesn't reduce the overall work, and there's still the full-tree
cgroup_rstat_lock, but it should make the chunks of work smaller and the
durations more regular.

Michal
On Mon, Feb 28, 2022 at 10:46 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Feb 25, 2022 at 05:42:57PM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> > Yes, the right fix would be to optimize the flushing code (but that
> > would require more work/time). However, I still think letting
> > performance critical code paths skip the sync flush would be good in
> > general. So, if the current patch is not to your liking, we can remove
> > mem_cgroup_flush_stats() from workingset_refault().
>
> What about flushing just the subtree of the memcg where the refault
> happens?
>
> It doesn't reduce the overall work, and there's still the full-tree
> cgroup_rstat_lock, but it should make the chunks of work smaller and
> the durations more regular.
>

We can try that, and I will send a patch to Ivan and Daniel to try on
their workload to see the real impact of targeted memcg flushing.
However, I am not very optimistic about it.
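[Editorial note: to make the suggestion concrete, here is a hypothetical
sketch of a targeted flush. The helper name mem_cgroup_flush_stats_memcg()
and the call-site wrapper are made up for illustration; no such functions
exist in the posted series, and any ratelimiting they would need is left
out.]

/*
 * Hypothetical targeted flush: flush only the rstat subtree of @memcg
 * instead of the whole hierarchy rooted at root_mem_cgroup.
 */
void mem_cgroup_flush_stats_memcg(struct mem_cgroup *memcg)
{
	if (!memcg)
		memcg = root_mem_cgroup;

	/*
	 * cgroup_rstat_flush() still serializes on the global
	 * cgroup_rstat_lock, but it only walks the per-cpu updated trees
	 * under @memcg, so each flush does a smaller, more predictable
	 * amount of work.
	 */
	cgroup_rstat_flush(memcg->css.cgroup);
}

/* Possible use in the refault path, flushing just the faulting memcg: */
static void workingset_refault_flush(struct lruvec *lruvec)
{
	mem_cgroup_flush_stats_memcg(lruvec_memcg(lruvec));
}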
On Fri 25-02-22 16:24:12, Shakeel Butt wrote:
> Daniel Dao has reported [1] a regression on workloads that may trigger
> a lot of refaults (anon and file). The underlying issue is that flushing
> rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> MEMCG_BATCH) stat updates, it seems like there are workloads which
> genuinely do stat updates larger than the batch value within a short
> amount of time. Since the rstat flush can happen in performance critical
> codepaths like page faults, such workloads can suffer greatly.
>
> The easiest fix for now is for performance critical codepaths to trigger
> the rstat flush asynchronously. This patch converts the refault codepath
> to use async rstat flush. In addition, this patch has preemptively
> converted mem_cgroup_wb_stats and shrink_node to also use the async
> rstat flush as they may also see similar performance regressions.

Why do we need to trigger flushing in the first place from those paths?
Later in the thread you are saying there is regular flushing done every
2 seconds. What would happen if these paths didn't flush at all? Also
please note that WQ context can be overwhelmed by other work, so these
flushes can happen much, much later.

So in other words, why does async work (that can happen at any time
without any control) make more sense than no flushing?
On Tue, Mar 1, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 25-02-22 16:24:12, Shakeel Butt wrote:
> > Daniel Dao has reported [1] a regression on workloads that may trigger
> > a lot of refaults (anon and file). The underlying issue is that flushing
> > rstat is expensive. Although rstat flushes are batched with (nr_cpus *
> > MEMCG_BATCH) stat updates, it seems like there are workloads which
> > genuinely do stat updates larger than the batch value within a short
> > amount of time. Since the rstat flush can happen in performance critical
> > codepaths like page faults, such workloads can suffer greatly.
> >
> > The easiest fix for now is for performance critical codepaths to trigger
> > the rstat flush asynchronously. This patch converts the refault codepath
> > to use async rstat flush. In addition, this patch has preemptively
> > converted mem_cgroup_wb_stats and shrink_node to also use the async
> > rstat flush as they may also see similar performance regressions.
>
> Why do we need to trigger flushing in the first place from those paths?
> Later in the thread you are saying there is regular flushing done every
> 2 seconds. What would happen if these paths didn't flush at all? Also
> please note that WQ context can be overwhelmed by other work, so these
> flushes can happen much, much later.
>
> So in other words, why does async work (that can happen at any time
> without any control) make more sense than no flushing?
> --

Without flushing, the worst that can happen in the refault path is false
(or missed) activations of the refaulted page. For the reclaim code,
some heuristics (like deactivating the active LRU or cache-trim) may act
on old information. However, I don't think these are too concerning, as
the kernel can already miss or do false activations on refault. For the
reclaim code, the kernel does force deactivation if it has skipped it in
the initial iterations, so there is not much to worry about.

Now, coming to your question: yes, we can remove the flushing from these
performance critical codepaths, as the stats will be at most 2 seconds
old due to the periodic flush. For the worst case scenario where that
periodic flush (WQ) is not getting CPU, I think it is reasonable to put
in a sync flush if the periodic flush has not happened for, let's say,
10 seconds.
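[Editorial note: a minimal sketch of the bounded-staleness fallback
described above, reusing the stats_flush_threshold / stats_flush_dwork
machinery from mm/memcontrol.c and the posted patch. The flush_last_jiffies
timestamp (which the flusher would have to maintain) and the 10 second
constant are assumptions for illustration; none of this is part of the
posted patch.]

/* Assumption: updated by the flusher whenever it completes a flush. */
static unsigned long flush_last_jiffies;

/* Fall back to a sync flush if the worker has been starved this long. */
#define STATS_FLUSH_MAX_AGE	(10 * HZ)

void mem_cgroup_flush_stats_async(void)
{
	if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
		return;

	if (time_after(jiffies, flush_last_jiffies + STATS_FLUSH_MAX_AGE)) {
		/* Worker starved (e.g. by RT hogs): keep staleness bounded. */
		__mem_cgroup_flush_stats();
	} else {
		/* Normal case: hand the work to the async worker. */
		atomic_set(&stats_flush_threshold, 0);
		mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
	}
}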
Making decisions based on up to 2 s old information.

On Tue, Mar 01, 2022 at 09:21:12AM -0800, Shakeel Butt <shakeelb@google.com> wrote:
> Without flushing, the worst that can happen in the refault path is
> false (or missed) activations of the refaulted page.

Yeah, this may under- or overestimate the workingset size (when it's
changing); the result is likely only less efficient reclaim.

> For the reclaim code, some heuristics (like deactivating the active LRU
> or cache-trim) may act on old information.

Here, I'd be more careful about whether such a delay cannot introduce
some unstable behavior (permanent oscillation in the worst case).

> Now, coming to your question: yes, we can remove the flushing from
> these performance critical codepaths, as the stats will be at most 2
> seconds old due to the periodic flush.

Another aspect is that people will notice and report such a narrowly
located performance regression more easily than reduced/less predictable
reclaim behavior. (IMO the former is better; OTOH, it can also be
interpreted that no one notices (or is able to notice).)

Michal
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ef4b445392a9..bfdd48be60ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -998,6 +998,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_async(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c695608c521c..4338e8d779b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -690,6 +690,14 @@ void mem_cgroup_flush_stats(void)
 		__mem_cgroup_flush_stats();
 }
 
+void mem_cgroup_flush_stats_async(void)
+{
+	if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
+		atomic_set(&stats_flush_threshold, 0);
+		mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
+	}
+}
+
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	__mem_cgroup_flush_stats();
@@ -4522,7 +4530,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_async();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6f77e3e6d59..b6c6b165c1ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3188,7 +3188,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_async();
 
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
diff --git a/mm/workingset.c b/mm/workingset.c
index b717eae4e0dd..a4f2b1aa5bcc 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -355,7 +355,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_async();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
Daniel Dao has reported [1] a regression on workloads that may trigger
a lot of refaults (anon and file). The underlying issue is that flushing
rstat is expensive. Although rstat flushes are batched with (nr_cpus *
MEMCG_BATCH) stat updates, it seems like there are workloads which
genuinely do stat updates larger than the batch value within a short
amount of time. Since the rstat flush can happen in performance critical
codepaths like page faults, such workloads can suffer greatly.

The easiest fix for now is for performance critical codepaths to trigger
the rstat flush asynchronously. This patch converts the refault codepath
to use async rstat flush. In addition, this patch has preemptively
converted mem_cgroup_wb_stats and shrink_node to also use the async
rstat flush as they may also see similar performance regressions.

Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault")
Reported-by: Daniel Dao <dqminh@cloudflare.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 10 +++++++++-
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  2 +-
 4 files changed, 12 insertions(+), 3 deletions(-)