From patchwork Sun Dec 1 01:55:56 2019
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 11268401
Date: Sat, 30 Nov 2019 17:55:56 -0800
From: akpm@linux-foundation.org
To: akpm@linux-foundation.org, aryabinin@virtuozzo.com, hannes@cmpxchg.org, linux-mm@kvack.org, mhocko@suse.com, mm-commits@vger.kernel.org, riel@surriel.com, shakeelb@google.com, surenb@google.com, torvalds@linux-foundation.org
Subject: [patch 112/158] mm: vmscan: move file exhaustion detection to the node level
Message-ID: <20191201015556.HzCjHml6V%akpm@linux-foundation.org>
User-Agent: s-nail v14.8.16

From: Johannes Weiner
Subject: mm: vmscan: move file exhaustion detection to the node level

Patch series "mm: fix page aging across multiple cgroups".

When applications are put into unconfigured cgroups for memory accounting
purposes, the cgrouping itself should not change the behavior of the page
reclaim code.  We expect the VM to reclaim the coldest pages in the
system.  But right now the VM can reclaim hot pages in one cgroup while
there is eligible cold cache in others.

This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing.  That is the part
that is supposed to protect hot cache data from one-off streaming IO.

The recursive cgroup reclaim scheme will scan and rotate the physical LRU
lists of each eligible cgroup at the same rate in a round-robin fashion,
thereby establishing a relative order among the pages of all those
cgroups.  However, the inactive/active balancing decisions are made
locally within each cgroup, so when a cgroup is running low on cold
pages, its hot pages will get reclaimed - even when sibling cgroups have
plenty of eligible cold cache in the same reclaim run.

For example:

[root@ham ~]# head -n1 /proc/meminfo
MemTotal: 1016336 kB

[root@ham ~]# ./reclaimtest2.sh
Establishing 50M active files in cgroup A...
Hot pages cached: 12800/12800 workingset-a
Linearly scanning through 18G of file data in cgroup B:
real    0m4.269s
user    0m0.051s
sys     0m4.182s
Hot pages cached: 134/12800 workingset-a

The streaming IO in B, which doesn't benefit from caching at all, pushes
out most of the workingset in A.

Solution

This series fixes the problem by elevating inactive/active balancing
decisions to the top level of the reclaim run.  This is either a cgroup
that hit its limit, or straight-up global reclaim if there is physical
memory pressure.  From there, it takes a recursive view of the cgroup
subtree to decide whether page deactivation is necessary.

In the test above, the VM will then recognize that cgroup B has plenty of
eligible cold cache, and that the hot pages in A can be spared:

[root@ham ~]# ./reclaimtest2.sh
Establishing 50M active files in cgroup A...
Hot pages cached: 12800/12800 workingset-a
Linearly scanning through 18G of file data in cgroup B:
real    0m4.244s
user    0m0.064s
sys     0m4.177s
Hot pages cached: 12800/12800 workingset-a

Implementation

Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to the
active list, and the occurrence of refaults.

This patch series first moves refault detection to the reclaim root, then
enforces the minimum inactive size based on a recursive view of the
cgroup tree's LRUs.
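To make the size-based factor above concrete, here is a minimal userspace
sketch of an inactive-vs-active balance check.  The function name, the
parameters, and the sqrt-based ratio are illustrative assumptions only;
the kernel's actual inactive_list_is_low() operates on lruvec sizes and
also consults refault information, as described above.

#include <math.h>
#include <stdbool.h>

/*
 * Illustrative only: decide whether the inactive list is "too small"
 * relative to the active list.  The sqrt-based ratio is an assumption
 * for demonstration, not the kernel's exact formula.
 */
static bool inactive_is_low_demo(unsigned long inactive_pages,
                                 unsigned long active_pages,
                                 unsigned long lru_size_gb)
{
        unsigned long ratio = 1;

        /* Larger LRUs tolerate a relatively smaller inactive list. */
        if (lru_size_gb)
                ratio = (unsigned long)sqrt(10.0 * (double)lru_size_gb);
        if (!ratio)
                ratio = 1;

        /* Deactivation is warranted once inactive * ratio < active. */
        return inactive_pages * ratio < active_pages;
}

The intent is only to show the shape of the heuristic: the larger the
LRU, the smaller the inactive list is allowed to be relative to the
active list before deactivation kicks in.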
History

Note that this actually never worked correctly in Linux cgroups.  In the
past it worked for global reclaim and leaf limit reclaim only (we used to
have two physical LRU linkages per page), but it never worked for
intermediate limit reclaim over multiple leaf cgroups.

We're noticing this now because 1) we're putting everything into cgroups
for accounting, not just the things we want to control, and 2) we're
moving away from leaf limits that invoke reclaim on individual cgroups,
toward large tree reclaim, triggered by high-level limits or physical
memory pressure, that is influenced by local protections such as
memory.low and memory.min instead.

This patch (of 3):

When file pages are lower than the watermark on a node, we try to force
scan anonymous pages to counteract the balancing algorithm's preference
for new file pages when they are likely thrashing.  This is a node-level
decision, but it's currently made each time we look at an lruvec.  This
is unnecessarily expensive and also a layering violation that makes the
code harder to understand.

Clean this up by making the check once per node and setting a flag in
the scan_control.

Link: http://lkml.kernel.org/r/20191107205334.158354-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Shakeel Butt
Reviewed-by: Suren Baghdasaryan
Cc: Andrey Ryabinin
Cc: Michal Hocko
Cc: Rik van Riel
Signed-off-by: Andrew Morton
---

 mm/vmscan.c |   82 ++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 39 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-move-file-exhaustion-detection-to-the-node-level
+++ a/mm/vmscan.c
@@ -101,6 +101,9 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	/* The file pages on the current node are dangerously low */
+	unsigned int file_is_tiny:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2289,45 +2292,16 @@ static void get_scan_count(struct lruvec
 	}
 
 	/*
-	 * Prevent the reclaimer from falling into the cache trap: as
-	 * cache pages start out inactive, every cache fault will tip
-	 * the scan balance towards the file LRU.  And as the file LRU
-	 * shrinks, so does the window for rotation from references.
-	 * This means we have a runaway feedback loop where a tiny
-	 * thrashing file LRU becomes infinitely more attractive than
-	 * anon pages.  Try to detect this based on file LRU size.
-	 */
-	if (!cgroup_reclaim(sc)) {
-		unsigned long pgdatfile;
-		unsigned long pgdatfree;
-		int z;
-		unsigned long total_high_wmark = 0;
-
-		pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
-		pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
-			    node_page_state(pgdat, NR_INACTIVE_FILE);
-
-		for (z = 0; z < MAX_NR_ZONES; z++) {
-			struct zone *zone = &pgdat->node_zones[z];
-			if (!managed_zone(zone))
-				continue;
-
-			total_high_wmark += high_wmark_pages(zone);
-		}
-
-		if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
-			/*
-			 * Force SCAN_ANON if there are enough inactive
-			 * anonymous pages on the LRU in eligible zones.
-			 * Otherwise, the small LRU gets thrashed.
-			 */
-			if (!inactive_list_is_low(lruvec, false, sc, false) &&
-			    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx)
-					>> sc->priority) {
-				scan_balance = SCAN_ANON;
-				goto out;
-			}
-		}
+	 * If the system is almost out of file pages, force-scan anon.
+	 * But only if there are enough inactive anonymous pages on
+	 * the LRU. Otherwise, the small LRU gets thrashed.
+	 */
+	if (sc->file_is_tiny &&
+	    !inactive_list_is_low(lruvec, false, sc, false) &&
+	    lruvec_lru_size(lruvec, LRU_INACTIVE_ANON,
+			    sc->reclaim_idx) >> sc->priority) {
+		scan_balance = SCAN_ANON;
+		goto out;
 	}
 
 	/*
@@ -2754,6 +2728,36 @@ again:
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
+	/*
+	 * Prevent the reclaimer from falling into the cache trap: as
+	 * cache pages start out inactive, every cache fault will tip
+	 * the scan balance towards the file LRU.  And as the file LRU
+	 * shrinks, so does the window for rotation from references.
+	 * This means we have a runaway feedback loop where a tiny
+	 * thrashing file LRU becomes infinitely more attractive than
+	 * anon pages.  Try to detect this based on file LRU size.
+	 */
+	if (!cgroup_reclaim(sc)) {
+		unsigned long file;
+		unsigned long free;
+		int z;
+		unsigned long total_high_wmark = 0;
+
+		free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		file = node_page_state(pgdat, NR_ACTIVE_FILE) +
+		       node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+			if (!managed_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
+
+		sc->file_is_tiny = file + free <= total_high_wmark;
+	}
+
 	shrink_node_memcgs(pgdat, sc);
 
 	if (reclaim_state) {
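For readers skimming the diff, the node-level check added above can be
read as the following standalone paraphrase.  It assumes the usual
mm/vmscan.c context (struct scan_control, pg_data_t, and the page-state
helpers already used in the hunk); the helper name
detect_file_exhaustion() is made up here for readability, since in the
patch the code sits inline at the "again:" site shown above.

/*
 * Readability paraphrase of the node-level check added above (not a
 * drop-in function; in the patch this code runs directly in the
 * reclaim path, once per node, before shrink_node_memcgs()).
 */
static void detect_file_exhaustion(pg_data_t *pgdat, struct scan_control *sc)
{
	unsigned long file, free, total_high_wmark = 0;
	int z;

	/* Only node-wide (non-cgroup) reclaim looks at the whole node. */
	if (cgroup_reclaim(sc))
		return;

	free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
	file = node_page_state(pgdat, NR_ACTIVE_FILE) +
	       node_page_state(pgdat, NR_INACTIVE_FILE);

	for (z = 0; z < MAX_NR_ZONES; z++) {
		struct zone *zone = &pgdat->node_zones[z];

		if (!managed_zone(zone))
			continue;

		total_high_wmark += high_wmark_pages(zone);
	}

	/* File cache plus free memory no longer covers the high watermarks. */
	sc->file_is_tiny = file + free <= total_high_wmark;
}

The consumer is the get_scan_count() hunk earlier in the diff, which now
only tests sc->file_is_tiny instead of recomputing the node-wide totals
for every lruvec it visits.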