From patchwork Sun Sep 18 08:00:06 2022
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 12979374
Date: Sun, 18 Sep 2022 02:00:06 -0600
In-Reply-To: <20220918080010.2920238-1-yuzhao@google.com>
Message-Id: <20220918080010.2920238-10-yuzhao@google.com>
References: <20220918080010.2920238-1-yuzhao@google.com>
X-Mailer: git-send-email 2.37.3.968.ga6b4b080e4-goog
Subject: [PATCH mm-unstable v15 09/14] mm: multi-gen LRU: optimize multiple memcgs
From: Yu Zhao
To: Andrew Morton
Cc: Andi Kleen, Aneesh Kumar, Catalin Marinas, Dave Hansen, Hillf Danton,
    Jens Axboe, Johannes Weiner, Jonathan Corbet, Linus Torvalds,
    Matthew Wilcox, Mel Gorman, Michael Larabel, Michal Hocko,
    Mike Rapoport, Peter Zijlstra, Tejun Heo, Vlastimil Babka, Will Deacon,
    linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
    page-reclaim@google.com, Yu Zhao, Brian Geffon, Jan Alexander Steffens,
    Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal, Daniel Byrne,
    Donald Carr, Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
    Sofia Trinh, Vaibhav Jain

When multiple memcgs are available, it is possible to use generations
as a frame of reference to make better choices and improve overall
performance under global memory pressure. This patch adds a basic
optimization to select memcgs that can drop single-use unmapped clean
pages first. Doing so reduces the chance of going into the aging path
or swapping, which can be costly.

A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., a heavy anon workload in one
memcg and a heavy buffered I/O workload in the other.

Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements may cover the direct reclaim path.

While ensuring a degree of fairness to all eligible memcgs,
proportional scans of individual memcgs also require proper backoff to
avoid overshooting their aggregate reclaim target by too much.
Otherwise it can cause high direct reclaim latency. The conditions for
backoff are:

1. At low priorities, for direct reclaim, if aging fairness or direct
   reclaim latency is at risk, i.e., aging one memcg multiple times or
   swapping after the target is met.
2. At high priorities, for global reclaim, if per-zone free pages are
   above respective watermarks.
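For orientation before the diff, here is a condensed, stand-alone sketch
of the aging/eviction handshake described above. It is illustrative only
and not part of the patch: the names memcgs_need_aging, last_reclaimed
and MIN_LRU_BATCH mirror the changes to mm/vmscan.c below, while the
struct and function bodies are simplified stand-ins (in the kernel,
MIN_LRU_BATCH is BITS_PER_LONG; 64 is assumed here).

/*
 * Illustrative sketch only -- not part of the patch. Field and constant
 * names mirror the changes to mm/vmscan.c below; everything else is a
 * simplified stand-in.
 */
#include <stdbool.h>

#define MIN_LRU_BATCH 64	/* BITS_PER_LONG on 64-bit kernels */

struct sc_sketch {		/* stand-in for the new scan_control fields */
	unsigned long nr_reclaimed;
	unsigned long last_reclaimed;
	bool memcgs_need_aging;
};

/* Aging path (kswapd only): skip one costly aging pass when the previous
 * eviction pass met its target without aging or swapping. */
static void age_node_sketch(struct sc_sketch *sc)
{
	sc->last_reclaimed = sc->nr_reclaimed;

	if (!sc->memcgs_need_aging) {
		sc->memcgs_need_aging = true;	/* re-arm for the next pass */
		return;				/* skip walking all memcgs */
	}
	/* ... otherwise iterate memcgs and age their lruvecs ... */
}

/* Eviction path: clear the flag only after reclaiming at least
 * MIN_LRU_BATCH pages from a memcg without needing the aging path. */
static void shrink_lruvec_sketch(struct sc_sketch *sc,
				 unsigned long reclaimed_before, bool need_aging)
{
	if (sc->nr_reclaimed - reclaimed_before >= MIN_LRU_BATCH && !need_aging)
		sc->memcgs_need_aging = false;
}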
Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): +[19, 21]%
                IOPS         BW
      patch1-8: 1880k        7343MiB/s
      patch1-9: 2252k        8796MiB/s

    memcached (anon): +[119, 123]%
                Ops/sec      KB/sec
      patch1-8: 862768.65    33514.68
      patch1-9: 1911022.12   74234.54

  Mixed workloads:
    fio (buffered I/O): +[75, 77]%
                IOPS         BW
      5.19-rc1: 1279k        4996MiB/s
      patch1-9: 2252k        8796MiB/s

    memcached (anon): +[13, 15]%
                Ops/sec      KB/sec
      5.19-rc1: 1673524.04   65008.87
      patch1-9: 1911022.12   74234.54

  Configurations:
    (changes since patch 6)

    cat mixed.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Signed-off-by: Yu Zhao
Acked-by: Brian Geffon
Acked-by: Jan Alexander Steffens (heftig)
Acked-by: Oleksandr Natalenko
Acked-by: Steven Barrett
Acked-by: Suleiman Souhlal
Tested-by: Daniel Byrne
Tested-by: Donald Carr
Tested-by: Holger Hoffstätte
Tested-by: Konstantin Kharlamov
Tested-by: Shuang Zhai
Tested-by: Sofia Trinh
Tested-by: Vaibhav Jain
Signed-off-by: Yu Zhao
---
 mm/vmscan.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 96 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c579b254fec7..3f83325fdc71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -131,6 +131,12 @@ struct scan_control {
 	/* Always discard instead of demoting to lower tier memory */
 	unsigned int no_demotion:1;
 
+#ifdef CONFIG_LRU_GEN
+	/* help kswapd make better choices among multiple memcgs */
+	unsigned int memcgs_need_aging:1;
+	unsigned long last_reclaimed;
+#endif
+
 	/* Allocation order */
 	s8 order;
 
@@ -4431,6 +4437,19 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
+	sc->last_reclaimed = sc->nr_reclaimed;
+
+	/*
+	 * To reduce the chance of going into the aging path, which can be
+	 * costly, optimistically skip it if the flag below was cleared in the
+	 * eviction path. This improves the overall performance when multiple
+	 * memcgs are available.
+	 */
+	if (!sc->memcgs_need_aging) {
+		sc->memcgs_need_aging = true;
+		return;
+	}
+
 	set_mm_walk(pgdat);
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -4842,7 +4861,8 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	return scanned;
 }
 
-static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			bool *need_swapping)
 {
 	int type;
 	int scanned;
@@ -4905,6 +4925,9 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	sc->nr_reclaimed += reclaimed;
 
+	if (need_swapping && type == LRU_GEN_ANON)
+		*need_swapping = true;
+
 	return scanned;
 }
 
@@ -4914,9 +4937,8 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
  *    reclaim.
  */
 static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
-				    bool can_swap)
+				    bool can_swap, bool *need_aging)
 {
-	bool need_aging;
 	unsigned long nr_to_scan;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
@@ -4926,8 +4948,8 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
 	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
 		return 0;
 
-	need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan);
-	if (!need_aging)
+	*need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan);
+	if (!*need_aging)
 		return nr_to_scan;
 
 	/* skip the aging path at the default priority */
@@ -4944,10 +4966,68 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
 	return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
 }
 
+static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq,
+			      struct scan_control *sc, bool need_swapping)
+{
+	int i;
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (!current_is_kswapd()) {
+		/* age each memcg once to ensure fairness */
+		if (max_seq - seq > 1)
+			return true;
+
+		/* over-swapping can increase allocation latency */
+		if (sc->nr_reclaimed >= sc->nr_to_reclaim && need_swapping)
+			return true;
+
+		/* give this thread a chance to exit and free its memory */
+		if (fatal_signal_pending(current)) {
+			sc->nr_reclaimed += MIN_LRU_BATCH;
+			return true;
+		}
+
+		if (cgroup_reclaim(sc))
+			return false;
+	} else if (sc->nr_reclaimed - sc->last_reclaimed < sc->nr_to_reclaim)
+		return false;
+
+	/* keep scanning at low priorities to ensure fairness */
+	if (sc->priority > DEF_PRIORITY - 2)
+		return false;
+
+	/*
+	 * A minimum amount of work was done under global memory pressure. For
+	 * kswapd, it may be overshooting. For direct reclaim, the target isn't
+	 * met, and yet the allocation may still succeed, since kswapd may have
+	 * caught up. In either case, it's better to stop now, and restart if
+	 * necessary.
+	 */
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		unsigned long wmark;
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+
+		if (!managed_zone(zone))
+			continue;
+
+		wmark = current_is_kswapd() ? high_wmark_pages(zone) : low_wmark_pages(zone);
+		if (wmark > zone_page_state(zone, NR_FREE_PAGES))
+			return false;
+	}
+
+	sc->nr_reclaimed += MIN_LRU_BATCH;
+
+	return true;
+}
+
 static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	struct blk_plug plug;
+	bool need_aging = false;
+	bool need_swapping = false;
 	unsigned long scanned = 0;
+	unsigned long reclaimed = sc->nr_reclaimed;
+	DEFINE_MAX_SEQ(lruvec);
 
 	lru_add_drain();
 
@@ -4967,21 +5047,28 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		else
 			swappiness = 0;
 
-		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, &need_aging);
 		if (!nr_to_scan)
-			break;
+			goto done;
 
-		delta = evict_folios(lruvec, sc, swappiness);
+		delta = evict_folios(lruvec, sc, swappiness, &need_swapping);
 		if (!delta)
-			break;
+			goto done;
 
 		scanned += delta;
 		if (scanned >= nr_to_scan)
 			break;
 
+		if (should_abort_scan(lruvec, max_seq, sc, need_swapping))
+			break;
+
 		cond_resched();
 	}
 
+	/* see the comment in lru_gen_age_node() */
+	if (sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH && !need_aging)
+		sc->memcgs_need_aging = false;
+done:
 	clear_mm_walk();
 
 	blk_finish_plug(&plug);