From patchwork Sun Sep 18 20:47:54 2022
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 12979592
Date: Sun, 18 Sep 2022 14:47:54 -0600
In-Reply-To: <20220918204755.3135720-1-yuzhao@google.com>
Message-Id: <20220918204755.3135720-10-yuzhao@google.com>
Mime-Version: 1.0
References: <20220918204755.3135720-1-yuzhao@google.com>
X-Mailer: git-send-email 2.37.3.968.ga6b4b080e4-goog
Subject: [PATCH v14-fix 10/11] mm: multi-gen LRU: fix long-tailed direct reclaim latency
From: Yu Zhao
To: Andrew Morton
Cc: linux-mm@kvack.org, Yu Zhao

Long-tailed direct reclaim latency was seen on high-memory (TBs) machines:
MGLRU is better at the 99th percentile but worse at the 99.9th.

It turned out that the old direct reclaim backoff, which tries to enforce
minimum fairness among all eligible memcgs, over-swapped by about
(total_mem >> DEF_PRIORITY) - nr_to_reclaim (a back-of-envelope estimate of
this amount is sketched at the end of this message):

    /* adjust priority if memcg is offline or the target is met */
    if (!mem_cgroup_online(memcg))
        priority = 0;
    else if (sc->nr_reclaimed - reclaimed >= sc->nr_to_reclaim)
        priority = DEF_PRIORITY;
    else
        priority = sc->priority;

The new backoff, which pulls the plug on swapping once the target is met,
trades some fairness for curtailed latency. Specifically, in
should_abort_scan():

    /* over-swapping can increase allocation latency */
    if (sc->nr_reclaimed >= sc->nr_to_reclaim && need_swapping)
        return true;

The fundamental problem is that the backoff requires a sophisticated model,
and the previous one was oversimplified. The new one may still be, but at
least it can handle a couple more corner cases on top of the above:

    /* age each memcg once to ensure fairness */
    if (max_seq - seq > 1)
        return true;

as well as the NR_FREE_PAGES check at the bottom of should_abort_scan().
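For a rough sense of the magnitude, here is a back-of-envelope sketch rather
than a measurement: the 4 TiB machine size and 4 KiB page size are only
illustrative; DEF_PRIORITY is 12 in mainline, and a direct reclaimer
typically asks for SWAP_CLUSTER_MAX (32) pages:

    total_mem >> DEF_PRIORITY  = 4 TiB >> 12       = 1 GiB
    nr_to_reclaim              = 32 pages * 4 KiB  = 128 KiB
    over-swap                 ~= 1 GiB - 128 KiB  ~= 1 GiB

i.e., the old backoff could swap several thousand times the requested amount
before it kicked in.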
Signed-off-by: Yu Zhao
---
 mm/vmscan.c | 105 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 70 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6eab73bdfb9..50764b2d462f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -135,10 +135,9 @@ struct scan_control {
 	unsigned int no_demotion:1;
 
 #ifdef CONFIG_LRU_GEN
-	/* help make better choices when multiple memcgs are available */
+	/* help kswapd make better choices among multiple memcgs */
 	unsigned int memcgs_need_aging:1;
-	unsigned int memcgs_need_swapping:1;
-	unsigned int memcgs_avoid_swapping:1;
+	unsigned long last_reclaimed;
 #endif
 
 	/* Allocation order */
@@ -4524,22 +4523,19 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
+	sc->last_reclaimed = sc->nr_reclaimed;
+
 	/*
-	 * To reduce the chance of going into the aging path or swapping, which
-	 * can be costly, optimistically skip them unless their corresponding
-	 * flags were cleared in the eviction path. This improves the overall
-	 * performance when multiple memcgs are available.
+	 * To reduce the chance of going into the aging path, which can be
+	 * costly, optimistically skip it if the flag below was cleared in the
+	 * eviction path. This improves the overall performance when multiple
+	 * memcgs are available.
 	 */
 	if (!sc->memcgs_need_aging) {
 		sc->memcgs_need_aging = true;
-		sc->memcgs_avoid_swapping = !sc->memcgs_need_swapping;
-		sc->memcgs_need_swapping = true;
 		return;
 	}
 
-	sc->memcgs_need_swapping = true;
-	sc->memcgs_avoid_swapping = true;
-
 	set_mm_walk(pgdat);
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -5035,7 +5031,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	sc->nr_reclaimed += reclaimed;
 
-	if (type == LRU_GEN_ANON && need_swapping)
+	if (need_swapping && type == LRU_GEN_ANON)
 		*need_swapping = true;
 
 	return scanned;
@@ -5047,19 +5043,13 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
  *    reclaim.
  */
 static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
-				    bool can_swap, unsigned long reclaimed, bool *need_aging)
+				    bool can_swap, bool *need_aging)
 {
-	int priority;
 	unsigned long nr_to_scan;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
-	if (fatal_signal_pending(current)) {
-		sc->nr_reclaimed += MIN_LRU_BATCH;
-		return 0;
-	}
-
 	if (mem_cgroup_below_min(memcg) ||
 	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
 		return 0;
@@ -5068,15 +5058,7 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
 	if (!nr_to_scan)
 		return 0;
 
-	/* adjust priority if memcg is offline or the target is met */
-	if (!mem_cgroup_online(memcg))
-		priority = 0;
-	else if (sc->nr_reclaimed - reclaimed >= sc->nr_to_reclaim)
-		priority = DEF_PRIORITY;
-	else
-		priority = sc->priority;
-
-	nr_to_scan >>= priority;
+	nr_to_scan >>= mem_cgroup_online(memcg) ? sc->priority : 0;
 	if (!nr_to_scan)
 		return 0;
 
@@ -5084,7 +5066,7 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
 		return nr_to_scan;
 
 	/* skip the aging path at the default priority */
-	if (priority == DEF_PRIORITY)
+	if (sc->priority == DEF_PRIORITY)
 		goto done;
 
 	/* leave the work to lru_gen_age_node() */
@@ -5097,6 +5079,60 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
 	return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ?
 	       nr_to_scan : 0;
 }
 
+static bool should_abort_scan(struct lruvec *lruvec, unsigned long seq,
+			      struct scan_control *sc, bool need_swapping)
+{
+	int i;
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (!current_is_kswapd()) {
+		/* age each memcg once to ensure fairness */
+		if (max_seq - seq > 1)
+			return true;
+
+		/* over-swapping can increase allocation latency */
+		if (sc->nr_reclaimed >= sc->nr_to_reclaim && need_swapping)
+			return true;
+
+		/* give this thread a chance to exit and free its memory */
+		if (fatal_signal_pending(current)) {
+			sc->nr_reclaimed += MIN_LRU_BATCH;
+			return true;
+		}
+
+		if (cgroup_reclaim(sc))
+			return false;
+	} else if (sc->nr_reclaimed - sc->last_reclaimed < sc->nr_to_reclaim)
+		return false;
+
+	/* keep scanning at low priorities to ensure fairness */
+	if (sc->priority > DEF_PRIORITY - 2)
+		return false;
+
+	/*
+	 * A minimum amount of work was done under global memory pressure. For
+	 * kswapd, it may be overshooting. For direct reclaim, the target isn't
+	 * met, and yet the allocation may still succeed, since kswapd may have
+	 * caught up. In either case, it's better to stop now, and restart if
+	 * necessary.
+	 */
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		unsigned long wmark;
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+
+		if (!managed_zone(zone))
+			continue;
+
+		wmark = current_is_kswapd() ? high_wmark_pages(zone) : low_wmark_pages(zone);
+		if (wmark > zone_page_state(zone, NR_FREE_PAGES))
+			return false;
+	}
+
+	sc->nr_reclaimed += MIN_LRU_BATCH;
+
+	return true;
+}
+
 static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	struct blk_plug plug;
@@ -5104,6 +5140,7 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 	bool need_swapping = false;
 	unsigned long scanned = 0;
 	unsigned long reclaimed = sc->nr_reclaimed;
+	DEFINE_MAX_SEQ(lruvec);
 
 	lru_add_drain();
 
@@ -5123,7 +5160,7 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		else
 			swappiness = 0;
 
-		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, reclaimed, &need_aging);
+		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness, &need_aging);
 		if (!nr_to_scan)
 			goto done;
 
@@ -5135,17 +5172,15 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		if (scanned >= nr_to_scan)
 			break;
 
-		if (sc->memcgs_avoid_swapping && swappiness < 200 && need_swapping)
+		if (should_abort_scan(lruvec, max_seq, sc, need_swapping))
 			break;
 
 		cond_resched();
 	}
 
 	/* see the comment in lru_gen_age_node() */
-	if (!need_aging)
+	if (sc->nr_reclaimed - reclaimed >= MIN_LRU_BATCH && !need_aging)
 		sc->memcgs_need_aging = false;
-	if (!need_swapping)
-		sc->memcgs_need_swapping = false;
 done:
 	clear_mm_walk();