From patchwork Tue Feb 8 08:18:59 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12738306 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AC3E3C433EF for ; Tue, 8 Feb 2022 08:19:50 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2C5AD6B009C; Tue, 8 Feb 2022 03:19:41 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 274C26B009D; Tue, 8 Feb 2022 03:19:41 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 115326B009E; Tue, 8 Feb 2022 03:19:41 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0205.hostedemail.com [216.40.44.205]) by kanga.kvack.org (Postfix) with ESMTP id 003146B009C for ; Tue, 8 Feb 2022 03:19:40 -0500 (EST) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id C3A6B1801A210 for ; Tue, 8 Feb 2022 08:19:40 +0000 (UTC) X-FDA: 79118913720.20.C98E88B Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) by imf06.hostedemail.com (Postfix) with ESMTP id 629AE180007 for ; Tue, 8 Feb 2022 08:19:40 +0000 (UTC) Received: by mail-yb1-f201.google.com with SMTP id 127-20020a250f85000000b00611ab6484abso33897681ybp.23 for ; Tue, 08 Feb 2022 00:19:40 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc:content-transfer-encoding; bh=0BXyO0PKK7CUSbeV+6Qnj8UbyGthBxtpCJsNt1soUUg=; b=KAx/iZ9svyK4ObUuX8iuJmFV3DiPdun9klzyG5ld1dYnzaBrkASTVzQxY84IbSKHzr 4xxKGTr2M2y0Mfkw+1VlzsT9YDQQE82VA1lRjpiZF1cHgCjyeiY18kxA0jDugTYpznp7 LnewJWKHsv/5TAO10lfqEGyj10AIzXBYZE16i9eN2RLjXdVjMJ9tXVbtLIfDtbxFV/Jo TNoU9JOxwZtsVRF1gcI8WMGFpn+RpNgS7n4HKOblWKkymgC4O3UqvcHnaqz/mN3jQ94F ArYg8fhyNRBNzb5Pzk6w2gskCSclECA9FkpOMpQHgn3qbNJ7oRsA8sqqiIaS0jarHPe+ m5KQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc:content-transfer-encoding; bh=0BXyO0PKK7CUSbeV+6Qnj8UbyGthBxtpCJsNt1soUUg=; b=D4XA6uEglmt4pRVDo7dNB+cPZ8I8UB0NzCSqpwfJCcVzR/ZuwxWP3HV5dHLbXhjbiy 75hcPNQWPAv3VV/6zxXX/4ePp73Wsqzsa8Q1+Q0DMQfjLSiFgaJPTWDCtwsws/nBouID Tm3U/1+sydNneQFiEATrp8L5K5LC6nIIH03PK8M8cp459/CeeR5BgnLvWz6O/gFQLBgT TjM+Hew5p0aKNbnUpBAqz+N9wik4sBaMXjmiU+cfjoe6gIs/h+xrAw5PGMKYVosD58ws i3OIGedIG0D2lkNeFum8wxRR7FnCuQ6A/tJtxt5/rwIDE7MEB58RrJon/JawqNiZQIsb 4yow== X-Gm-Message-State: AOAM5305aZtsgD+iaZk6y7qmH40fSQx8hfJWlzSzNdPGSXpgKUBaNH8J ENhhRRy1l3DQ1NKgDak6FqW+xTfv7oQ= X-Google-Smtp-Source: ABdhPJyfRNtK8fQEnpFOFnjoUd54PtiPMNuR17kYMjb9Xjj4M6nkXaP0KwGeiWwcMRE8Var+bOeyh57f1jQ= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:5f31:19c3:21f5:7300]) (user=yuzhao job=sendgmr) by 2002:a81:ce0b:: with SMTP id t11mr3946599ywi.421.1644308379615; Tue, 08 Feb 2022 00:19:39 -0800 (PST) Date: Tue, 8 Feb 2022 01:18:59 -0700 In-Reply-To: <20220208081902.3550911-1-yuzhao@google.com> Message-Id: <20220208081902.3550911-10-yuzhao@google.com> Mime-Version: 1.0 References: <20220208081902.3550911-1-yuzhao@google.com> X-Mailer: git-send-email 2.35.0.263.gb82422642f-goog Subject: [PATCH v7 09/12] mm: multigenerational LRU: runtime switch From: Yu Zhao To: Andrew Morton , Johannes Weiner , Mel Gorman , Michal Hocko Cc: Andi Kleen , Aneesh Kumar , Barry Song <21cnbao@gmail.com>, Catalin Marinas , Dave Hansen , Hillf Danton , Jens Axboe , Jesse Barnes , Jonathan Corbet , Linus Torvalds , Matthew Wilcox , Michael Larabel , Mike Rapoport , Rik van Riel , Vlastimil Babka , Will Deacon , Ying Huang , linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, page-reclaim@google.com, x86@kernel.org, Yu Zhao , Brian Geffon , Jan Alexander Steffens , Oleksandr Natalenko , Steven Barrett , Suleiman Souhlal , Daniel Byrne , Donald Carr , " =?utf-8?q?Holger_Hoffst=C3=A4tte?= " , Konstantin Kharlamov , Shuang Zhai , Sofia Trinh Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b="KAx/iZ9s"; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf06.hostedemail.com: domain of 3mycCYgYKCAo849rkyqyyqvo.mywvsx47-wwu5kmu.y1q@flex--yuzhao.bounces.google.com designates 209.85.219.201 as permitted sender) smtp.mailfrom=3mycCYgYKCAo849rkyqyyqvo.mywvsx47-wwu5kmu.y1q@flex--yuzhao.bounces.google.com X-Rspamd-Server: rspam03 X-Rspam-User: X-Stat-Signature: tc6xs8tmxdmsdym3f7bbw3nju7aayxzm X-Rspamd-Queue-Id: 629AE180007 X-HE-Tag: 1644308380-837599 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add /sys/kernel/mm/lru_gen/enabled as a runtime switch. Features that can be enabled or disabled include: 0x0001: the multigenerational LRU 0x0002: the page table walks, when arch_has_hw_pte_young() returns true 0x0004: the use of the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y [yYnN]: apply to all the features above E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0007 echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 NB: the page table walks happen on the scale of seconds under heavy memory pressure. Under such a condition, the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention is Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if the page table walks do worsen the mmap_lock contention, the runtime switch can be used to disable this feature. In this case the multigenerational LRU will suffer a minor performance degradation, as shown previously. The use of the accessed bit in non-leaf PMD entries can also be disabled, since this feature wasn't tested on x86 varieties other than Intel and AMD. [1] https://source.android.com/devices/tech/debug/scudo [2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/ [3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/ Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh --- include/linux/cgroup.h | 15 +- include/linux/mm_inline.h | 10 +- include/linux/mmzone.h | 7 + kernel/cgroup/cgroup-internal.h | 1 - mm/Kconfig | 6 + mm/vmscan.c | 236 +++++++++++++++++++++++++++++++- 6 files changed, 267 insertions(+), 8 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 75c151413fda..b145025f3eac 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp) css_put(&cgrp->self); } +extern struct mutex cgroup_mutex; + +static inline void cgroup_lock(void) +{ + mutex_lock(&cgroup_mutex); +} + +static inline void cgroup_unlock(void) +{ + mutex_unlock(&cgroup_mutex); +} + /** * task_css_set_check - obtain a task's css_set with extra access conditions * @task: the task to obtain css_set for @@ -446,7 +458,6 @@ static inline void cgroup_put(struct cgroup *cgrp) * as locks used during the cgroup_subsys::attach() methods. */ #ifdef CONFIG_PROVE_RCU -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; #define task_css_set_check(task, __c) \ rcu_dereference_check((task)->cgroups, \ @@ -707,6 +718,8 @@ struct cgroup; static inline u64 cgroup_id(const struct cgroup *cgrp) { return 1; } static inline void css_get(struct cgroup_subsys_state *css) {} static inline void css_put(struct cgroup_subsys_state *css) {} +static inline void cgroup_lock(void) {} +static inline void cgroup_unlock(void) {} static inline int cgroup_attach_task_all(struct task_struct *from, struct task_struct *t) { return 0; } static inline int cgroupstats_build(struct cgroupstats *stats, diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 37c8a0ede4ff..130d62751e05 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -96,7 +96,15 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio) static inline bool lru_gen_enabled(void) { - return true; +#ifdef CONFIG_LRU_GEN_ENABLED + DECLARE_STATIC_KEY_TRUE(lru_gen_caps[NR_LRU_GEN_CAPS]); + + return static_branch_likely(&lru_gen_caps[LRU_GEN_CORE]); +#else + DECLARE_STATIC_KEY_FALSE(lru_gen_caps[NR_LRU_GEN_CAPS]); + + return static_branch_unlikely(&lru_gen_caps[LRU_GEN_CORE]); +#endif } static inline bool lru_gen_in_fault(void) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fa0a7a84ee58..4ecec9152761 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -311,6 +311,13 @@ struct page_vma_mapped_walk; #ifdef CONFIG_LRU_GEN +enum { + LRU_GEN_CORE, + LRU_GEN_MM_WALK, + LRU_GEN_NONLEAF_YOUNG, + NR_LRU_GEN_CAPS +}; + #define MIN_LRU_BATCH BITS_PER_LONG #define MAX_LRU_BATCH (MIN_LRU_BATCH * 128) diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 6e36e854b512..929ed3bf1a7c 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -165,7 +165,6 @@ struct cgroup_mgctx { #define DEFINE_CGROUP_MGCTX(name) \ struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name) -extern struct mutex cgroup_mutex; extern spinlock_t css_set_lock; extern struct cgroup_subsys *cgroup_subsys[]; extern struct list_head cgroup_roots; diff --git a/mm/Kconfig b/mm/Kconfig index e899623d5df0..aae72b740d8a 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -903,6 +903,12 @@ config LRU_GEN Documentation/admin-guide/mm/multigen_lru.rst and Documentation/vm/multigen_lru.rst for details. +config LRU_GEN_ENABLED + bool "Enable by default" + depends on LRU_GEN + help + This option enables the multigenerational LRU by default. + config NR_LRU_GENS int "Max number of generations" depends on LRU_GEN diff --git a/mm/vmscan.c b/mm/vmscan.c index fc09b6c10624..700c35f2a030 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3066,6 +3066,12 @@ enum { TYPE_FILE, }; +#ifdef CONFIG_LRU_GEN_ENABLED +DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); +#else +DEFINE_STATIC_KEY_ARRAY_FALSE(lru_gen_caps, NR_LRU_GEN_CAPS); +#endif + /****************************************************************************** * shorthand helpers ******************************************************************************/ @@ -3102,6 +3108,15 @@ static int folio_lru_tier(struct folio *folio) return lru_tier_from_refs(refs); } +static bool get_cap(int cap) +{ +#ifdef CONFIG_LRU_GEN_ENABLED + return static_branch_likely(&lru_gen_caps[cap]); +#else + return static_branch_unlikely(&lru_gen_caps[cap]); +#endif +} + static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid) { struct pglist_data *pgdat = NODE_DATA(nid); @@ -3893,7 +3908,8 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area goto next; if (!pmd_trans_huge(pmd[i])) { - if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)) + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && + get_cap(LRU_GEN_NONLEAF_YOUNG)) pmdp_test_and_clear_young(vma, addr, pmd + i); goto next; } @@ -4000,10 +4016,12 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end, priv->mm_stats[MM_PMD_TOTAL]++; #ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG - if (!pmd_young(val)) - continue; + if (get_cap(LRU_GEN_NONLEAF_YOUNG)) { + if (!pmd_young(val)) + continue; - walk_pmd_range_locked(pud, addr, vma, walk, &pos); + walk_pmd_range_locked(pud, addr, vma, walk, &pos); + } #endif if (!priv->full_scan && !test_bloom_filter(priv->lruvec, priv->max_seq, pmd + i)) continue; @@ -4235,7 +4253,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, * handful of PTEs. Spreading the work out over a period of time usually * is less efficient, but it avoids bursty page faults. */ - if (!full_scan && !arch_has_hw_pte_young()) { + if (!full_scan && (!arch_has_hw_pte_young() || !get_cap(LRU_GEN_MM_WALK))) { success = iterate_mm_list_nowalk(lruvec, max_seq); goto done; } @@ -4941,6 +4959,211 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc current->reclaim_state->mm_walk = NULL; } +/****************************************************************************** + * state change + ******************************************************************************/ + +static bool __maybe_unused state_is_valid(struct lruvec *lruvec) +{ + struct lru_gen_struct *lrugen = &lruvec->lrugen; + + if (lrugen->enabled) { + enum lru_list lru; + + for_each_evictable_lru(lru) { + if (!list_empty(&lruvec->lists[lru])) + return false; + } + } else { + int gen, type, zone; + + for_each_gen_type_zone(gen, type, zone) { + if (!list_empty(&lrugen->lists[gen][type][zone])) + return false; + + /* unlikely but not a bug when reset_batch_size() is pending */ + VM_WARN_ON(lrugen->nr_pages[gen][type][zone]); + } + } + + return true; +} + +static bool fill_evictable(struct lruvec *lruvec) +{ + enum lru_list lru; + int remaining = MAX_LRU_BATCH; + + for_each_evictable_lru(lru) { + int type = is_file_lru(lru); + bool active = is_active_lru(lru); + struct list_head *head = &lruvec->lists[lru]; + + while (!list_empty(head)) { + bool success; + struct folio *folio = lru_to_folio(head); + + VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio); + VM_BUG_ON_FOLIO(folio_test_active(folio) != active, folio); + VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio); + VM_BUG_ON_FOLIO(folio_lru_gen(folio) < MAX_NR_GENS, folio); + + lruvec_del_folio(lruvec, folio); + success = lru_gen_add_folio(lruvec, folio, false); + VM_BUG_ON(!success); + + if (!--remaining) + return false; + } + } + + return true; +} + +static bool drain_evictable(struct lruvec *lruvec) +{ + int gen, type, zone; + int remaining = MAX_LRU_BATCH; + + for_each_gen_type_zone(gen, type, zone) { + struct list_head *head = &lruvec->lrugen.lists[gen][type][zone]; + + while (!list_empty(head)) { + bool success; + struct folio *folio = lru_to_folio(head); + + VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio); + VM_BUG_ON_FOLIO(folio_test_active(folio), folio); + VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio); + VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio); + + success = lru_gen_del_folio(lruvec, folio, false); + VM_BUG_ON(!success); + lruvec_add_folio(lruvec, folio); + + if (!--remaining) + return false; + } + } + + return true; +} + +static void lru_gen_change_state(bool enable) +{ + static DEFINE_MUTEX(state_mutex); + + struct mem_cgroup *memcg; + + cgroup_lock(); + cpus_read_lock(); + get_online_mems(); + mutex_lock(&state_mutex); + + if (enable == lru_gen_enabled()) + goto unlock; + + if (enable) + static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); + else + static_branch_disable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + int nid; + + for_each_node(nid) { + struct lruvec *lruvec = get_lruvec(memcg, nid); + + if (!lruvec) + continue; + + spin_lock_irq(&lruvec->lru_lock); + + VM_BUG_ON(!seq_is_valid(lruvec)); + VM_BUG_ON(!state_is_valid(lruvec)); + + lruvec->lrugen.enabled = enable; + + while (!(enable ? fill_evictable(lruvec) : drain_evictable(lruvec))) { + spin_unlock_irq(&lruvec->lru_lock); + cond_resched(); + spin_lock_irq(&lruvec->lru_lock); + } + + spin_unlock_irq(&lruvec->lru_lock); + } + + cond_resched(); + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); +unlock: + mutex_unlock(&state_mutex); + put_online_mems(); + cpus_read_unlock(); + cgroup_unlock(); +} + +/****************************************************************************** + * sysfs interface + ******************************************************************************/ + +static ssize_t show_enable(struct kobject *kobj, struct kobj_attribute *attr, char *buf) +{ + unsigned int caps = 0; + + if (get_cap(LRU_GEN_CORE)) + caps |= BIT(LRU_GEN_CORE); + + if (arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK)) + caps |= BIT(LRU_GEN_MM_WALK); + + if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) && get_cap(LRU_GEN_NONLEAF_YOUNG)) + caps |= BIT(LRU_GEN_NONLEAF_YOUNG); + + return snprintf(buf, PAGE_SIZE, "0x%04x\n", caps); +} + +static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t len) +{ + int i; + unsigned int caps; + + if (tolower(*buf) == 'n') + caps = 0; + else if (tolower(*buf) == 'y') + caps = -1; + else if (kstrtouint(buf, 0, &caps)) + return -EINVAL; + + for (i = 0; i < NR_LRU_GEN_CAPS; i++) { + bool enable = caps & BIT(i); + + if (i == LRU_GEN_CORE) + lru_gen_change_state(enable); + else if (enable) + static_branch_enable(&lru_gen_caps[i]); + else + static_branch_disable(&lru_gen_caps[i]); + } + + return len; +} + +static struct kobj_attribute lru_gen_enabled_attr = __ATTR( + enabled, 0644, show_enable, store_enable +); + +static struct attribute *lru_gen_attrs[] = { + &lru_gen_enabled_attr.attr, + NULL +}; + +static struct attribute_group lru_gen_attr_group = { + .name = "lru_gen", + .attrs = lru_gen_attrs, +}; + /****************************************************************************** * initialization ******************************************************************************/ @@ -5007,6 +5230,9 @@ static int __init init_lru_gen(void) BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS); BUILD_BUG_ON(sizeof(MM_STAT_CODES) != NR_MM_STATS + 1); + if (sysfs_create_group(mm_kobj, &lru_gen_attr_group)) + pr_err("lru_gen: failed to create sysfs group\n"); + return 0; }; late_initcall(init_lru_gen);