From patchwork Wed Dec 21 00:12:00 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 13078273
Date: Tue, 20 Dec 2022 17:12:00 -0700
Message-Id: <20221221001207.1376119-1-yuzhao@google.com>
X-Mailer: git-send-email 2.39.0.314.g84b9a713c41-goog
Subject: [PATCH mm-unstable v2 0/8] mm: multi-gen LRU: memcg LRU
From: Yu Zhao
To: Andrew Morton
Cc: Johannes Weiner, Jonathan Corbet, Michael Larabel, Michal Hocko,
 Mike Rapoport, Roman Gushchin, Suren Baghdasaryan,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-mm@google.com,
 Yu Zhao
Sender: owner-linux-mm@kvack.org
Precedence: bulk

What's new
==========
1. Rebased to the latest mm-unstable.
2. Added two comprehensive benchmarks:
   https://lore.kernel.org/r/20221220214923.1229538-1-yuzhao@google.com/
   https://lore.kernel.org/r/20221221000748.1374772-1-yuzhao@google.com/

Overview
========
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).

Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.

Its memory bloat is a pointer to each lruvec and negligible to each
pglist_data. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not
affect the worst-case complexity O(n). Therefore, on average, it has
a sublinear complexity in contrast to the current linear complexity.

The basic structure of an memcg LRU can be understood by an analogy
to the active/inactive LRU (of folios):
1. It has the young and the old (generations), the counterparts to
   the active and the inactive;
2. The increment of max_seq triggers promotion, the counterpart to
   activation;
3. Other events, e.g., offlining an memcg, trigger similar
   operations, e.g., demotion, the counterpart to deactivation.

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
   reduces latency without affecting fairness over some time.

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221221001207.1376119-7-yuzhao@google.com/

The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.

Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.

Desired outcome:
1. All memcgs have similar pgsteal counts, i.e.,
   stddev(pgsteal)/mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
   memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
   100%.

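For readers who want to reproduce the two ratios, here is a minimal
sketch of how they could be derived from "pgsteal <pages>" lines, one
per memcg, as read from each memcg's memory.stat. The helper names
pgsteal_spread and pgsteal_efficiency are hypothetical and not part of
this series. The sketch assumes a population (not sample) standard
deviation, a 4096-byte page size unless one is passed in, and that
pgsteal is counted in pages; none of these are stated in this letter.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the patch series.

# Reads "pgsteal <pages>" lines on stdin (one per memcg) and prints
# stddev(pgsteal) / mean(pgsteal) as a percentage.
# Assumption: population standard deviation.
pgsteal_spread() {
    awk '$1 == "pgsteal" { n++; sum += $2; sumsq += $2 * $2 }
         END {
             if (n == 0 || sum == 0) { print "n/a"; exit }
             mean = sum / n
             var = sumsq / n - mean * mean
             if (var < 0) var = 0    # guard against rounding error
             printf "%.0f%%\n", 100 * sqrt(var) / mean
         }'
}

# Reads the same input and prints sum(pgsteal) / sum(requested) as a
# percentage. $1 is the total requested in bytes; $2 is the page size
# in bytes (assumed to be 4096 if omitted).
pgsteal_efficiency() {
    awk -v requested="$1" -v page_size="${2:-4096}" '
         $1 == "pgsteal" { sum += $2 }
         END {
             if (requested + 0 == 0) { print "n/a"; exit }
             printf "%.0f%%\n", 100 * sum * page_size / requested
         }'
}

# Example usage, assuming stats.txt holds the collected pgsteal lines
# and 600 writes of 256 MiB each were made to memory.reclaim:
#   pgsteal_spread < stats.txt
#   pgsteal_efficiency $((600 * 256 * 1024 * 1024)) < stats.txt
```

Both helpers ignore any line whose first field is not exactly
"pgsteal", so other memory.stat counters in the input are harmless.
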
Actual outcome [1]:
                                   MGLRU off    MGLRU on
  stddev(pgsteal) / mean(pgsteal)  75%          20%
  sum(pgsteal) / sum(requested)    425%         95%

####################################################################
MEMCGS=128

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

sleep 600

for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################

[1]: This was obtained from running the above script (touches less
     than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
     hour.

Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 131 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 768 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 692 insertions(+), 271 deletions(-)