From patchwork Thu Dec 22 04:18:58 2022
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 13079360
Date: Wed, 21 Dec 2022 21:18:58 -0700
Message-Id: <20221222041905.2431096-1-yuzhao@google.com>
Subject: [PATCH mm-unstable v3 0/8] mm: multi-gen LRU: memcg LRU
From: Yu Zhao
To: Andrew Morton
Cc: Johannes Weiner, Jonathan Corbet, Michael Larabel, Michal Hocko,
 Mike Rapoport, Roman Gushchin, Suren Baghdasaryan,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-mm@google.com,
 Yu Zhao

What's new
==========
1. Rebased to the latest mm-unstable and resolved the conflict with
   commit 8032bf1233a7 ("treewide: use get_random_u32_below() instead
   of deprecated function").
2. Added two comprehensive benchmarks:
   https://lore.kernel.org/r/20221220214923.1229538-1-yuzhao@google.com/
   https://lore.kernel.org/r/20221221000748.1374772-1-yuzhao@google.com/

Overview
========
An memcg LRU is a per-node LRU of memcgs.
It is also an LRU of LRUs, since each node and memcg combination has
an LRU of folios (see mem_cgroup_lruvec()). Its goal is to improve the
scalability of global reclaim, which is critical to system-wide memory
overcommit in data centers. Note that memcg reclaim is currently out
of scope.

Its memory overhead is one pointer per lruvec and a negligible amount
per pglist_data. In terms of traversing memcgs during global reclaim,
it improves the best-case complexity from O(n) to O(1) and does not
affect the worst-case complexity O(n). Therefore, on average, it has
a sublinear complexity in contrast to the current linear complexity.

The basic structure of an memcg LRU can be understood by analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations), i.e., the counterparts
   to the active and the inactive;
2. The increment of max_seq triggers promotion, i.e., the counterpart
   to activation;
3. Other events trigger similar operations, e.g., offlining an memcg
   triggers demotion, i.e., the counterpart to deactivation.

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
   and reduces latency without affecting fairness over some time.

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/

The following is a simple test to quickly verify its effectiveness.

  Test design:
  1. Create multiple memcgs.
  2. Each memcg contains a job (fio).
  3. All jobs access the same amount of memory randomly.
  4. The system does not experience global memory pressure.
  5. Periodically write to the root memory.reclaim.

  Desired outcome:
  1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
     over mean(pgsteal) is close to 0%.
  2. The total pgsteal is close to the total requested through
     memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
     to 100%.

  Actual outcome [1]:
                                     MGLRU off    MGLRU on
    stddev(pgsteal) / mean(pgsteal)  75%          20%
    sum(pgsteal) / sum(requested)    425%         95%

  ####################################################################
  MEMCGS=128

  # Create one memcg per job.
  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      mkdir /sys/fs/cgroup/memcg$memcg
  done

  # Move the current subshell into its memcg, then run fio there.
  start() {
      echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

      fio --name=memcg$memcg --numjobs=1 --ioengine=mmap \
          --filename=/dev/zero --size=1920M --rw=randrw \
          --rate=64m,64m --random_distribution=random \
          --fadvise_hint=0 --time_based --runtime=10h \
          --group_reporting --minimal
  }

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      start &
  done

  # Let the jobs warm up before reclaiming.
  sleep 600

  # Request 256m from the root memcg every 6 seconds for an hour.
  for ((i = 0; i < 600; i++)); do
      echo 256m >/sys/fs/cgroup/memory.reclaim
      sleep 6
  done

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
  done
  ####################################################################

[1]: This was obtained from running the above script (which touches
     less than 256GB of memory) on an EPYC 7B13 with 512GB DRAM for
     over an hour.

Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 131 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 769 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 693 insertions(+), 271 deletions(-)
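
The test script prints raw pgsteal counts but not the two ratios reported under "Actual outcome". One way to derive them is sketched below. The helper names requested_pages and pgsteal_summary are hypothetical and not part of the posted script. The sketch assumes 4KiB pages (so that 256m corresponds to 65536 pages), assumes a population rather than sample standard deviation, and treats sum(requested) as the amount written to memory.reclaim, whether or not every write was fully satisfied. The posted numbers may have been computed differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the original test script.

# Total pages requested through memory.reclaim:
#   iterations * reclaim_mib MiB, expressed in page_kib KiB pages.
# page_kib defaults to 4 (assumption: 4KiB base pages).
requested_pages() {
    local iterations=$1 reclaim_mib=$2 page_kib=${3:-4}
    echo $(( iterations * reclaim_mib * 1024 / page_kib ))
}

# Reads "pgsteal <count>" lines on stdin (one per memcg) and prints
# stddev(pgsteal)/mean(pgsteal) and sum(pgsteal)/sum(requested) in percent.
pgsteal_summary() {
    local requested=$1
    awk -v requested="$requested" '
        $1 == "pgsteal" { n++; sum += $2; sumsq += $2 * $2 }
        END {
            if (n == 0 || sum == 0) {
                print "no pgsteal samples" > "/dev/stderr"
                exit 1
            }
            mean = sum / n
            var = sumsq / n - mean * mean   # population variance (assumption)
            if (var < 0) var = 0            # guard against rounding error
            printf "stddev/mean: %.0f%%\n", 100 * sqrt(var) / mean
            printf "sum/requested: %.0f%%\n", 100 * sum / requested
        }'
}

# Possible usage, replacing the final loop of the test script:
#   for ((memcg = 0; memcg < MEMCGS; memcg++)); do
#       grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
#   done | pgsteal_summary "$(requested_pages 600 256)"
```

With the script's parameters (600 writes of 256m), requested_pages yields 39321600 pages, which is the denominator for the second ratio under the 4KiB assumption.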