From patchwork Thu May 20 06:53:55 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12269249 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-26.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A77B2C433B4 for ; Thu, 20 May 2021 06:54:34 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 5313E61186 for ; Thu, 20 May 2021 06:54:34 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5313E61186 Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id B7BE66B0082; Thu, 20 May 2021 02:54:24 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B55236B0083; Thu, 20 May 2021 02:54:24 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9A8796B0085; Thu, 20 May 2021 02:54:24 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0083.hostedemail.com [216.40.44.83]) by kanga.kvack.org (Postfix) with ESMTP id 67D726B0082 for ; Thu, 20 May 2021 02:54:24 -0400 (EDT) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 0B008BEF5 for ; Thu, 20 May 2021 06:54:24 +0000 (UTC) X-FDA: 78160695648.08.BCEB181 Received: from mail-qk1-f201.google.com (mail-qk1-f201.google.com [209.85.222.201]) by imf19.hostedemail.com (Postfix) with ESMTP id 5D42F90009EA for ; Thu, 20 May 2021 06:54:22 +0000 (UTC) Received: by mail-qk1-f201.google.com with SMTP id i141-20020a379f930000b02902e94f6d938dso11715781qke.5 for ; Wed, 19 May 2021 23:54:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=tmnYUMpAe2KoFw1JK5DEOLa6QKjWz+/jEuUps2TjE0M=; b=DnoKJgXGcZrakGIsy2wdggTSzr8gNr5Cga30A6c8a6Hf9x2dffeKxupvvvPjuu1gFH aGdEv0BQdUdQtd0c3PTB1yYrqJsJcPp5S6L8/JeU1mBsAkTgRAJC+WwYC2oJaN+K/+rh m7SHkphIH6F6L72NTt2b96CmRop8AS7h70mGFoqBtxgJZEEG0JjTr93/mLmeGl1DrblN ViY8g/jh939e21AJjULOIlpeBbxplek6u+fXKVxsYdCV2JKDsA0LwaCxMlx08fCc/j9n pt2cBRltMZSTctDaJlkHWcEOuGP8bGJA/JzG0MeUfva0r9KcYGAVy5zcvXU4Mkz8AXA/ v3JQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=tmnYUMpAe2KoFw1JK5DEOLa6QKjWz+/jEuUps2TjE0M=; b=GJ3+KfFLDiyDWff8BI0XRZ6MxTNbc9lS8S26cOZfFu2x8N3FNChexR1fhK7Zz82ZIu DxAD/wys+lh/+XZzsIxZ3FtnK/Y0ZX1hrWYPi4q1QQlU5Cx/0QtkN+pBcHG/RsxzNi+Y Txg+U4GU2lnfQjEqbtldPkq+plqSjFWO9BGMOjZLVAZhvRF3D0EnsTrKla/2LIcR+YZw Uk38p08r4pd+9AdCACLhyxQaqJFkPLEWizT+51H6kdXMBTYi4tyAXbxzcnrON1xlvNtB efuo0wfUmvLpBCVH7jWJaTDOHOhG6ZoCybTBkIDTzaD4dnH6xRn/pKHiObTRRv6wLR20 ZPzg== X-Gm-Message-State: AOAM533OmQWDZ6z4hmSxbSUzfci1mZR6ZO0j3ePJ8MmGX7wA4DaLC++y t7qEsAj/SK+7VxLP2vlrpEzVmG49hQqx+pw3s+DudMdDLFID68uz2kZjJT+f75OTx/UGOZTGZwB /uSh3wL2Yn+4aZq1WAnwTTXLCCfehqfhsC+3/52xHgDshr40neTpSPoTs X-Google-Smtp-Source: ABdhPJyfR8302KuxyD/mIOKCO+jxW1RXoZnlJejF8SLfwvo9YuRoFSL43tZzQ7DdKcZXlLzVckFytBbp+9s= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:595d:62ee:f08:8e83]) (user=yuzhao job=sendgmr) by 2002:ad4:5a52:: with SMTP id ej18mr3968319qvb.31.1621493662894; Wed, 19 May 2021 23:54:22 -0700 (PDT) Date: Thu, 20 May 2021 00:53:55 -0600 In-Reply-To: <20210520065355.2736558-1-yuzhao@google.com> Message-Id: <20210520065355.2736558-15-yuzhao@google.com> Mime-Version: 1.0 References: <20210520065355.2736558-1-yuzhao@google.com> X-Mailer: git-send-email 2.31.1.751.gd2f1c929bd-goog Subject: [PATCH v3 14/14] mm: multigenerational lru: documentation From: Yu Zhao To: linux-mm@kvack.org Cc: Alex Shi , Andi Kleen , Andrew Morton , Dave Chinner , Dave Hansen , Donald Carr , Hillf Danton , Jens Axboe , Johannes Weiner , Jonathan Corbet , Joonsoo Kim , Konstantin Kharlamov , Marcus Seyfarth , Matthew Wilcox , Mel Gorman , Miaohe Lin , Michael Larabel , Michal Hocko , Michel Lespinasse , Rik van Riel , Roman Gushchin , Tim Chen , Vlastimil Babka , Yang Shi , Ying Huang , Zi Yan , linux-kernel@vger.kernel.org, lkp@lists.01.org, page-reclaim@google.com, Yu Zhao , Konstantin Kharlamov Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20161025 header.b=DnoKJgXG; spf=pass (imf19.hostedemail.com: domain of 3ngemYAYKCFEHDI0t7z77z4x.v75416DG-553Etv3.7Az@flex--yuzhao.bounces.google.com designates 209.85.222.201 as permitted sender) smtp.mailfrom=3ngemYAYKCFEHDI0t7z77z4x.v75416DG-553Etv3.7Az@flex--yuzhao.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 5D42F90009EA X-Stat-Signature: 4uf9wcw5mb5xsxyn4mxs873cx7goh4t4 X-HE-Tag: 1621493662-28836 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add Documentation/vm/multigen_lru.rst. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov --- Documentation/vm/index.rst | 1 + Documentation/vm/multigen_lru.rst | 143 ++++++++++++++++++++++++++++++ 2 files changed, 144 insertions(+) create mode 100644 Documentation/vm/multigen_lru.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index eff5fbd492d0..c353b3f55924 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -17,6 +17,7 @@ various features of the Linux memory management swap_numa zswap + multigen_lru Kernel developers MM documentation ================================== diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..a18416ed7e92 --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,143 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Multigenerational LRU +===================== + +Quick Start +=========== +Build Options +------------- +:Required: Set ``CONFIG_LRU_GEN=y``. + +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by + default. + +:Optional: Change ``CONFIG_NR_LRU_GENS`` to a number ``X`` to support + a maximum of ``X`` generations. + +:Optional: Change ``CONFIG_TIERS_PER_GEN`` to a number ``Y`` to + support a maximum of ``Y`` tiers per generation. + +Runtime Options +--------------- +:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the + feature was not turned on by default. + +:Optional: Change ``/sys/kernel/mm/lru_gen/spread`` to a number ``N`` + to spread pages out across ``N+1`` generations. ``N`` should be less + than ``X``. Larger values make the background aging more aggressive. + +:Optional: Read ``/sys/kernel/debug/lru_gen`` to verify the feature. + This file has the following output: + +:: + + memcg memcg_id memcg_path + node node_id + min_gen birth_time anon_size file_size + ... + max_gen birth_time anon_size file_size + +Given a memcg and a node, ``min_gen`` is the oldest generation +(number) and ``max_gen`` is the youngest. Birth time is in +milliseconds. The sizes of anon and file types are in pages. + +Recipes +------- +:Android on ARMv8.1+: ``X=4``, ``Y=3`` and ``N=0``. + +:Android on pre-ARMv8.1 CPUs: Not recommended due to the lack of + ``ARM64_HW_AFDBM``. + +:Laptops and workstations running Chrome on x86_64: Use the default + values. + +:Working set estimation: Write ``+ memcg_id node_id gen [swappiness]`` + to ``/sys/kernel/debug/lru_gen`` to account referenced pages to + generation ``max_gen`` and create the next generation ``max_gen+1``. + ``gen`` should be equal to ``max_gen``. A swap file and a non-zero + ``swappiness`` are required to scan anon type. If swapping is not + desired, set ``vm.swappiness`` to ``0``. + +:Proactive reclaim: Write ``- memcg_id node_id gen [swappiness] + [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to evict + generations less than or equal to ``gen``. ``gen`` should be less + than ``max_gen-1`` as ``max_gen`` and ``max_gen-1`` are active + generations and therefore protected from the eviction. Use + ``nr_to_reclaim`` to limit the number of pages to evict. Multiple + command lines are supported, so does concatenation with delimiters + ``,`` and ``;``. + +Framework +========= +For each ``lruvec``, evictable pages are divided into multiple +generations. The youngest generation number is stored in ``max_seq`` +for both anon and file types as they are aged on an equal footing. The +oldest generation numbers are stored in ``min_seq[2]`` separately for +anon and file types as clean file pages can be evicted regardless of +swap and write-back constraints. These three variables are +monotonically increasing. Generation numbers are truncated into +``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into +``page->flags``. The sliding window technique is used to prevent +truncated generation numbers from overlapping. Each truncated +generation number is an index to an array of per-type and per-zone +lists. Evictable pages are added to the per-zone lists indexed by +``max_seq`` or ``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``), +depending on their types. + +Each generation is then divided into multiple tiers. Tiers represent +levels of usage from file descriptors only. Pages accessed N times via +file descriptors belong to tier order_base_2(N). Each generation +contains at most CONFIG_TIERS_PER_GEN tiers, and they require +additional CONFIG_TIERS_PER_GEN-2 bits in page->flags. In contrast to +moving across generations which requires the lru lock for the list +operations, moving across tiers only involves an atomic operation on +``page->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors the refault rates across all +tiers and decides when to activate pages from which tiers in the +reclaim path. + +The framework comprises two conceptually independent components: the +aging and the eviction, which can be invoked separately from user +space for the purpose of working set estimation and proactive reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, the aging +scans page tables for referenced pages of this ``lruvec``. Upon +finding one, the aging updates its generation number to ``max_seq``. +After each round of scan, the aging increments ``max_seq``. + +The aging maintains either a system-wide ``mm_struct`` list or +per-memcg ``mm_struct`` lists, and it only scans page tables of +processes that have been scheduled since the last scan. + +The aging is due when both of ``min_seq[2]`` reaches ``max_seq-1``, +assuming both anon and file types are reclaimable. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, the +eviction scans the pages on the per-zone lists indexed by either of +``min_seq[2]``. It first tries to select a type based on the values of +``min_seq[2]``. When anon and file types are both available from the +same generation, it selects the one that has a lower refault rate. + +During a scan, the eviction sorts pages according to their new +generation numbers, if the aging has found them referenced. It also +moves pages from the tiers that have higher refault rates than tier 0 +to the next generation. + +When it finds all the per-zone lists of a selected type are empty, the +eviction increments ``min_seq[2]`` indexed by this selected type. + +To-do List +========== +KVM Optimization +---------------- +Support shadow page table scanning. + +NUMA Optimization +----------------- +Optimize page table scan for NUMA.