[mm-unstable,v2,0/8] mm: multi-gen LRU: memcg LRU

Message ID

20221221001207.1376119-1-yuzhao@google.com (mailing list archive)

Headers

Date: Tue, 20 Dec 2022 17:12:00 -0700
Message-Id: <20221221001207.1376119-1-yuzhao@google.com>
Mime-Version: 1.0
Subject: [PATCH mm-unstable v2 0/8] mm: multi-gen LRU: memcg LRU
From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Jonathan Corbet <corbet@lwn.net>,
	Michael Larabel <michael@michaellarabel.com>,
	Michal Hocko <mhocko@kernel.org>,
	Mike Rapoport <rppt@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>,
	Suren Baghdasaryan <surenb@google.com>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	linux-mm@google.com, Yu Zhao <yuzhao@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

mm: multi-gen LRU: memcg LRU

Message

Yu Zhao Dec. 21, 2022, 12:12 a.m. UTC

What's new
==========
1. Rebased to the latest mm-unstable.
2. Added two comprehensive benchmarks:
   https://lore.kernel.org/r/20221220214923.1229538-1-yuzhao@google.com/
   https://lore.kernel.org/r/20221221000748.1374772-1-yuzhao@google.com/

Overview
========
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).

Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.

Its memory overhead is one pointer per lruvec and a negligible amount
per pglist_data. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not
affect the worst-case complexity O(n). Therefore, on average, it has
a sublinear complexity in contrast to the current linear complexity.

The basic structure of an memcg LRU can be understood by an analogy
to the active/inactive LRU (of folios):
1. It has the young and the old (generations), the counterparts to
   the active and the inactive;
2. The increment of max_seq triggers promotion, the counterpart to
   activation;
3. Other events, e.g., offlining an memcg, trigger similar
   operations, e.g., demotion, the counterpart to deactivation.

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
   reduces latency without affecting fairness over some time.
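
The two features above can be illustrated with a toy model. This is only
a sketch under stated assumptions, not the kernel algorithm: the memcgs
are modeled as a fixed ring, each reclaim pass starts at a random memcg
and bails out after a fixed number of them, and NR_MEMCGS, BATCH and
PASSES are made-up illustration values. It shows only that random
starting points plus early bail-out still spread the work evenly over
many passes.

```shell
#!/usr/bin/env bash
# Toy model only; all names and numbers here are hypothetical.
NR_MEMCGS=8      # memcgs in the (modeled) old generation
BATCH=2          # memcgs visited before a pass bails out
PASSES=4000      # number of reclaim passes to simulate

declare -a visits
for ((i = 0; i < NR_MEMCGS; i++)); do
    visits[i]=0
done

for ((pass = 0; pass < PASSES; pass++)); do
    # Sharding: each pass picks its own random starting memcg.
    start=$((RANDOM % NR_MEMCGS))
    # Early bail-out: visit only BATCH memcgs, then stop.
    for ((k = 0; k < BATCH; k++)); do
        idx=$(( (start + k) % NR_MEMCGS ))
        visits[idx]=$(( visits[idx] + 1 ))
    done
done

# Over many passes the per-memcg visit counts end up close to each other.
total=0
for ((i = 0; i < NR_MEMCGS; i++)); do
    echo "memcg$i visits=${visits[i]}"
    total=$(( total + visits[i] ))
done
echo "total visits=$total"
```

Each pass touches only a quarter of the ring in this model, yet every
memcg is expected to be visited about PASSES * BATCH / NR_MEMCGS times.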

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221221001207.1376119-7-yuzhao@google.com/

The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.

  Test design:
  1. Create multiple memcgs.
  2. Each memcg contains a job (fio).
  3. All jobs access the same amount of memory randomly.
  4. The system does not experience global memory pressure.
  5. Periodically write to the root memory.reclaim.

  Desired outcome:
  1. All memcgs have similar pgsteal counts, i.e.,
     stddev(pgsteal)/mean(pgsteal) is close to 0%.
  2. The total pgsteal is close to the total requested through
     memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
     100%.

  Actual outcome [1]:
                                     MGLRU off    MGLRU on
  stddev(pgsteal) / mean(pgsteal)    75%          20%
  sum(pgsteal) / sum(requested)      425%         95%
  ####################################################################
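  MEMCGS=128

  # Create one memcg per job.
  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      mkdir "/sys/fs/cgroup/memcg$memcg"
  done

  # Move the current (sub)shell into its memcg, then run fio there.
  start() {
      echo "$BASHPID" > "/sys/fs/cgroup/memcg$memcg/cgroup.procs"

      fio --name="memcg$memcg" --numjobs=1 --ioengine=mmap \
          --filename=/dev/zero --size=1920M --rw=randrw \
          --rate=64m,64m --random_distribution=random \
          --fadvise_hint=0 --time_based --runtime=10h \
          --group_reporting --minimal
  }

  # Start all jobs in the background, one per memcg.
  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      start &
  done

  # Let the jobs run for 10 minutes before requesting any reclaim.
  sleep 600

  # Request 256m of reclaim from the root memcg every 6 seconds.
  for ((i = 0; i < 600; i++)); do
      echo 256m > /sys/fs/cgroup/memory.reclaim
      sleep 6
  done

  # Print the pgsteal count of each memcg.
  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      grep "pgsteal " "/sys/fs/cgroup/memcg$memcg/memory.stat"
  done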
  ####################################################################
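
The "pgsteal <count>" lines printed by the last loop can be reduced to
the two metrics from the desired outcome. The helper below is a minimal
sketch and not part of the original test: the name pgsteal_metrics is
hypothetical, the page size is assumed to be 4096 bytes unless given,
and the requested total has to be supplied by the caller in bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original script.
# stdin: "pgsteal <count>" lines, one per memcg (counts are in pages)
# $1:    total bytes requested through memory.reclaim
# $2:    page size in bytes (assumption: 4096 if omitted)
pgsteal_metrics() {
    awk -v requested="$1" -v page_size="${2:-4096}" '
        $1 == "pgsteal" { v[n++] = $2; sum += $2 }
        END {
            if (n == 0 || sum == 0)
                exit 1
            mean = sum / n
            # Second pass: population variance around the mean.
            for (i = 0; i < n; i++)
                dev += (v[i] - mean) * (v[i] - mean)
            printf "stddev/mean: %.0f%%\n", 100 * sqrt(dev / n) / mean
            printf "sum/requested: %.0f%%\n", 100 * sum * page_size / requested
        }'
}
```

For example, saving the output of the last loop to a file and running
"pgsteal_metrics $((600 * 256 * 1024 * 1024)) < file" would report both
ratios for the 600 requests of 256m made by the script.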

[1]: This was obtained from running the above script (touches less
     than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
     hour.
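
The figures in [1] can be cross-checked against the parameters of the
script. The sketch below only redoes that arithmetic; it assumes the M
and m suffixes are binary units (MiB) and ignores the time each write to
memory.reclaim takes, so the runtime is a lower bound.

```shell
#!/usr/bin/env bash
# Values copied from the test script; variable names are illustrative.
memcgs=128        # MEMCGS
size_mb=1920      # fio --size=1920M, per memcg
warmup_s=600      # sleep 600
writes=600        # iterations of the memory.reclaim loop
request_mb=256    # echo 256m
interval_s=6      # sleep 6

footprint_gb=$(( memcgs * size_mb / 1024 ))
requested_gb=$(( writes * request_mb / 1024 ))
runtime_min=$(( (warmup_s + writes * interval_s) / 60 ))

echo "footprint: ${footprint_gb}GB"    # prints "footprint: 240GB"
echo "requested: ${requested_gb}GB"    # prints "requested: 150GB"
echo "runtime: ${runtime_min}min"      # prints "runtime: 70min"
```

This agrees with "less than 256GB memory" and "over an hour" above.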

Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 131 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 768 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 692 insertions(+), 271 deletions(-)

Comments

Yu Zhao Dec. 22, 2022, 12:17 a.m. UTC | #1

On Tue, Dec 20, 2022 at 5:12 PM Yu Zhao <yuzhao@google.com> wrote:
>
> What's new
> ==========
> 1. Rebased to the latest mm-unstable.

Apparently today's mm-unstable doesn't have prandom_u32_max() anymore.
IOW, this series now has a conflict with commit 8032bf1233a7
("treewide: use get_random_u32_below() instead of deprecated
function").

Will post v3 to use get_random_u32_below() instead.