
[v1,00/14] Multigenerational LRU

Message ID: 20210313075747.3781593-1-yuzhao@google.com

Message

Yu Zhao March 13, 2021, 7:57 a.m. UTC
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and
often makes poor choices about what to evict. We would like to offer a
performant, versatile and straightforward augmentation.

Repo
====
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/01/1101/1

Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101

Background
==========
DRAM is a major factor in total cost of ownership, and improving
memory overcommit brings a high return on investment. Over the past
decade of research and experimentation in memory overcommit, we
observed a distinct trend across millions of servers and clients: the
size of page cache has been decreasing because of the growing
popularity of cloud storage. Nowadays anon pages account for more than
90% of our memory consumption and page cache contains mostly
executable pages.

Problems
========
Notion of the active/inactive
-----------------------------
For servers equipped with hundreds of gigabytes of memory, the
granularity of the active/inactive is too coarse to be useful for job
scheduling, and the false active/inactive rates are relatively high.
In addition, the number of pages each scan covers varies widely and
unpredictably because inactive_is_low() is based on magic numbers.
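
For reference, the heuristic in question looks roughly like this in
v5.11's mm/vmscan.c (a close paraphrase, not an exact quote; the
int_sqrt(10 * gb) ratio is the magic in point):

  static bool inactive_is_low(struct lruvec *lruvec,
                              enum lru_list inactive_lru)
  {
          enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
          unsigned long inactive, active;
          unsigned long inactive_ratio;
          unsigned long gb;

          inactive = lruvec_page_state(lruvec, NR_LRU_BASE + inactive_lru);
          active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);

          /* let the active list grow with the total size:
             1GB -> ratio 3, 10GB -> 10, 100GB -> 31, ... */
          gb = (inactive + active) >> (30 - PAGE_SHIFT);
          if (gb)
                  inactive_ratio = int_sqrt(10 * gb);
          else
                  inactive_ratio = 1;

          return inactive * inactive_ratio < active;
  }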

For phones and laptops, the eviction is biased toward file pages
because the selection has to resort to heuristics as direct
comparisons between anon and file types are infeasible. On Android and
Chrome OS, executable pages are frequently evicted despite the fact
that there are many less recently used anon pages. This causes "janks"
(slow UI rendering) and negatively impacts user experience.

For systems with multiple nodes and/or memcgs, it is impossible to
compare lruvecs based on the notion of the active/inactive.

Incremental scans via the rmap
------------------------------
Each incremental scan picks up where the last scan left off and
stops after it has found a handful of unreferenced pages. For most of
the systems running cloud workloads, incremental scans lose the
advantage under sustained memory pressure due to high ratios of the
number of scanned pages to the number of reclaimed pages. In our case,
the average ratio of pgscan to pgsteal is about 7.

On top of that, the rmap has poor memory locality due to its complex
data structures. The combined effects typically result in a high
amount of CPU usage in the reclaim path. For example, with zram, a
typical kswapd profile on v5.11 looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

And with real swap, it looks like:
  45.16%  page_vma_mapped_walk
   7.61%  do_raw_spin_lock
   5.69%  vma_interval_tree_iter_next
   4.91%  vma_interval_tree_subtree_search
   3.71%  page_referenced_one

Solutions
=========
Notion of generation numbers
----------------------------
The notion of generation numbers introduces a quantitative approach to
memory overcommit. A larger number of pages can be spread out across
configurable generations, and thus they have relatively low false
active/inactive rates. Each generation includes all pages that have
been referenced since the last generation.

Given an lruvec, scans and the selections between anon and file types
are all based on generation numbers, which are simple and yet
effective. For different lruvecs, comparisons are still possible based
on birth times of generations.

Differential scans via page tables
----------------------------------
Each differential scan discovers all pages that have been referenced
since the last scan. Specifically, it walks the mm_struct list
associated with an lruvec to scan page tables of processes that have
been scheduled since the last scan. The cost of each differential scan
is roughly proportional to the number of referenced pages it
discovers. Unless address spaces are extremely sparse, page tables
usually have better memory locality than the rmap. The end result is
generally a significant reduction in CPU usage, for most of the
systems running cloud workloads.

On Chrome OS, our real-world benchmark that browses popular websites
in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
less PSI (full) on v5.11. The kswapd profile then looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at the 99th
percentile and the number of refaults is reduced by 7%. These metrics
are important to phones and laptops as they are correlated with user
experience.

Workflow
========
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lruvec->evictable.max_seq
for both anon and file types as they are aged on an equal footing. The
oldest generation numbers are stored in lruvec->evictable.min_seq[2]
separately for anon and file types as clean file pages can be evicted
regardless of may_swap or may_writepage. Generation numbers are
truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
page->flags. The sliding window technique is used to prevent truncated
generation numbers from overlapping. Each truncated generation number
is an index to
lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
Evictable pages are added to the per-zone lists indexed by max_seq or
min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
faulted in or read ahead. The workflow comprises two conceptually
independent functions: the aging and the eviction.
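
As a sketch of the layout just described (field names follow this
cover letter; the struct name and the helper are illustrative, not
necessarily what the patches use):

  struct lrugen_sketch {
          /* the youngest generation; shared by the anon and file types */
          unsigned long max_seq;
          /* the oldest generations; separate for the anon and file types */
          unsigned long min_seq[ANON_AND_FILE];
          /* per-generation, per-type, per-zone lists of pages */
          struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
  };

  /* the sliding window: a page's truncated generation number stored
     in page->flags indexes lists[] modulo MAX_NR_GENS */
  static inline int gen_from_seq(unsigned long seq)
  {
          return seq % MAX_NR_GENS;
  }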

Aging
-----
The aging produces young generations. Given an lruvec, the aging scans
page tables for referenced pages of this lruvec. Upon finding one, the
aging updates its generation number to max_seq. After each round of
scan, the aging increments max_seq. The aging maintains either a
system-wide mm_struct list or per-memcg mm_struct lists and tracks
whether an mm_struct is being used on any CPUs or has been used since
the last scan. Multiple threads can concurrently work on the same
mm_struct list, and each of them will be given a different mm_struct
belonging to a process that has been scheduled since the last scan.
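
As a hypothetical outline of one aging pass (get_next_mm() and
walk_mm() stand in for the real mm_struct list iteration and page
table walk; they are not names from the patches):

  static void age_lruvec_sketch(struct lruvec *lruvec)
  {
          struct mm_struct *mm;

          /* each caller gets a different mm_struct belonging to a
             process that has been scheduled since the last scan */
          while ((mm = get_next_mm(lruvec))) {
                  /* set the generation number of each referenced page
                     found in this mm's page tables to max_seq */
                  walk_mm(lruvec, mm);
          }

          /* this round is done; open the next (young) generation */
          WRITE_ONCE(lruvec->evictable.max_seq,
                     lruvec->evictable.max_seq + 1);
  }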

Eviction
--------
The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either element of
min_seq[2]. It selects a type according to the values of min_seq[2]
and swappiness. During a scan, the eviction either sorts or isolates a
page, depending on whether the aging has updated its generation
number. When it finds that all the per-zone lists of the selected type
are empty, the eviction increments the element of min_seq[2] indexed
by that type. The eviction triggers the aging when both elements of
min_seq[2] reach max_seq-1, assuming both anon and file types are
reclaimable.
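
Again as a hypothetical outline (select_type(), page_gen_updated(),
sort_page() and isolate_page() stand in for the real logic; they are
not names from the patches):

  static void evict_lruvec_sketch(struct lruvec *lruvec, int swappiness)
  {
          /* anon or file, chosen from min_seq[2] and swappiness */
          int type = select_type(lruvec, swappiness);
          int gen = lruvec->evictable.min_seq[type] % MAX_NR_GENS;
          int zone;

          for (zone = 0; zone < MAX_NR_ZONES; zone++) {
                  struct list_head *head =
                          &lruvec->evictable.lists[gen][type][zone];
                  struct page *page, *next;

                  list_for_each_entry_safe(page, next, head, lru) {
                          if (page_gen_updated(page))
                                  /* the aging refreshed it; move it to
                                     the list of its new generation */
                                  sort_page(lruvec, page);
                          else
                                  /* a reclaim candidate */
                                  isolate_page(lruvec, page);
                  }
          }

          /* the oldest generation of this type is now empty */
          lruvec->evictable.min_seq[type]++;
  }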

Use cases
=========
On Android, our most advanced simulation that generates memory
pressure from realistic user behavior shows 18% fewer low-memory
kills, which in turn reduces cold starts by 16%.

On Borg, a similar approach enables us to identify jobs that
underutilize their memory and downsize them considerably without
compromising any of our service level indicators.

On Chrome OS, our field telemetry reports 96% fewer low-memory tab
discards and 59% fewer OOM kills from fully-utilized devices and no UX
regressions from underutilized devices.

For other use cases, including working set estimation, proactive
reclaim, far memory tiering and NUMA-aware job scheduling, please
refer to the documentation included in this series and the following
references.

References
==========
1. Long-term SLOs for reclaimed cloud computing resources
   https://research.google/pubs/pub43017/
2. Profiling a warehouse-scale computer
   https://research.google/pubs/pub44271/
3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
   https://research.google/pubs/pub48329/
4. Software-defined far memory in warehouse-scale computers
   https://research.google/pubs/pub48551/
5. Borg: the Next Generation
   https://research.google/pubs/pub49065/

Yu Zhao (14):
  include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
    !CONFIG_MEMCG
  include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
  include/linux/huge_mm.h: define is_huge_zero_pmd() if
    !CONFIG_TRANSPARENT_HUGEPAGE
  include/linux/cgroup.h: export cgroup_mutex
  mm/swap.c: export activate_page()
  mm, x86: support the access bit on non-leaf PMD entries
  mm/pagewalk.c: add pud_entry_post() for post-order traversals
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: core
  mm: multigenerational lru: page activation
  mm: multigenerational lru: user space interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst        |    1 +
 Documentation/vm/multigen_lru.rst |  210 +++
 arch/Kconfig                      |    8 +
 arch/x86/Kconfig                  |    1 +
 arch/x86/include/asm/pgtable.h    |    2 +-
 arch/x86/mm/pgtable.c             |    5 +-
 fs/exec.c                         |    2 +
 fs/proc/task_mmu.c                |    3 +-
 include/linux/cgroup.h            |   15 +-
 include/linux/huge_mm.h           |    5 +
 include/linux/memcontrol.h        |    5 +-
 include/linux/mm.h                |    1 +
 include/linux/mm_inline.h         |  246 ++++
 include/linux/mm_types.h          |  135 ++
 include/linux/mmzone.h            |   62 +-
 include/linux/nodemask.h          |    1 +
 include/linux/page-flags-layout.h |   20 +-
 include/linux/pagewalk.h          |    4 +
 include/linux/pgtable.h           |    4 +-
 include/linux/swap.h              |    5 +-
 kernel/events/uprobes.c           |    2 +-
 kernel/exit.c                     |    1 +
 kernel/fork.c                     |   10 +
 kernel/kthread.c                  |    1 +
 kernel/sched/core.c               |    2 +
 mm/Kconfig                        |   29 +
 mm/huge_memory.c                  |    5 +-
 mm/khugepaged.c                   |    2 +-
 mm/memcontrol.c                   |   28 +
 mm/memory.c                       |   14 +-
 mm/migrate.c                      |    2 +-
 mm/mm_init.c                      |   13 +-
 mm/mmzone.c                       |    2 +
 mm/pagewalk.c                     |    5 +
 mm/rmap.c                         |    6 +
 mm/swap.c                         |   58 +-
 mm/swapfile.c                     |    6 +-
 mm/userfaultfd.c                  |    2 +-
 mm/vmscan.c                       | 2091 +++++++++++++++++++++++++++--
 39 files changed, 2870 insertions(+), 144 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

Comments

Zi Yan March 14, 2021, 10:48 p.m. UTC | #1
On 13 Mar 2021, at 2:57, Yu Zhao wrote:

[...]

> On Chrome OS, our real-world benchmark that browses popular websites
> in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
> less PSI (full) on v5.11. The kswapd profile then looks like:
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw

Is this profile from a system with this patchset applied or not?
Do you mind sharing some profiling data from before and after
applying the patchset? That would make it easier to see the
improvement brought by this patchset.

[...]

> Use cases
> =========
> On Android, our most advanced simulation that generates memory
> pressure from realistic user behavior shows 18% fewer low-memory
> kills, which in turn reduces cold starts by 16%.
>
> On Borg, a similar approach enables us to identify jobs that
> underutilize their memory and downsize them considerably without
> compromising any of our service level indicators.
>
> On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> discards and 59% fewer OOM kills from fully-utilized devices and no UX
> regressions from underutilized devices.
>
> For other use cases, including working set estimation, proactive
> reclaim, far memory tiering and NUMA-aware job scheduling, please
> refer to the documentation included in this series and the following
> references.

Are there any performance numbers for a specific application (before
and after applying the patches) that you can show to demonstrate the
improvement?

Thanks.


--
Best Regards,
Yan Zi
Yu Zhao March 15, 2021, 12:52 a.m. UTC | #2
On Sun, Mar 14, 2021 at 06:48:17PM -0400, Zi Yan wrote:
> On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> > Problems
> > ========

> >   31.03%  page_vma_mapped_walk
> >   25.59%  lzo1x_1_do_compress
> >    4.63%  do_raw_spin_lock
> >    3.89%  vma_interval_tree_iter_next
> >    3.33%  vma_interval_tree_subtree_search

> > Solutions
> > =========

> >   49.36%  lzo1x_1_do_compress
> >    4.54%  page_vma_mapped_walk
> >    4.45%  memset_erms
> >    3.47%  walk_pte_range
> >    2.88%  zram_bvec_rw
> 
> Is this profile from a system with this patchset applied or not?
> Do you mind sharing some profiling data from before and after
> applying the patchset? That would make it easier to see the
> improvement brought by this patchset.

I've snipped everything else to make the context more clear.

These two kswapd profiles were collected under roughly the same memory
pressure. In other words, kswapd reclaimed (compressed) about the same
number of pages and therefore spent about the same amount of CPU time
in lzo1x_1_do_compress() in each profile.

The percentages of lzo1x_1_do_compress() are different because the
amounts of total CPU usage are different. Dividing the second
percentage by the first, we know we have roughly cut kswapd CPU usage
by half.
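
To make the arithmetic explicit: with lzo1x_1_do_compress() taking the
same absolute CPU time T in both profiles,

  total_before = T / 0.2559
  total_after  = T / 0.4936
  total_after / total_before = 0.2559 / 0.4936 ~= 0.52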

> Are there any performance numbers for a specific application (before
> and after applying the patches) that you can show to demonstrate the
> improvement?

The kswapd profiles are from Chrome OS, i.e., laptops running the
v5.11 kernel and the Chrome browser. And we've also collected
benchmarks from various workloads on servers and phones running older
kernel versions too. Do you have a platform in mind? I'd be happy to
share the data with you. Or if you have some workloads/benchmarks, I
could collect some numbers from them too.
Hillf Danton March 15, 2021, 1:13 a.m. UTC | #3
On Sat, 13 Mar 2021 00:57:33 -0700 Yu Zhao wrote:
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and
> often makes poor choices about what to evict. We would like to offer
> a performant, versatile and straightforward augmentation.

It makes my day, a Monday of thick smog in one of the big far-east
cities, to read the fresh work of a coming Mr. Kswapd, something like
0b0695f2b34a that removed heuristics as much as possible.

Yu Zhao March 15, 2021, 6:49 a.m. UTC | #4
On Mon, Mar 15, 2021 at 09:13:50AM +0800, Hillf Danton wrote:
> On Sat, 13 Mar 2021 00:57:33 -0700 Yu Zhao wrote:
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and
> > often makes poor choices about what to evict. We would like to offer
> > a performant, versatile and straightforward augmentation.
> 
> It makes my day, a Monday of thick smog in one of the big far-east
> cities, to read the fresh work of a coming Mr. Kswapd, something like
> 0b0695f2b34a that removed heuristics as much as possible.

Hi Hillf!

Sorry to hear about the smog, we don't have smog, only a few feet of
snow...

I shared the latest version of the cover letter here, if you are a fan
of Google Docs:
https://docs.google.com/document/d/1UxcpPAFNk1KpTJDKDXWekj_n6ebpQ-cwbXZlYoebTVM

And speaking of heuristics, yeah, I totally understand. We've had more
than our fair share of problems with get_scan_count() and
inactive_is_low(). And we are still carrying a workaround (admittedly
a terrible one) we posted more than a decade ago, on some of our old
kernel versions:
https://lore.kernel.org/linux-mm/20101028191523.GA14972@google.com/

And people who run the Chrome browser but don't have this patch (non-
Chrome OS) had problems!
https://lore.kernel.org/linux-mm/54C77086.7090505@suse.cz/

With generation numbers, the equivalent to inactive_is_low() is

  max_seq - min(min_seq[!swappiness], min_seq[1]) + 1 > MIN_NR_GENS

in get_nr_to_scan(); and the equivalent to get_scan_count() is

  *file = !swappiness || min_seq[0] > min_seq[1] ||
	  (min_seq[0] == min_seq[1] &&
	   max(lruvec->evictable.isolated[0], 1UL) * (200 - swappiness) >
	   max(lruvec->evictable.isolated[1], 1UL) * (swappiness - 1));

in isolate_lru_gen_pages(), both in the 10th patch.

They work amazingly well for us, and hopefully for you too :)
Dave Hansen March 15, 2021, 6 p.m. UTC | #5
On 3/12/21 11:57 PM, Yu Zhao wrote:
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.

This makes a compelling argument that current reclaim is not well
optimized for anonymous memory with low rates of sharing.  Basically,
anonymous rmap is very powerful, but we're not getting enough bang for
our buck out of it.

I also understand that the workloads you reference are anonymous-heavy
and that page cache isn't a *major* component.

But, what happens to page-cache-heavy workloads?  Does this just
effectively force databases that want to use shmem over to hugetlbfs?
How bad does this scanning get in the worst case if there's a lot of
sharing?

I'm kinda surprised by this, but my 16GB laptop has a lot more page
cache than I would have guessed:

> Active(anon):    4065088 kB
> Inactive(anon):  3981928 kB
> Active(file):    2260580 kB
> Inactive(file):  3738096 kB
> AnonPages:       6624776 kB
> Mapped:           692036 kB
> Shmem:            776276 kB

Most of it isn't mapped, but it's far from all being used for text.
Yang Shi March 15, 2021, 6:38 p.m. UTC | #6
On Fri, Mar 12, 2021 at 11:57 PM Yu Zhao <yuzhao@google.com> wrote:
[...]

> Incremental scans via the rmap
> ------------------------------
> Each incremental scan picks up where the last scan left off and
> stops after it has found a handful of unreferenced pages. For most of
> the systems running cloud workloads, incremental scans lose the
> advantage under sustained memory pressure due to high ratios of the
> number of scanned pages to the number of reclaimed pages. In our case,
> the average ratio of pgscan to pgsteal is about 7.

So, you mean the reclaim efficiency is just 1/7? It seems quite low.
Just out of curiosity, do you have more insights into why it is that
low? I think it heavily depends on the workload. We have page cache
heavy workloads, and the efficiency rate is quite high.

>
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>    4.63%  do_raw_spin_lock
>    3.89%  vma_interval_tree_iter_next
>    3.33%  vma_interval_tree_subtree_search
>
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>    7.61%  do_raw_spin_lock
>    5.69%  vma_interval_tree_iter_next
>    4.91%  vma_interval_tree_subtree_search
>    3.71%  page_referenced_one

I guess it is because your workloads have a lot of shared anon pages?

>
> Solutions
> =========
> Notion of generation numbers
> ----------------------------
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> configurable generations, and thus they have relatively low false
> active/inactive rates. Each generation includes all pages that have
> been referenced since the last generation.
>
> Given an lruvec, scans and the selections between anon and file types
> are all based on generation numbers, which are simple and yet
> effective. For different lruvecs, comparisons are still possible based
> on birth times of generations.

It means you replace the active/inactive lists with multiple lists,
from most active to least active?

>
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for most of the
> systems running cloud workloads.

How about unmapped page caches? I think they are still quite common
for a lot of workloads.

[...]

> Workflow
> ========
> Evictable pages are divided into multiple generations for each lruvec.
> The youngest generation number is stored in lruvec->evictable.max_seq
> for both anon and file types as they are aged on an equal footing. The
> oldest generation numbers are stored in lruvec->evictable.min_seq[2]
> separately for anon and file types as clean file pages can be evicted
> regardless of may_swap or may_writepage. Generation numbers are
> truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
> page->flags. The sliding window technique is used to prevent truncated
> generation numbers from overlapping. Each truncated generation number
> is an index to
> lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> Evictable pages are added to the per-zone lists indexed by max_seq or
> min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> faulted in or read ahead. The workflow comprises two conceptually
> independent functions: the aging and the eviction.

Could you please illustrate the data structures? I think this would be
very helpful to understand the code. I haven't looked into the code
closely yet, but per my shallow understanding of the above paragraphs,
the new lruvec looks like:

----------------
| max_seq  |
----------------
| .....            |
----------------
| min_seq.  | -----> -------------
----------------          |  Anon    | ---------> -------------------
                              ------------               | MAX_ZONE  |
--------> list of pages
                             |  File       |              --------------------
                              -------------              | .......
          | --------->
                                                            --------------------
                                                            | ZONE_DMA
 | --------->
                                                            --------------------

And the max_seq/min_seq is per memcg, is my understanding correct?

>
> Aging
> -----
> The aging produces young generations. Given an lruvec, the aging scans
> page tables for referenced pages of this lruvec. Upon finding one, the
> aging updates its generation number to max_seq. After each round of
> scan, the aging increments max_seq. The aging maintains either a
> system-wide mm_struct list or per-memcg mm_struct lists and tracks
> whether an mm_struct is being used on any CPUs or has been used since
> the last scan. Multiple threads can concurrently work on the same
> mm_struct list, and each of them will be given a different mm_struct
> belonging to a process that has been scheduled since the last scan.

I don't quite get how the "aging" works. IIUC, do you have a dedicated
kernel thread (or threads) that scans the page tables periodically to
update the generations and promote or demote pages among the lists, or
does the "aging" just happen in the reclaimer?

Thanks,
Yang

Yu Zhao March 16, 2021, 2:24 a.m. UTC | #7
On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Background
> > ==========
> > DRAM is a major factor in total cost of ownership, and improving
> > memory overcommit brings a high return on investment. Over the past
> > decade of research and experimentation in memory overcommit, we
> > observed a distinct trend across millions of servers and clients: the
> > size of page cache has been decreasing because of the growing
> > popularity of cloud storage. Nowadays anon pages account for more than
> > 90% of our memory consumption and page cache contains mostly
> > executable pages.
> 
> This makes a compelling argument that current reclaim is not well
> optimized for anonymous memory with low rates of sharing.  Basically,
> anonymous rmap is very powerful, but we're not getting enough bang for
> our buck out of it.
> 
> I also understand that the workloads you reference are anonymous-heavy
> and that page cache isn't a *major* component.
> 
> But, what happens to page-cache-heavy workloads?  Does this just
> effectively force databases that want to use shmem over to hugetlbfs?

No, they should benefit too. In terms of page reclaim, shmem pages are
basically considered anon: they are on the anon LRU, and dirty shmem
pages can only be swapped (we can safely assume clean shmem pages are
virtually nonexistent), in contrast to file pages that have backing
storage and need to be written back.

I should have phrased it better: our accounting is based on what the
kernel provides, i.e., the anon/file (LRU) sizes you listed below.

> How bad does this scanning get in the worst case if there's a lot of
> sharing?

Actually the improvement is larger when there is more sharing, i.e.,
the higher the map_count, the larger the improvement. Let's assume we
have a shmem page mapped by two processes. To reclaim this page, we
need to make sure neither PTE from the two sets of page tables has the
accessed bit. The current page reclaim uses the rmap, i.e.,
rmap_walk_file(). It first looks up the two VMAs (from the two
processes mapping this shmem file) in the interval tree of this shmem
file, then from each VMA, it goes through PGD/PUD/PMD to reach the
PTE. The page can't be reclaimed if either of the PTEs has the
accessed bit; therefore the cost of the scanning is more than
proportional to the number of accesses when there is a lot of sharing.

Why this series makes it better? We track the usage of page tables.
Specifically, we work alongside switch_mm(): if one of the processes
above hasn't been scheduled since the last scan, we don't need to scan
its page tables. So the cost is roughly proportional to the number of
accesses, regardless of how many processes. And instead of scanning
pages one by one, we do it in large batches. However, page tables can
be very sparse -- this is not a problem for the rmap because it knows
exactly where the PTEs are (by vma_address()). We only know ranges (by
vma->vm_start/vm_end). This is where the accessed bit on non-leaf
PMDs can be of help.
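
To illustrate with a minimal sketch (not the patch itself): if the CPU
never set the accessed bit on a non-leaf PMD entry, none of the PTEs
below it have been used since the bit was last cleared, so a walker
can skip the whole PTE table. pmd_young() tests the accessed bit;
walk_pte_range_sketch() stands in for the per-PTE scan:

  static void walk_pmd_range_sketch(pmd_t *pmd, unsigned long addr,
                                    unsigned long end)
  {
          unsigned long next;

          do {
                  next = pmd_addr_end(addr, end);
                  /* empty, or nothing below it has been referenced */
                  if (pmd_none(*pmd) || !pmd_young(*pmd))
                          continue;
                  /* otherwise scan the PTEs under this entry */
                  walk_pte_range_sketch(pmd, addr, next);
          } while (pmd++, addr = next, addr != end);
  }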

But I guess you are wondering what the downsides are. Well, we haven't
seen any (yet). We do have page cache (non-shmem) heavy workloads,
but not at a scale large enough to make any statistically meaningful
observations. We are very interested in working with anybody who has
page cache (non-shmem) heavy workloads and is willing to try out this
series.

> I'm kinda surprised by this, but my 16GB laptop has a lot more page
> cache than I would have guessed:
> 
> > Active(anon):    4065088 kB
> > Inactive(anon):  3981928 kB
> > Active(file):    2260580 kB
> > Inactive(file):  3738096 kB
> > AnonPages:       6624776 kB
> > Mapped:           692036 kB
> > Shmem:            776276 kB
> 
> Most of it isn't mapped, but it's far from all being used for text.

We have categorized two groups:
  1) average users that haven't experienced memory pressure since
  their systems have booted. The booting process fills up page cache
  with one-off file pages, and they remain until users experience
  memory pressure. This can be confirmed by looking at those counters
of a freshly rebooted and idle system. My guess is that this is the case for
  your laptop.
  2) engineering users who store git repos and compile locally. They
  complained about their browsers being janky because anon memory got
  swapped even though their systems had a lot of stale file pages in
  page cache, with the current page reclaim. They are what we consider
  part of the page cache (non-shmem) heavy group.
Yu Zhao March 16, 2021, 3:38 a.m. UTC | #8
On Mon, Mar 15, 2021 at 11:38:20AM -0700, Yang Shi wrote:
> On Fri, Mar 12, 2021 at 11:57 PM Yu Zhao <yuzhao@google.com> wrote:
[...]

> > Incremental scans via the rmap
> > ------------------------------
> > Each incremental scan picks up where the last scan left off and
> > stops after it has found a handful of unreferenced pages. For most of
> > the systems running cloud workloads, incremental scans lose the
> > advantage under sustained memory pressure due to high ratios of the
> > number of scanned pages to the number of reclaimed pages. In our case,
> > the average ratio of pgscan to pgsteal is about 7.
> 
> So, you mean the reclaim efficiency is just 1/7? It seems quite low.

Well, from the perspective of memory utilization, a 7:1 ratio means
6 out of 7 scanned pages were still in use, i.e., 6/7 is non-idle.
And in our dictionary, high "reclaim efficiency" is a synonym for
underutilization :)

> Just out of curiosity, did you have more insights about why it is that
> low? I think it heavily depends on workload. We have page cache heavy
> workloads, the efficiency rate is quite high.

Yes, our observation on (a small group of) page cache heavy workloads
is the same. They access files via file descriptors, and sometimes
stream large files, i.e., read each file page only once. The pages
they leave in page cache are highly reclaimable: they are clean and
not mapped into page tables, and therefore can be dropped quickly.

> > On top of that, the rmap has poor memory locality due to its complex
> > data structures. The combined effects typically result in a high
> > amount of CPU usage in the reclaim path. For example, with zram, a
> > typical kswapd profile on v5.11 looks like:
> >   31.03%  page_vma_mapped_walk
> >   25.59%  lzo1x_1_do_compress
> >    4.63%  do_raw_spin_lock
> >    3.89%  vma_interval_tree_iter_next
> >    3.33%  vma_interval_tree_subtree_search
> >
> > And with real swap, it looks like:
> >   45.16%  page_vma_mapped_walk
> >    7.61%  do_raw_spin_lock
> >    5.69%  vma_interval_tree_iter_next
> >    4.91%  vma_interval_tree_subtree_search
> >    3.71%  page_referenced_one
> 
> I guess it is because your workloads have a lot of shared anon pages?

Sharing (map_count > 1) does make the kswapd profile look worse. But
the majority of our anon memory, including shmem, is not shared but
singly mapped (map_count = 1).

> > Solutions
> > =========
> > Notion of generation numbers
> > ----------------------------
> > The notion of generation numbers introduces a quantitative approach to
> > memory overcommit. A larger number of pages can be spread out across
> > configurable generations, and thus they have relatively low false
> > active/inactive rates. Each generation includes all pages that have
> > been referenced since the last generation.
> >
> > Given an lruvec, scans and the selections between anon and file types
> > are all based on generation numbers, which are simple and yet
> > effective. For different lruvecs, comparisons are still possible based
> > on birth times of generations.
> 
> It means you replace the active/inactive lists with multiple lists, from
> most active to least active?

Precisely.

> > Differential scans via page tables
> > ----------------------------------
> > Each differential scan discovers all pages that have been referenced
> > since the last scan. Specifically, it walks the mm_struct list
> > associated with an lruvec to scan page tables of processes that have
> > been scheduled since the last scan. The cost of each differential scan
> > is roughly proportional to the number of referenced pages it
> > discovers. Unless address spaces are extremely sparse, page tables
> > usually have better memory locality than the rmap. The end result is
> > generally a significant reduction in CPU usage, for most of the
> > systems running cloud workloads.
> 
> How's about unmapped page caches? I think they are still quite common
> for a lot of workloads.

Yes, they are covered too, by mark_page_accessed(), when they are
read/written via file descriptors.

> > On Chrome OS, our real-world benchmark that browses popular websites
> > in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
> > (full) less PSI on v5.11. And kswapd profile looks like:
> >   49.36%  lzo1x_1_do_compress
> >    4.54%  page_vma_mapped_walk
> >    4.45%  memset_erms
> >    3.47%  walk_pte_range
> >    2.88%  zram_bvec_rw
> >
> > In addition, direct reclaim latency is reduced by 22% at 99th
> > percentile and the number of refaults is reduced 7%. These metrics are
> > important to phones and laptops as they are correlated to user
> > experience.
> >
> > Workflow
> > ========
> > Evictable pages are divided into multiple generations for each lruvec.
> > The youngest generation number is stored in lruvec->evictable.max_seq
> > for both anon and file types as they are aged on an equal footing. The
> > oldest generation numbers are stored in lruvec->evictable.min_seq[2]
> > separately for anon and file types as clean file pages can be evicted
> > regardless of may_swap or may_writepage. Generation numbers are
> > truncated into ilog2(MAX_NR_GENS)+1 bits in order to fit into
> > page->flags. The sliding window technique is used to prevent truncated
> > generation numbers from overlapping. Each truncated generation number
> > is an index to
> > lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> > Evictable pages are added to the per-zone lists indexed by max_seq or
> > min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> > faulted in or read ahead. The workflow comprises two conceptually
> > independent functions: the aging and the eviction.
> 
> Could you please illustrate the data structures? I think this would be
> very helpful to understand the code. I haven't looked into the code
> closely yet, per my shallow understanding to the above paragraphs, the
> new lruvec looks like:
> 
> ----------------
> | max_seq      |
> ----------------
> | ......       |
> ----------------        ----------        --------------
> | min_seq      | -----> |  Anon  | -----> |  MAX_ZONE  | ---> list of pages
> ----------------        ----------        --------------
>                         |  File  | --+    |  ......    |
>                         ----------   |    --------------
>                                      |    |  ZONE_DMA  | ---> list of pages
>                                      |    --------------
>                                      +--> (its own per-zone lists,
>                                           same as above)
> 
> And the max_seq/min_seq is per memcg, is my understanding correct?

Yes, on single-node systems. To be precise, they are per lruvec; each
memcg has N lruvecs on an N-node system.

A crude analogy would be a ring buffer of generations: the aging
corresponds to the writer, advancing max_seq, and the eviction to the
reader, advancing min_seq. (The aging only tags pages -- it doesn't
add pages to the lists; page allocations do.)
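
For reference, here is a compilable sketch of the structure described
in the cover letter. The field layout follows the quoted text; the
surrounding type names and the value of MAX_NR_GENS are my own
placeholders, not the actual patch:

  #define MAX_NR_GENS   4   /* placeholder; configurable in the series */
  #define ANON_AND_FILE 2   /* 0 = anon, 1 = file */
  #define MAX_NR_ZONES  5

  struct list_head {        /* stand-in for the kernel's list_head */
          struct list_head *prev, *next;
  };

  struct lru_gen {
          /* the youngest generation number, shared by anon and file */
          unsigned long max_seq;
          /* the oldest generation numbers, per type, as clean file
             pages can be evicted regardless of may_swap/may_writepage */
          unsigned long min_seq[ANON_AND_FILE];
          /* truncated generation numbers (modulo MAX_NR_GENS) index
             the first dimension of these per-zone page lists */
          struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
  };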

> > Aging
> > -----
> > The aging produces young generations. Given an lruvec, the aging scans
> > page tables for referenced pages of this lruvec. Upon finding one, the
> > aging updates its generation number to max_seq. After each round of
> > scan, the aging increments max_seq. The aging maintains either a
> > system-wide mm_struct list or per-memcg mm_struct lists and tracks
> > whether an mm_struct is being used on any CPUs or has been used since
> > the last scan. Multiple threads can concurrently work on the same
> > mm_struct list, and each of them will be given a different mm_struct
> > belonging to a process that has been scheduled since the last scan.
> 
> I don't quite get how the "aging" works. IIUC, you have a dedicated
> kernel thread or threads to scan the page tables periodically to
> update the generations and promote or demote pages among the lists or
> the "aging" just happens in reclaimer?

The aging can happen in any reclaiming thread when, say, "the
inactive" is low. There are no dedicated kernel threads, unless you
count kswapd as one.

For example, for memcg reclaim, we have:
  page charge failure
    memcg reclaim
      select a node
        get lruvec from the node and the memcg
retry:
          if max_seq - min_seq < 2, i.e., no inactive pages
            the aging: scan the mm_struct lists
              increment max_seq
          the eviction: scan the page lists
            if the per-zone lists are empty
              increment min_seq
              goto retry
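
To make the retry loop concrete, here is a toy userspace model of it.
The generation counters behave like a ring buffer; all names and the
bookkeeping are invented for illustration and are not the series'
actual code:

  #include <stdio.h>

  #define MAX_NR_GENS 4

  struct toy_lruvec {
          unsigned long max_seq;        /* youngest generation */
          unsigned long min_seq;        /* oldest (one type only) */
          int nr_pages[MAX_NR_GENS];    /* pages per truncated generation */
  };

  /* the aging: the real series scans page tables and tags referenced
     pages with max_seq before incrementing it; the counter is all we
     model here */
  static void age(struct toy_lruvec *l)
  {
          l->max_seq++;
  }

  /* the eviction: reclaim from the oldest generation, incrementing
     min_seq and retrying whenever that generation is empty */
  static int evict(struct toy_lruvec *l, int want)
  {
          int reclaimed = 0, avail = 0;

          for (int i = 0; i < MAX_NR_GENS; i++)
                  avail += l->nr_pages[i];
          if (want > avail)             /* toy stand-in for the OOM path */
                  want = avail;

          while (reclaimed < want) {
                  int *gen = &l->nr_pages[l->min_seq % MAX_NR_GENS];

                  if (l->max_seq - l->min_seq < 2)  /* no "inactive" gens */
                          age(l);
                  if (!*gen) {
                          l->min_seq++;             /* the "goto retry" above */
                          continue;
                  }
                  (*gen)--;
                  reclaimed++;
          }
          return reclaimed;
  }

  int main(void)
  {
          struct toy_lruvec l = { .max_seq = 3, .nr_pages = { 2, 5, 3, 0 } };
          int n = evict(&l, 6);

          printf("reclaimed %d, min_seq is now %lu\n", n, l.min_seq);
          return 0;
  }

Running it reclaims the two pages of generation 0, advances min_seq
to 1 and takes the remaining four pages from generation 1.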
Dave Hansen March 16, 2021, 2:50 p.m. UTC | #9
On 3/15/21 7:24 PM, Yu Zhao wrote:
> On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote:
>> How bad does this scanning get in the worst case if there's a lot of
>> sharing?
> 
> Actually the improvement is larger when there is more sharing, i.e.,
> higher map_count larger improvement. Let's assume we have a shmem
> page mapped by two processes. To reclaim this page, we need to make
> sure neither PTE from the two sets of page tables has the accessed
> bit. The current page reclaim uses the rmap, i.e., rmap_walk_file().
> It first looks up the two VMAs (from the two processes mapping this
> shmem file) in the interval tree of this shmem file, then from each
> VMA, it goes through PGD/PUD/PMD to reach the PTE. The page can't be
> reclaimed if either of the PTEs has the accessed bit, therefore the
> cost of the scanning is more than proportional to the number of
> accesses when there is a lot of sharing.
> 
> Why does this series make it better? We track the usage of page
> tables. Specifically, we work alongside switch_mm(): if one of the
> processes above hasn't been scheduled since the last scan, we don't
> need to scan
> its page tables. So the cost is roughly proportional to the number of
> accesses, regardless of how many processes. And instead of scanning
> pages one by one, we do it in large batches. However, page tables can
> be very sparse -- this is not a problem for the rmap because it knows
> exactly where the PTEs are (by vma_address()). We only know ranges (by
> vma->vm_start/vm_end). This is where the accessed bit on non-leaf
> PMDs can be of help.

That's an interesting argument.  *But*, this pivoted into describing an
optimization.  My takeaway from this is that large amounts of sharing
are probably only handled well if the processes doing the sharing are
not running constantly.

> But I guess you are wondering what the downsides are. Well, we haven't
> seen any (yet). We do have page cache (non-shmem) heavy workloads,
> but not at a scale large enough to make any statistically meaningful
> observations. We are very interested in working with anybody who has
> page cache (non-shmem) heavy workloads and is willing to try out this
> series.

I would also be very interested to see some synthetic, worst-case
micros.  Maybe take a few thousand processes with very sparse page
tables that all map some shared memory.  They wake up long enough to
touch a few pages, then go back to sleep.

What happens if we do that?  I'm not saying this is a good workload or
that things must behave well, but I do find it interesting to watch the
worst case.

I think it would also be very worthwhile to include some research in
this series about why the kernel moved away from page table scanning.
What has changed?  Are the workloads we were concerned about way back
then not around any more?  Has faster I/O or larger memory sizes with a
stagnating page size changed something?

>> I'm kinda surprised by this, but my 16GB laptop has a lot more page
>> cache than I would have guessed:
>>
>>> Active(anon):    4065088 kB
>>> Inactive(anon):  3981928 kB
>>> Active(file):    2260580 kB
>>> Inactive(file):  3738096 kB
>>> AnonPages:       6624776 kB
>>> Mapped:           692036 kB
>>> Shmem:            776276 kB
>>
>> Most of it isn't mapped, but it's far from all being used for text.
> 
> We have categorized two groups:
>   1) average users that haven't experienced memory pressure since
>   their systems have booted. The booting process fills up page cache
>   with one-off file pages, and they remain until users experience
>   memory pressure. This can be confirmed by looking at those counters
>   of a freshly rebooted and idle system. My guess this is the case for
>   your laptop.

It's been up ~12 days.  There is ~10GB of data in swap, and there's been
a lot of scanning activity which I would associate with memory pressure:

> SwapCached:      1187596 kB
> SwapTotal:      51199996 kB
> SwapFree:       40419428 kB
...
> nr_vmscan_write 24900719
> nr_vmscan_immediate_reclaim 115535
> pgscan_kswapd 320831544
> pgscan_direct 23396383
> pgscan_direct_throttle 0
> pgscan_anon 127491077
> pgscan_file 216736850
> slabs_scanned 400469680
> compact_migrate_scanned 1092813949
> compact_free_scanned 4919523035
> compact_daemon_migrate_scanned 2372223
> compact_daemon_free_scanned 20989310
> unevictable_pgs_scanned 307388545


>   2) engineering users who store git repos and compile locally. They
>   complained about their browsers being janky because anon memory got
>   swapped even though their systems had a lot of stale file pages in
>   page cache, with the current page reclaim. They are what we consider
>   part of the page cache (non-shmem) heavy group.

Interesting.  You shouldn't have a shortage of folks like that among
kernel developers.
Yu Zhao March 16, 2021, 8:30 p.m. UTC | #10
On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
> On 3/15/21 7:24 PM, Yu Zhao wrote:
> > On Mon, Mar 15, 2021 at 11:00:06AM -0700, Dave Hansen wrote:
> >> How bad does this scanning get in the worst case if there's a lot of
> >> sharing?
> > 
> > Actually the improvement is larger when there is more sharing, i.e.,
> > higher map_count larger improvement. Let's assume we have a shmem
> > page mapped by two processes. To reclaim this page, we need to make
> > sure neither PTE from the two sets of page tables has the accessed
> > bit. The current page reclaim uses the rmap, i.e., rmap_walk_file().
> > It first looks up the two VMAs (from the two processes mapping this
> > shmem file) in the interval tree of this shmem file, then from each
> > VMA, it goes through PGD/PUD/PMD to reach the PTE. The page can't be
> > reclaimed if either of the PTEs has the accessed bit, therefore the
> > cost of the scanning is more than proportional to the number of
> > accesses when there is a lot of sharing.
> > 
> > Why does this series make it better? We track the usage of page
> > tables. Specifically, we work alongside switch_mm(): if one of the
> > processes above hasn't been scheduled since the last scan, we don't
> > need to scan
> > its page tables. So the cost is roughly proportional to the number of
> > accesses, regardless of how many processes. And instead of scanning
> > pages one by one, we do it in large batches. However, page tables can
> > be very sparse -- this is not a problem for the rmap because it knows
> > exactly where the PTEs are (by vma_address()). We only know ranges (by
> > vma->vm_start/vm_end). This is where the accessed bit on non-leaf
> > PMDs can be of help.
> 
> That's an interesting argument.  *But*, this pivoted into describing an
> optimization.  My takeaway from this is that large amounts of sharing
> are probably only handled well if the processes doing the sharing are
> not running constantly.
> 
> > But I guess you are wondering what the downsides are. Well, we haven't
> > seen any (yet). We do have page cache (non-shmem) heavy workloads,
> > but not at a scale large enough to make any statistically meaningful
> > observations. We are very interested in working with anybody who has
> > page cache (non-shmem) heavy workloads and is willing to try out this
> > series.
> 
> I would also be very interested to see some synthetic, worst-case
> micros.  Maybe take a few thousand processes with very sparse page
> tables that all map some shared memory.  They wake up long enough to
> touch a few pages, then go back to sleep.
> 
> What happens if we do that?  I'm not saying this is a good workload or
> that things must behave well, but I do find it interesting to watch the
> worst case.

It is a reasonable request, thank you. I've just opened a bug to cover
this case (a large sparse shared shmem) and we'll have something soon.

> I think it would also be very worthwhile to include some research in
> this series about why the kernel moved away from page table scanning.
> What has changed?  Are the workloads we were concerned about way back
> then not around any more?  Has faster I/O or larger memory sizes with a
> stagnating page size changed something?

Sure. Hugh suggested this too, but I personally found that ancient
pre-2.4 history too irrelevant (and uninteresting) to the modern age
and decided to spare the audience the boredom.

> >> I'm kinda surprised by this, but my 16GB laptop has a lot more page
> >> cache than I would have guessed:
> >>
> >>> Active(anon):    4065088 kB
> >>> Inactive(anon):  3981928 kB
> >>> Active(file):    2260580 kB
> >>> Inactive(file):  3738096 kB
> >>> AnonPages:       6624776 kB
> >>> Mapped:           692036 kB
> >>> Shmem:            776276 kB
> >>
> >> Most of it isn't mapped, but it's far from all being used for text.
> > 
> > We have categorized two groups:
> >   1) average users that haven't experienced memory pressure since
> >   their systems have booted. The booting process fills up page cache
> >   with one-off file pages, and they remain until users experience
> >   memory pressure. This can be confirmed by looking at those counters
> >   of a freshly rebooted and idle system. My guess this is the case for
> >   your laptop.
> 
> It's been up ~12 days.  There is ~10GB of data in swap, and there's been
> a lot of scanning activity which I would associate with memory pressure:
> 
> > SwapCached:      1187596 kB
> > SwapTotal:      51199996 kB
> > SwapFree:       40419428 kB
> ...
> > nr_vmscan_write 24900719
> > nr_vmscan_immediate_reclaim 115535
> > pgscan_kswapd 320831544
> > pgscan_direct 23396383
> > pgscan_direct_throttle 0
> > pgscan_anon 127491077
> > pgscan_file 216736850
> > slabs_scanned 400469680
> > compact_migrate_scanned 1092813949
> > compact_free_scanned 4919523035
> > compact_daemon_migrate_scanned 2372223
> > compact_daemon_free_scanned 20989310
> > unevictable_pgs_scanned 307388545

10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
The file rss does seem a bit high to me; my wild speculation is that
there have been git/make activities in addition to a VM?
Dave Hansen March 16, 2021, 9:14 p.m. UTC | #11
On 3/16/21 1:30 PM, Yu Zhao wrote:
> On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
>> I think it would also be very worthwhile to include some research in
>> this series about why the kernel moved away from page table scanning.
>> What has changed?  Are the workloads we were concerned about way back
>> then not around any more?  Has faster I/O or larger memory sizes with a
>> stagnating page size changed something?
> 
> Sure. Hugh also suggested this too but I personally found that ancient
> pre-2.4 history too irrelevant (and uninteresting) to the modern age
> and decided to spare audience of the boredom.

IIRC, rmap chains showed up in the 2.5 era and the VM was quite bumpy
until anon_vmas came around, which was early-ish in the 2.6 era.

But, either way, I think there is a sufficient population of nostalgic
crusty old folks around to warrant a bit of a history lesson.  We'll
enjoy the trip down memory lane, fondly remembering the old days in
Ottawa...

>>> nr_vmscan_write 24900719
>>> nr_vmscan_immediate_reclaim 115535
>>> pgscan_kswapd 320831544
>>> pgscan_direct 23396383
>>> pgscan_direct_throttle 0
>>> pgscan_anon 127491077
>>> pgscan_file 216736850
>>> slabs_scanned 400469680
>>> compact_migrate_scanned 1092813949
>>> compact_free_scanned 4919523035
>>> compact_daemon_migrate_scanned 2372223
>>> compact_daemon_free_scanned 20989310
>>> unevictable_pgs_scanned 307388545
> 
> 10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
> The file rss does seem a bit high to me, my wild speculation is there
> have been git/make activities in addition to a VM?

I wish I was doing more git/make activities.  It's been an annoying
amount of email and web browsers for 12 days.  If anything, I'd suspect
that Thunderbird is at fault for keeping a bunch of mail in the page
cache.  There are a couple of VM's running though.
Yu Zhao April 10, 2021, 9:21 a.m. UTC | #12
On Tue, Mar 16, 2021 at 02:14:43PM -0700, Dave Hansen wrote:
> On 3/16/21 1:30 PM, Yu Zhao wrote:
> > On Tue, Mar 16, 2021 at 07:50:23AM -0700, Dave Hansen wrote:
> >> I think it would also be very worthwhile to include some research in
> >> this series about why the kernel moved away from page table scanning.
> >> What has changed?  Are the workloads we were concerned about way back
> >> then not around any more?  Has faster I/O or larger memory sizes with a
> >> stagnating page size changed something?
> > 
> > Sure. Hugh also suggested this too but I personally found that ancient
> > pre-2.4 history too irrelevant (and uninteresting) to the modern age
> > and decided to spare audience of the boredom.
> 
> IIRC, rmap chains showed up in the 2.5 era and the VM was quite bumpy
> until anon_vmas came around, which was early-ish in the 2.6 era.
> 
> But, either way, I think there is a sufficient population of nostalgic
> crusty old folks around to warrant a bit of a history lesson.  We'll
> enjoy the trip down memory lane, fondly remembering the old days in
> Ottawa...
> 
> >>> nr_vmscan_write 24900719
> >>> nr_vmscan_immediate_reclaim 115535
> >>> pgscan_kswapd 320831544
> >>> pgscan_direct 23396383
> >>> pgscan_direct_throttle 0
> >>> pgscan_anon 127491077
> >>> pgscan_file 216736850
> >>> slabs_scanned 400469680
> >>> compact_migrate_scanned 1092813949
> >>> compact_free_scanned 4919523035
> >>> compact_daemon_migrate_scanned 2372223
> >>> compact_daemon_free_scanned 20989310
> >>> unevictable_pgs_scanned 307388545
> > 
> > 10G swap + 8G anon rss + 6G file rss, hmm... an interesting workload.
> > The file rss does seem a bit high to me, my wild speculation is there
> > have been git/make activities in addition to a VM?
> 
> I wish I was doing more git/make activities.  It's been an annoying
> amount of email and web browsers for 12 days.  If anything, I'd suspect
> that Thunderbird is at fault for keeping a bunch of mail in the page
> cache.  There are a couple of VM's running though.

Hi Dave,

Sorry for the late reply. Here is the benchmark result from the worst
case scenario.

As you suggested, we created a lot of processes sharing one large
sparse shmem, and they access the shmem at random 2MB-aligned offsets.
Since a PTE table maps 2MB (on x86_64), there will be at most one
valid PTE entry per PTE table, hence the worst case scenario for the
multigenerational LRU, since it is based on page table scanning.
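
For the record, a minimal sketch of such a test. The real test binary
("sparse") is not published here, so the fork topology, RNG and sleep
interval below are my assumptions; only the sizes come from the
configuration listed further down:

  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define SHMEM_SIZE    (256UL << 30)   /* 256GB */
  #define NR_PROCESSES  450
  #define STRIDE        (2UL << 20)     /* random 2MB-aligned offsets */

  int main(void)
  {
          /* MAP_SHARED|MAP_ANONYMOUS memory is shmem-backed on Linux */
          char *p = mmap(NULL, SHMEM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED)
                  return 1;

          for (int i = 1; i < NR_PROCESSES; i++)
                  if (fork() == 0)
                          break;        /* children inherit the mapping */

          srand(getpid());
          for (;;) {
                  /* one byte per 2MB: at most one valid PTE per PTE table */
                  p[(rand() % (SHMEM_SIZE / STRIDE)) * STRIDE] = 1;
                  usleep(1000);         /* touch, then go back to sleep */
          }
  }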

TL;DR: the multigenerational LRU did not perform worse than the rmap.

My test configurations:

  The size of the shmem: 256GB
  The number of processes: 450
  Total memory size: 200GB
  The number of CPUs: 64
  The number of nodes: 2

There is no clear winner in the background reclaim path (kswapd).

  kswapd (5.12.0-rc6):
    43.99%  kswapd1  page_vma_mapped_walk
    34.86%  kswapd0  page_vma_mapped_walk
     2.43%  kswapd0  count_shadow_nodes
     1.17%  kswapd1  page_referenced_one
     1.15%  kswapd0  _find_next_bit.constprop.0
     0.95%  kswapd0  page_referenced_one
     0.87%  kswapd1  try_to_unmap_one
     0.75%  kswapd0  cpumask_next
     0.67%  kswapd0  shrink_slab
     0.66%  kswapd0  down_read_trylock

  kswapd (the multigenerational LRU):
    33.39%  kswapd0  walk_pud_range
    10.93%  kswapd1  walk_pud_range
     9.36%  kswapd0  page_vma_mapped_walk
     7.15%  kswapd1  page_vma_mapped_walk
     3.83%  kswapd0  count_shadow_nodes
     2.60%  kswapd1  shrink_slab
     2.47%  kswapd1  down_read_trylock
     2.03%  kswapd0  _raw_spin_lock
     1.87%  kswapd0  shrink_slab
     1.67%  kswapd1  count_shadow_nodes

The multigenerational LRU is somewhat winning in the direct reclaim
path (sparse is the test binary name):

  The test process context (5.12.0-rc6):
    65.02%  sparse   page_vma_mapped_walk
     5.49%  sparse   page_counter_try_charge
     3.60%  sparse   propagate_protected_usage
     2.31%  sparse   page_counter_uncharge
     2.06%  sparse   count_shadow_nodes
     1.81%  sparse   native_queued_spin_lock_slowpath
     1.79%  sparse   down_read_trylock
     1.67%  sparse   page_referenced_one
     1.42%  sparse   shrink_slab
     0.87%  sparse   try_to_unmap_one

  CPU % (direct reclaim vs the rest): 71% vs 29%
  # grep oom_kill /proc/vmstat
  oom_kill 81

  The test process context (the multigenerational LRU):
    33.12%  sparse   page_vma_mapped_walk
    10.70%  sparse   walk_pud_range
     9.64%  sparse   page_counter_try_charge
     6.63%  sparse   propagate_protected_usage
     4.43%  sparse   native_queued_spin_lock_slowpath
     3.85%  sparse   page_counter_uncharge
     3.71%  sparse   irqentry_exit_to_user_mode
     2.16%  sparse   _raw_spin_lock
     1.83%  sparse   unmap_page_range
     1.82%  sparse   shrink_slab

  CPU % (direct reclaim vs the rest): 47% vs 53%
  # grep oom_kill /proc/vmstat
  oom_kill 80

I also compared other numbers from /proc/vmstat. They do not provide
any additional insight beyond the profiles, so I will just omit them
here.

The following optimizations and the stats measuring their efficacies
explain why the multigenerational LRU did not perform worse:

  Optimization 1: take advantage of the scheduling information.
    # of active processes           270
    # of inactive processes         105

  Optimization 2: take advantage of the accessed bit on non-leaf
  PMD entries.
    # of old non-leaf PMD entries   30523335
    # of young non-leaf PMD entries 1358400

These stats are not currently included. But I will add them to the
debugfs interface in the next version, coming soon. And I will also
add another optimization for Android. It reduces zigzags when there
are many single-page VMAs, i.e., not returning to the PGD table for
each such VMA. Just a heads-up.
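
To illustrate optimization 2, here is a toy userspace model of why
the accessed bit on non-leaf entries matters for sparse page tables.
The two-level array and the counters are invented for illustration;
the real series tests the hardware accessed bit on PMD entries:

  #include <stdbool.h>
  #include <stdio.h>

  #define PTRS_PER_PMD 512
  #define PTRS_PER_PTE 512

  struct pmd_entry {
          bool young;                   /* accessed bit on the non-leaf entry */
          bool pte_young[PTRS_PER_PTE]; /* accessed bits on the leaf entries */
  };

  static long scan(struct pmd_entry *pmds, long *visited)
  {
          long young = 0;

          for (int i = 0; i < PTRS_PER_PMD; i++) {
                  if (!pmds[i].young)
                          continue;     /* skip 512 PTEs with one test */
                  for (int j = 0; j < PTRS_PER_PTE; j++) {
                          (*visited)++;
                          young += pmds[i].pte_young[j];
                  }
          }
          return young;
  }

  int main(void)
  {
          static struct pmd_entry pmds[PTRS_PER_PMD];
          long visited = 0;

          pmds[7].young = true;         /* one sparsely used 2MB region */
          pmds[7].pte_young[3] = true;

          long young = scan(pmds, &visited);

          printf("young %ld, PTEs visited %ld of %d\n",
                 young, visited, PTRS_PER_PMD * PTRS_PER_PTE);
          return 0;
  }

With only one young non-leaf entry, the walk visits 512 PTEs instead
of 262144, which mirrors the old vs young non-leaf PMD stats above.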

The rmap, on the other hand, had to
  1) lock each (shmem) page it scanned
  2) go through all five levels of page tables for each page, even
  though many of those walks share the same lowest common ancestors
  (LCAs)
during the test. The second part is made worse by the fact that I
have 5 levels of page tables configured.

Any additional benchmarks you would suggest? Thanks.
Huang, Ying April 13, 2021, 3:02 a.m. UTC | #13
Yu Zhao <yuzhao@google.com> writes:

> [...]
>
> Hi Dave,
>
> Sorry for the late reply. Here is the benchmark result from the worst
> case scenario.
>
> As you suggested, we create a lot of processes sharing one large
> sparse shmem, and they access the shmem at random 2MB-aligned offsets.
> So there will be at most one valid PTE entry per PTE table, hence the
> worst case scenario for the multigenerational LRU, since it is based
> on page table scanning.
>
> TL;DR: the multigenerational LRU did not perform worse than the rmap.
>
> My test configurations:
>
>   The size of the shmem: 256GB
>   The number of processes: 450
>   Total memory size: 200GB
>   The number of CPUs: 64
>   The number of nodes: 2
>
> There is no clear winner in the background reclaim path (kswapd).
>
>   kswapd (5.12.0-rc6):
>     43.99%  kswapd1  page_vma_mapped_walk
>     34.86%  kswapd0  page_vma_mapped_walk
>      2.43%  kswapd0  count_shadow_nodes
>      1.17%  kswapd1  page_referenced_one
>      1.15%  kswapd0  _find_next_bit.constprop.0
>      0.95%  kswapd0  page_referenced_one
>      0.87%  kswapd1  try_to_unmap_one
>      0.75%  kswapd0  cpumask_next
>      0.67%  kswapd0  shrink_slab
>      0.66%  kswapd0  down_read_trylock
>
>   kswapd (the multigenerational LRU):
>     33.39%  kswapd0  walk_pud_range
>     10.93%  kswapd1  walk_pud_range
>      9.36%  kswapd0  page_vma_mapped_walk
>      7.15%  kswapd1  page_vma_mapped_walk
>      3.83%  kswapd0  count_shadow_nodes
>      2.60%  kswapd1  shrink_slab
>      2.47%  kswapd1  down_read_trylock
>      2.03%  kswapd0  _raw_spin_lock
>      1.87%  kswapd0  shrink_slab
>      1.67%  kswapd1  count_shadow_nodes
>
> The multigenerational LRU is somewhat winning in the direct reclaim
> path (sparse is the test binary name):
>
>   The test process context (5.12.0-rc6):
>     65.02%  sparse   page_vma_mapped_walk
>      5.49%  sparse   page_counter_try_charge
>      3.60%  sparse   propagate_protected_usage
>      2.31%  sparse   page_counter_uncharge
>      2.06%  sparse   count_shadow_nodes
>      1.81%  sparse   native_queued_spin_lock_slowpath
>      1.79%  sparse   down_read_trylock
>      1.67%  sparse   page_referenced_one
>      1.42%  sparse   shrink_slab
>      0.87%  sparse   try_to_unmap_one
>
>   CPU % (direct reclaim vs the rest): 71% vs 29%
>   # grep oom_kill /proc/vmstat
>   oom_kill 81
>
>   The test process context (the multigenerational LRU):
>     33.12%  sparse   page_vma_mapped_walk
>     10.70%  sparse   walk_pud_range
>      9.64%  sparse   page_counter_try_charge
>      6.63%  sparse   propagate_protected_usage
>      4.43%  sparse   native_queued_spin_lock_slowpath
>      3.85%  sparse   page_counter_uncharge
>      3.71%  sparse   irqentry_exit_to_user_mode
>      2.16%  sparse   _raw_spin_lock
>      1.83%  sparse   unmap_page_range
>      1.82%  sparse   shrink_slab
>
>   CPU % (direct reclaim vs the rest): 47% vs 53%
>   # grep oom_kill /proc/vmstat
>   oom_kill 80
>
> I also compared other numbers from /proc/vmstat. They do not provide
> any additional insight than the profiles, so I will just omit them
> here.
>
> The following optimizations and the stats measuring their efficacies
> explain why the multigenerational LRU did not perform worse:
>
>   Optimization 1: take advantage of the scheduling information.
>     # of active processes           270
>     # of inactive processes         105
>
>   Optimization 2: take the advantage of the accessed bit on non-leaf
>   PMD entries.
>     # of old non-leaf PMD entries   30523335
>     # of young non-leaf PMD entries 1358400
>
> These stats are not currently included. But I will add them to the
> debugfs interface in the next version coming soon. And I will also add
> another optimization for Android. It reduces zigzags when there are
> many single-page VMAs, i.e., not returning to the PGD table for each
> of such VMAs. Just a heads-up.
>
> The rmap, on the other hand, had to
>   1) lock each (shmem) page it scans
>   2) go through five levels of page tables for each page, even though
>   some of them have the same LCAs
> during the test. The second part is worse given that I have 5 levels
> of page tables configured.
>
> Any additional benchmarks you would suggest? Thanks.

Hi, Yu,

Thanks for your data.

In addition to the data you measured above, is it possible for you to
measure some raw data?  For example, how many CPU cycles does it take
to scan all pages in the system?  For the page table scanning, the
page tables of all processes will be scanned.  For the rmap scanning,
all pages on the LRU will be scanned.  And we can do that with
different parameters, for example, shared vs. non-shared, sparse vs.
dense.  Then we can get an idea about how fast the page table
scanning can be.

Best Regards,
Huang, Ying
Yu Zhao April 13, 2021, 11 p.m. UTC | #14
On Mon, Apr 12, 2021 at 9:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> [...]
>
> Hi, Yu,
>
> Thanks for your data.
>
> In addition to the data you measured above, is it possible for you to
> measure some raw data?  For example, how many CPU cycles does it take
> to scan all pages in the system?  For the page table scanning, the
> page tables of all processes will be scanned.  For the rmap scanning,
> all pages on the LRU will be scanned.  And we can do that with
> different parameters, for example, shared vs. non-shared, sparse vs.
> dense.  Then we can get an idea about how fast the page table
> scanning can be.

SGTM. I'll get back to you later.