[v5,00/10] Multigenerational LRU Framework

Message ID	20211111041510.402534-1-yuzhao@google.com (mailing list archive)
Headers	show Return-Path: <SRS0=7n0b=P6=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 9F5F86134F Date: Wed, 10 Nov 2021 21:15:00 -0700 Message-Id: <20211111041510.402534-1-yuzhao@google.com> Mime-Version: 1.0 Subject: [PATCH v5 00/10] Multigenerational LRU Framework From: Yu Zhao <yuzhao@google.com> To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, x86@kernel.org, page-reclaim@google.com, holger@applied-asynchrony.com, iam@valdikss.org.ru, Yu Zhao <yuzhao@google.com> Content-Type: text/plain; charset="UTF-8" Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Multigenerational LRU Framework \| expand [v5,00/10] Multigenerational LRU Framework [v5,01/10] mm: x86, arm64: add arch_has_hw_pte_young() [v5,02/10] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG [v5,03/10] mm/vmscan.c: refactor shrink_node() [v5,04/10] mm: multigenerational lru: groundwork [v5,05/10] mm: multigenerational lru: mm_struct list [v5,06/10] mm: multigenerational lru: aging [v5,07/10] mm: multigenerational lru: eviction [v5,08/10] mm: multigenerational lru: user interface [v5,09/10] mm: multigenerational lru: Kconfig [v5,10/10] mm: multigenerational lru: documentation

Message ID

20211111041510.402534-1-yuzhao@google.com (mailing list archive)

Headers

DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 9F5F86134F
Date: Wed, 10 Nov 2021 21:15:00 -0700
Message-Id: <20211111041510.402534-1-yuzhao@google.com>
Mime-Version: 1.0
Subject: [PATCH v5 00/10] Multigenerational LRU Framework
From: Yu Zhao <yuzhao@google.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	x86@kernel.org, page-reclaim@google.com, holger@applied-asynchrony.com,
	iam@valdikss.org.ru, Yu Zhao <yuzhao@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

Multigenerational LRU Framework | expand

Message

Yu Zhao Nov. 11, 2021, 4:15 a.m. UTC

What's new in v5
================
1) Rebased to v5.15 per users' requests
2) Performance benchmark reports on the previous version:
   https://patchwork.kernel.org/project/linux-mm/cover/20210818063107.2696454-1-yuzhao@google.com/#24506153

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Problems
========
Active/inactive
---------------
Data centers need to predict whether a job can successfully land on a
machine without actually impacting the existing jobs. The granularity
of the active/inactive is too coarse to be useful for job schedulers
to make such decisions. In addition, data centers need to monitor
their memory utilization for horizontal scaling. The active/inactive
are relative terms and therefore cannot give any insight into a pool
of machines, e.g., aggregating the active/inactive across multiple
machines without a common frame of reference yields no meaningful
results.

Phones and laptops need to make good choices about what to evict
because they are more sensitive to major faults and power consumption.
Major faults can cause janks, i.e., slow UI renderings, and negatively
impact user experience. The selection between anon and file types has
been suboptimal because it is difficult to compare the access patterns
of the two types. On phones and laptops, executable pages are
frequently evicted despite the fact that there are many less
frequently used anon pages. Conversely, on workstations building large
projects, anon pages are sometimes swapped out while there are many
less recently used file pages.

Fundamentally, the notion of active/inactive has very limited ability
to measure temporal locality.

Rmap walk
---------
Traversing a list of pages and searching the rmap for PTEs mapping
each page can be very expensive because those pages are likely to be
unrelated. For workloads using a high percentage of anon memory, the
rmap becomes a bottleneck in page reclaim. For example, kswapd can
easily spend more CPU time in the rmap than in anything else on
laptops running Chrome. And the kernel can spend more CPU time in the
rmap than in any other functions on servers that heavily overcommit
anon memory.

Simply put, it does not take advantage of spatial locality when using
the rmap to test the accessed bit over a large number of pages.

Solutions
=========
Generations
-----------
This solution introduces a temporal dimension. Each generation is a
dot on the timeline and its population includes all mapped pages that
have been accessed since the birth of this generation.

All eviction choices are made based on generation numbers, which are
simple and yet effective. A large number of pages can be spread out
across many generations. Since each generation is timestamped at
birth, its population is aggregatable across different machines. This
is especially useful for data centers that require working set
estimation and proactive reclaim.

Page table walk
---------------
Each walk traverses an mm_struct list to scan PTEs for accessed pages
only. Processes that have been sleeping since the last walk are
skipped. The cost of this solution is roughly proportional to the
number of accessed pages. Since page tables usually have good spatial
locality for workloads using a high percentage of anon memory, the end
result is generally a significant reduction in kswapd CPU usage.

Note that page table walks are conditional and therefore do not
replace the rmap. For workloads that have sparse mappings, this
solution falls back to the rmap.

Use cases
=========
Page cache overcommit
---------------------
Tiers within each generation are specifically designed to improve the
performance of page cache under memory pressure. The fio/io_uring
benchmark shows 14% increase in IOPS for buffered I/O.

Without this patchset, the profile of fio/io_uring looks like:
  12.03%  __page_cache_alloc
   6.53%  shrink_active_list
   2.53%  mark_page_accessed

With this patchset, it looks like:
   9.45%  __page_cache_alloc
   0.52%  mark_page_accessed

Essentially, the idea of tiers is a feedback loop based on trial and
error. Instead of unconditionally moving file pages to the active list
upon the second access, this solution monitors refaults and
conditionally protects file pages with outlying refaults.

Anon memory overcommit
----------------------
Our real-world benchmark that browses popular websites in multiple
Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% (full)
less PSI.

Without this patchset, the profile of kswapd looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

With this patchset, it looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at 99th
percentile and the number of refaults is reduced by 7%. Both metrics
are important to phones and laptops as they are highly correlated to
user experience.

Working set estimation
----------------------
Userspace can invoke the aging by writing "+ memcg_id node_id max_gen
[swappiness]" to /sys/kernel/debug/lru_gen. Reading this debugfs
interface returns the birth time and the population of each
generation.

Given a pool of machines, by periodically invoking the aging, a job
scheduler is able to rank these machines based on the sizes of their
working sets and in turn selects the most ideal ones to land new jobs.

Proactive reclaim
-----------------
Userspace can invoke the eviction by writing "- memcg_id node_id
min_gen [swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen.
Multiple command lines are supported, as is concatenation with
delimiters "," and ";".

A typical use case is that a job scheduler invokes the eviction in
anticipation of new jobs. The savings from proactive reclaim can
provide certain SLA to landing these new jobs.

Yu Zhao (10):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: groundwork
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: aging
  mm: multigenerational lru: eviction
  mm: multigenerational lru: user interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst          |    1 +
 Documentation/vm/multigen_lru.rst   |  132 ++
 arch/Kconfig                        |    9 +
 arch/arm64/include/asm/cpufeature.h |    5 +
 arch/arm64/include/asm/pgtable.h    |   13 +-
 arch/arm64/kernel/cpufeature.c      |   10 +
 arch/arm64/tools/cpucaps            |    1 +
 arch/x86/Kconfig                    |    1 +
 arch/x86/include/asm/pgtable.h      |    9 +-
 arch/x86/mm/pgtable.c               |    5 +-
 fs/exec.c                           |    2 +
 fs/fuse/dev.c                       |    3 +-
 include/linux/cgroup.h              |   15 +-
 include/linux/memcontrol.h          |    7 +
 include/linux/mm.h                  |   36 +
 include/linux/mm_inline.h           |  198 ++
 include/linux/mm_types.h            |  106 ++
 include/linux/mmzone.h              |  175 ++
 include/linux/nodemask.h            |    1 +
 include/linux/oom.h                 |   16 +
 include/linux/page-flags-layout.h   |   19 +-
 include/linux/page-flags.h          |    4 +-
 include/linux/pgtable.h             |   17 +-
 include/linux/sched.h               |    3 +
 include/linux/swap.h                |    3 +
 kernel/bounds.c                     |    3 +
 kernel/cgroup/cgroup-internal.h     |    1 -
 kernel/exit.c                       |    1 +
 kernel/fork.c                       |   10 +
 kernel/kthread.c                    |    1 +
 kernel/sched/core.c                 |    2 +
 mm/Kconfig                          |   59 +
 mm/huge_memory.c                    |    3 +-
 mm/memcontrol.c                     |   31 +
 mm/memory.c                         |   21 +-
 mm/mm_init.c                        |    6 +-
 mm/oom_kill.c                       |    4 +-
 mm/page_alloc.c                     |    1 +
 mm/rmap.c                           |    8 +
 mm/swap.c                           |   51 +-
 mm/swapfile.c                       |    2 +
 mm/vmscan.c                         | 2697 ++++++++++++++++++++++++++-
 mm/workingset.c                     |  120 +-
 43 files changed, 3679 insertions(+), 133 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

Comments

bot@edi.works Nov. 22, 2021, 5:32 a.m. UTC | #1

Kernel / Redis benchmark with MGLRU

TLDR
====
With the MGLRU, Redis achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]%,
[11.47, 19.36]%, [1.27, 3.54]%, [10.11, 14.81]% and [8.75, 13.64]%
more operations per second (OPS), respectively, for sequential access
w/ THP=always, random access w/ THP=always, Gaussian (distribution)
access w/ THP=always, sequential access w/ THP=never, random access
w/ THP=never and Gaussian access w/ THP=never.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradation and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

Redis is one of the most popular open-source in-memory KV stores.
memtier_benchmark is the leading open-source KV store benchmarking
software that supports multiple access patterns. THP can have a
negative effect under memory pressure, due to internal and/or
external fragmentations.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.15
* Patched: 5.15 + MGLRU

Memory utilization: % of memory size
* Underutilizing: N/A
* Overcommitting: ~10% swapped out (zram)

THP (2MB Transparent Huge Pages):
* Always
* Never

Access patterns (4kB objects, 100% read):
* Parallel sequential
* Uniform random
* Gaussian (SD = 1/6 of key range)

Concurrency: average # of users per CPU
* Low: 1

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~25

Note that the goal of this benchmark is to compare the performance
for the same key range, object size, and hit ratio. Since Redis does
not support eviction to backing storage, it would require fewer
in-memory objects to underutilize memory, which reduces the hit ratio
and therefore is not applicable in this case.

Procedure
=========
The latest MGLRU patchset for the 5.15 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/2

Baseline and patched 5.15 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
<duplicate Redis service>

<for each kernel>
    grub-set-default <baseline, patched>
    <for each THP setting>
        echo <always, never> >/sys/kernel/mm/transparent_hugepage/enabled
        <for each access pattern>
            <update run_memtier.sh>
            <for each data point>
                reboot
                run_memtier.sh
                <collect total OPS>

Note that the OSS version of Redis does not support sharding, i.e.,
one service uses a single thread to serve all connections. Therefore,
on larger machines, multiple Redis services are required to achieve
better throughput.

Hardware
========
Memory (GB): 256
CPU (total #): 48
NVMe SSD (GB): 1024

OS
==
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.10
DISTRIB_CODENAME=impish
DISTRIB_DESCRIPTION="Ubuntu 21.10"

$ cat /proc/swaps
Filename        Type          Size         Used         Priority
/dev/zram0      partition     10485756     0            1
/dev/zram1      partition     10485756     0            1
/dev/zram2      partition     10485756     0            1
/dev/zram3      partition     10485756     0            1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

Redis
=====
$ redis-server -v
Redis server v=6.0.15 sha=00000000:0 malloc=jemalloc-5.2.1 bits=64
build=4610f4c3acf7fb25

$ cat /etc/redis/redis.conf
<existing parameters>
save ""
unixsocket /var/run/redis/redis-server.sock

memtier_benchmark
=================
$ memtier_benchmark -v
memtier_benchmark 1.3.0
Copyright (C) 2011-2020 Redis Labs Ltd.
This is free software.  You may redistribute copies of it under the
terms of the GNU General Public License
<http://www.gnu.org/licenses/gpl.html>.  There is NO WARRANTY, to the
extent permitted by law.

$ cat run_memtier.sh
# load objects
for ((i = 0; i < 12; i++))
do
    memtier_benchmark -S /var/run/redis$i/redis-server.sock -P redis \
        -n allkeys -c 4 -t 4 --ratio 1:0 --pipeline 8 -d 4000 \
        --key-minimum=1 --key-maximum=5300000 --key-pattern=P:P &
done

wait

# run benchmark
for ((i = 0; i < 12; i++))
do
    memtier_benchmark -S /var/run/redis$i/redis-server.sock -P redis \
        --test-time=1200 -c 4 -t 4 --ratio 0:1 --pipeline 8 \
        --randomize --distinct-client-seed --key-minimum=1 \
        --key-maximum=5300000 --key-pattern=<P:P, R:R, G:G> &
done

wait

Results
=======
Comparing the patched with the baseline kernel, Redis achieved 95%
CIs [0.58, 5.94]%, [6.55, 14.58]%, [11.47, 19.36]%, [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more OPS, respectively, for
sequential access w/ THP=always, random access w/ THP=always,
Gaussian access w/ THP=always, sequential access w/ THP=never, random
access w/ THP=never and Gaussian access w/ THP=never.

+---------------------------+------------------+------------------+
| Mean million OPS [95% CI] | THP=always       | THP=never        |
+---------------------------+------------------+------------------+
| Sequential access         | 1.84 / 1.9       | 1.702 / 1.743    |
|                           | [0.01, 0.109]    | [0.021, 0.06]    |
+---------------------------+------------------+------------------+
| Random access             | 1.742 / 1.926    | 1.493 / 1.679    |
|                           | [0.114, 0.253]   | [0.15, 0.221]    |
+---------------------------+------------------+------------------+
| Gaussian access           | 1.771 / 2.044    | 1.635 / 1.818    |
|                           | [0.203, 0.342]   | [0.143, 0.222]   |
+---------------------------+------------------+------------------+
Table 1. Comparison between the baseline and patched kernels

Comparing THP=never with THP=always, Redis achieved 95% CIs [-8.66,
-6.34]%, [-17.6, -10.98]% and [-10.92, -4.44]% more OPS, respectively,
for sequential access, random access and Gaussian access when using
the baseline kernel; 95% CIs [-10.83, -5.7]%, [-15.72, -9.93]% and
[-13.92, -8.19]% more OPS, respectively, for sequential access, random
access and Gaussian access when using the patched kernel.

+---------------------------+------------------+------------------+
| Mean million OPS [95% CI] | Baseline kernel  | Patched kernel   |
+---------------------------+------------------+------------------+
| Sequential access         | 1.84 / 1.702     | 1.9 / 1.743      |
|                           | [-0.159, -0.116] | [-0.205, -0.108] |
+---------------------------+------------------+------------------+
| Random access             | 1.742 / 1.493    | 1.926 / 1.679    |
|                           | [-0.306, -0.191] | [-0.302, -0.191] |
+---------------------------+------------------+------------------+
| Gaussian access           | 1.771 / 1.635    | 2.044 / 1.818    |
|                           | [-0.193, -0.078] | [-0.284, -0.167] |
+---------------------------+------------------+------------------+
Table 2. Comparison between THP=always and THP=never

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/redis/5.15

Appendix
========
$ cat raw_data_redis.r
v <- c(
    # baseline THP=always sequential
    1.81, 1.81, 1.82, 1.84, 1.84, 1.84, 1.84, 1.85, 1.87, 1.88,
    # baseline THP=always random
    1.66, 1.67, 1.69, 1.69, 1.72, 1.75, 1.75, 1.77, 1.84, 1.88,
    # baseline THP=always Gaussian
    1.69, 1.70, 1.72, 1.76, 1.76, 1.76, 1.76, 1.78, 1.84, 1.94,
    # baseline THP=never sequential
    1.68, 1.68, 1.69, 1.69, 1.69, 1.69, 1.71, 1.72, 1.72, 1.75,
    # baseline THP=never random
    1.45, 1.45, 1.46, 1.47, 1.47, 1.47, 1.50, 1.53, 1.55, 1.58,
    # baseline THP=never Gaussian
    1.59, 1.60, 1.60, 1.60, 1.61, 1.63, 1.65, 1.66, 1.70, 1.71,
    # patched THP=always sequential
    1.79, 1.81, 1.85, 1.88, 1.90, 1.91, 1.96, 1.96, 1.96, 1.98,
    # patched THP=always random
    1.81, 1.86, 1.88, 1.89, 1.91, 1.94, 1.95, 1.96, 1.97, 2.09,
    # patched THP=always Gaussian
    1.95, 1.95, 1.98, 2.00, 2.04, 2.05, 2.08, 2.09, 2.12, 2.18,
    # patched THP=never sequential
    1.71, 1.73, 1.73, 1.74, 1.74, 1.74, 1.75, 1.75, 1.77, 1.77,
    # patched THP=never random
    1.65, 1.65, 1.65, 1.67, 1.68, 1.68, 1.69, 1.69, 1.71, 1.72,
    # patched THP=never Gaussian
    1.76, 1.76, 1.78, 1.81, 1.82, 1.83, 1.83, 1.84, 1.87, 1.88
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (thp in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, thp, 1], a[, dist, thp, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("thp%d dist%d: no significance", thp, dist)
        } else {
            s <- sprintf("thp%d dist%d: [%.2f, %.2f]%%", thp, dist, -p[2], -p[1])
        }
        print(s)
    }
}

# THP=always vs THP=never
for (kern in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d dist%d: no significance", kern, dist)
        } else {
            s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%", kern, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_redis.r

        Welch Two Sample t-test

data:  a[, dist, thp, 1] and a[, dist, thp, 2]
t = -2.6773, df = 11.109, p-value = 0.02135
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.10926587 -0.01073413
sample estimates:
mean of x mean of y
     1.84      1.90

[1] "thp1 dist1: [0.58, 5.94]%"

        Welch Two Sample t-test

data:  a[, dist, thp, 1] and a[, dist, thp, 2]
t = -5.5311, df = 17.957, p-value = 3.011e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2539026 -0.1140974
sample estimates:
mean of x mean of y
    1.742     1.926

[1] "thp1 dist2: [6.55, 14.58]%"

        Welch Two Sample t-test

data:  a[, dist, thp, 1] and a[, dist, thp, 2]
t = -8.2093, df = 17.98, p-value = 1.707e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3428716 -0.2031284
sample estimates:
mean of x mean of y
    1.771     2.044

[1] "thp1 dist3: [11.47, 19.36]%"

        Welch Two Sample t-test

data:  a[, dist, thp, 1] and a[, dist, thp, 2]
t = -4.4705, df = 17.276, p-value = 0.0003243
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.06032607 -0.02167393
sample estimates:
mean of x mean of y
    1.702     1.743

[1] "thp2 dist1: [1.27, 3.54]%"

        Welch Two Sample t-test

data:  a[, dist, thp, 1] and a[, dist, thp, 2]
t = -11.366, df = 13.885, p-value = 2.038e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2211244 -0.1508756
sample estimates:
mean of x mean of y
    1.493     1.679

[1] "thp2 dist2: [10.11, 14.81]%"

        Welch Two Sample t-test

data:  a[, dist, thp, 1] and a[, dist, thp, 2]
t = -9.6138, df = 17.962, p-value = 1.663e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2229972 -0.1430028
sample estimates:
mean of x mean of y
    1.635     1.818

[1] "thp2 dist3: [8.75, 13.64]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 13.532, df = 17.988, p-value = 7.194e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1165737 0.1594263
sample estimates:
mean of x mean of y
    1.840     1.702

[1] "kern1 dist1: [-8.66, -6.34]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 9.197, df = 15.127, p-value = 1.386e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1913354 0.3066646
sample estimates:
mean of x mean of y
    1.742     1.493

[1] "kern1 dist2: [-17.60, -10.98]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 5.0552, df = 14.669, p-value = 0.0001523
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.07854452 0.19345548
sample estimates:
mean of x mean of y
    1.771     1.635

[1] "kern1 dist3: [-10.92, -4.44]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 7.1487, df = 10.334, p-value = 2.614e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1082788 0.2057212
sample estimates:
mean of x mean of y
    1.900     1.743

[1] "kern2 dist1: [-10.83, -5.70]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 9.7525, df = 10.871, p-value = 1.042e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1911754 0.3028246
sample estimates:
mean of x mean of y
    1.926     1.679

[1] "kern2 dist2: [-15.72, -9.93]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 8.2831, df = 13.988, p-value = 9.168e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.167476 0.284524
sample estimates:
mean of x mean of y
    2.044     1.818

[1] "kern2 dist3: [-13.92, -8.19]%"

bot@edi.works Dec. 2, 2021, 6:28 a.m. UTC | #2

Kernel / Apache Cassandra benchmark with MGLRU

TLDR
====
With the MGLRU, Apache Cassandra achieved 95% CIs [1.06, 4.10]%,
[1.94, 5.43]% and [4.11, 7.50]% more operations per second (OPS),
respectively, for exponential (distribution) access, random access
and Zipfian access, when swap was off; 95% CIs [0.50, 2.60]%, [6.51,
8.77]% and [3.29, 6.75]% more OPS, respectively, for exponential
access, random access and Zipfian access, when swap was set to
minimum (vm.swappiness=1).

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradation and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

Apache Cassandra is one of the most popular open-source NoSQL
databases. YCSB is the leading open-source NoSQL database
benchmarking software that supports multiple access distributions.
Swap can have a negative effect, as Apache Cassandra cautions "Do
never allow your system to swap" [1].

[1]: https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L394

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.15
* Patched: 5.15 + MGLRU

Swap configurations:
* Off
* Minimum (vm.swappiness=1)

Concurrency: average # of users per CPU
* Medium: 3

Access distributions (2kB objects, 10% update):
* Exponential
* Uniform random
* Zipfian

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~40

Note that Apache Cassandra reached the peak performance for this
benchmark with 2-3 users per CPU, i.e., its performance started
degrading with fewer or more users.

Procedure
=========
The latest MGLRU patchset for the 5.15 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/2

Baseline and patched 5.15 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
ycsb_load.sh
systemctl stop cassandra
e2image <backup /mnt/data>

<for each kernel>
    grub-set-default <baseline, patched>
    <for each swap configuration>
        <swapoff, swapon>
        <for each access distribution>
            <update ycsb_run.sh>
            <for each data point>
                systemctl stop cassandra
                e2image <restore /mnt/data>
                reboot
                ycsb_run.sh
                <collect OPS>

Hardware
========
Memory (GB): 256
CPU (total #): 48
NVMe SSD (GB): 1024

OS
==
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.10
DISTRIB_CODENAME=impish
DISTRIB_DESCRIPTION="Ubuntu 21.10"

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/dev/nvme0n1p3    partition     32970748      0        -2

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /proc/sys/vm/swappiness
1

$ cat /proc/sys/vm/max_map_count
1048575

Apache Cassandra
================
$ nodetool version
ReleaseVersion: 4.0.1

$ cat jvm8-server.options
<existing parameters>

#-XX:+UseParNewGC
#-XX:+UseConcMarkSweepGC
#-XX:+CMSParallelRemarkEnabled
#-XX:SurvivorRatio=8
#-XX:MaxTenuringThreshold=1
#-XX:CMSInitiatingOccupancyFraction=75
#-XX:+UseCMSInitiatingOccupancyOnly
#-XX:CMSWaitDuration=10000
#-XX:+CMSParallelInitialMarkEnabled
#-XX:+CMSEdenChunksRecordAlways
#-XX:+CMSClassUnloadingEnabled

-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:MaxGCPauseMillis=400

<existing parameters>

$ cat cassandra.yaml
<existing parameters>

data_file_directories: /mnt/data/
key_cache_size_in_mb: 5000
file_cache_enabled: true
file_cache_size_in_mb: 10000
buffer_pool_use_heap_if_exhausted: false
memtable_offheap_space_in_mb: 10000
memtable_allocation_type: offheap_buffers

<existing parameters>

YCSB
====
$ git log
commit ce3eb9ce51c84ee9e236998cdd2cefaeb96798a8 (HEAD -> master,
origin/master, origin/HEAD)
Author: Ivan <john.koepi@gmail.com>
Date:   Tue Feb 16 17:38:00 2021 +0200

    [scylla] enable token aware LB by default, improve the docs (#1507)

$ cat ycsb_load.sh
# load objects
cqlsh -e "create keyspace ycsb WITH REPLICATION = {'class' : \
    'SimpleStrategy', 'replication_factor': 1};"
cqlsh -k ycsb -e "create table usertable (y_id varchar primary key, \
    field0 varchar, field1 varchar, field2 varchar, field3 varchar, \
    field4 varchar, field5 varchar ,field6 varchar, field7 varchar, \
    field8 varchar, field9 varchar);"
ycsb load cassandra-cql -s -threads 24 -p hosts=localhost \
    -p workload=site.ycsb.workloads.CoreWorkload -p fieldlength=200 \
    -p recordcount=130000000

$ cat ycsb_run.sh
# run benchmark
ycsb run cassandra-cql -s -threads 144 -p hosts=localhost \
    -p workload=site.ycsb.workloads.CoreWorkload \
    -p recordcount=130000000 -p operationcount=130000000 \
    -p readproportion=0.9 -p updateproportion=0.1 \
    -p maxexecutiontime=1800 \
    -p requestdistribution=<exponential, uniform, zipfian>

Results
=======
Comparing the patched with the baseline kernel, Apache Cassandra
achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]% and [4.11, 7.50]% more
OPS, respectively, for exponential access, random access and Zipfian
access, when swap was off; 95% CIs [0.50, 2.60]%, [6.51, 8.77]% and
[3.29, 6.75]% more OPS, respectively, for exponential access, random
access and Zipfian access, when swap was set to minimum
(vm.swappiness=1).

+--------------------+--------------------+---------------------+
| Mean OPS [95% CI]  | No swap            | Minimum swap        |
+--------------------+--------------------+---------------------+
| Exponential access | 71084.9 / 72917.5  | 71499.6 / 72607.9   |
|                    | [751.42, 2913.77]  | [358.40, 1858.19]   |
+--------------------+--------------------+---------------------+
| Random access      | 47127.2 / 48862.8  | 47585.4 / 51220.1   |
|                    | [912.68, 2558.51]  | [3097.39, 4172.00]  |
+--------------------+--------------------+---------------------+
| Zipfian access     | 70271.5 / 74348.8  | 70698.2 / 74248.3   |
|                    | [2887.20, 5267.39] | [2326.69, 4773.50]  |
+--------------------+--------------------+---------------------+
Table 1. Comparison between the baseline and the patched kernels

Comparing minimum swap with no swap, Apache Cassandra achieved 95%
CIs [4.05, 5.60]% more OPS for random access, when using the patched
kernel. There were no statistically significant changes in OPS under
other conditions.

+--------------------+--------------------+---------------------+
| Mean OPS [95% CI]  | Baseline kernel    |  Patched kernel     |
+--------------------+--------------------+---------------------+
| Exponential access | 71084.9 / 71499.6  | 72917.5 / 72607.9   |
|                    | [-358.97, 1188.37] | [-1376.93, 757.73]  |
+--------------------+--------------------+---------------------+
| Random access      | 47127.2 / 47585.4  | 48862.8 / 51220.1   |
|                    | [-424.55, 1340.95] | [1977.09, 2737.50]  |
+--------------------+--------------------+---------------------+
| Zipfian access     | 70271.5 / 70698.2  | 74348.8 / 74248.3   |
|                    | [-749.39, 1602.79] | [-1337.07, 1136.07] |
+--------------------+--------------------+---------------------+
Table 2. Comparison between no swap and minimum swap

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/cassandra/5.15

Appendix
========
$ cat raw_data_cassandra.r
v <- c(
    # baseline swapoff exp
    69952, 70274, 70286, 70818, 70946, 71202, 71244, 71615, 71787, 72725,
    # baseline swapoff uni
    45309, 46056, 46086, 46188, 47275, 47524, 47797, 48243, 48329, 48465,
    # baseline swapoff zip
    69096, 69194, 69386, 69408, 69412, 70795, 70890, 71170, 71232, 72132,
    # baseline swapon exp
    69836, 70783, 70951, 71188, 71521, 71764, 72035, 72166, 72287, 72465,
    # baseline swapon uni
    46089, 46963, 47308, 47599, 47776, 47822, 47952, 48042, 48092, 48211,
    # baseline swapon zip
    68986, 69279, 69290, 69805, 70146, 70913, 71462, 71978, 72370, 72753,
    # patched swapoff exp
    70701, 71328, 71458, 72846, 72885, 73078, 73702, 74077, 74415, 74685,
    # patched swapoff uni
    48275, 48460, 48735, 48813, 48902, 48969, 48996, 49007, 49213, 49258,
    # patched swapoff zip
    71829, 72909, 73259, 73835, 74200, 74544, 75318, 75514, 76031, 76049,
    # patched swapon exp
    71169, 71968, 72208, 72374, 72401, 72755, 72861, 72942, 73469, 73932,
    # patched swapon uni
    50292, 50529, 50981, 51224, 51414, 51420, 51480, 51608, 51625, 51628,
    # patched swapon zip
    72032, 72325, 73834, 74366, 74482, 74573, 74810, 75044, 75371, 75646
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (swap in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, swap, 1], a[, dist, swap, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("swap%d dist%d: no significance", swap, dist)
        } else {
            s <- sprintf("swap%d dist%d: [%.2f, %.2f]%%", swap, dist, -p[2], -p[1])
        }
        print(s)
    }
}

# swapoff vs swapon
for (kern in 1:2) {
    for (dist in 1:3) {
        r <- t.test(a[, dist, 1, kern], a[, dist, 2, kern])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d dist%d: no significance", kern, dist)
        } else {
            s <- sprintf("kern%d dist%d: [%.2f, %.2f]%%", kern, dist, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_cassandra.r

        Welch Two Sample t-test

data:  a[, dist, swap, 1] and a[, dist, swap, 2]
t = -3.6172, df = 14.793, p-value = 0.002585
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2913.7703  -751.4297
sample estimates:
mean of x mean of y
  71084.9   72917.5

[1] "swap1 dist1: [1.06, 4.10]%"

        Welch Two Sample t-test

data:  a[, dist, swap, 1] and a[, dist, swap, 2]
t = -4.679, df = 10.331, p-value = 0.0007961
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2558.5199  -912.6801
sample estimates:
mean of x mean of y
  47127.2   48862.8

[1] "swap1 dist2: [1.94, 5.43]%"

        Welch Two Sample t-test

data:  a[, dist, swap, 1] and a[, dist, swap, 2]
t = -7.2315, df = 16.902, p-value = 1.452e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5267.396 -2887.204
sample estimates:
mean of x mean of y
  70271.5   74348.8

[1] "swap1 dist3: [4.11, 7.50]%"

        Welch Two Sample t-test

data:  a[, dist, swap, 1] and a[, dist, swap, 2]
t = -3.1057, df = 17.95, p-value = 0.006118
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1858.191  -358.409
sample estimates:
mean of x mean of y
  71499.6   72607.9

[1] "swap2 dist1: [0.50, 2.60]%"

        Welch Two Sample t-test

data:  a[, dist, swap, 1] and a[, dist, swap, 2]
t = -14.307, df = 16.479, p-value = 1.022e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4172.006 -3097.394
sample estimates:
mean of x mean of y
  47585.4   51220.1

[1] "swap2 dist2: [6.51, 8.77]%"

        Welch Two Sample t-test

data:  a[, dist, swap, 1] and a[, dist, swap, 2]
t = -6.1048, df = 17.664, p-value = 9.877e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4773.504 -2326.696
sample estimates:
mean of x mean of y
  70698.2   74248.3

[1] "swap2 dist3: [3.29, 6.75]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -1.1261, df = 17.998, p-value = 0.2749
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1188.3785   358.9785
sample estimates:
mean of x mean of y
  71084.9   71499.6

[1] "kern1 dist1: no significance"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -1.1108, df = 14.338, p-value = 0.2849
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1340.9555   424.5555
sample estimates:
mean of x mean of y
  47127.2   47585.4

[1] "kern1 dist2: no significance"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -0.76534, df = 17.035, p-value = 0.4545
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1602.7926   749.3926
sample estimates:
mean of x mean of y
  70271.5   70698.2

[1] "kern1 dist3: no significance"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 0.62117, df = 14.235, p-value = 0.5443
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -757.7355 1376.9355
sample estimates:
mean of x mean of y
  72917.5   72607.9

[1] "kern2 dist1: no significance"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = -13.18, df = 15.466, p-value = 8.07e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2737.509 -1977.091
sample estimates:
mean of x mean of y
  48862.8   51220.1

[1] "kern2 dist2: [4.05, 5.60]%"

        Welch Two Sample t-test

data:  a[, dist, 1, kern] and a[, dist, 2, kern]
t = 0.17104, df = 17.575, p-value = 0.8661
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1136.076  1337.076
sample estimates:
mean of x mean of y
  74348.8   74248.3

[1] "kern2 dist3: no significance"

bot@edi.works Dec. 9, 2021, 7:24 a.m. UTC | #3

Kernel / Apache Hadoop benchmark with MGLRU

TLDR
====
With the MGLRU, Apache Hadoop took 95% CIs [5.31, 9.69]% and [2.02,
7.86]% less wall time to finish TeraSort, respectively, under the
medium- and the high-concurrency conditions, when swap was on. There
were no statistically significant changes in wall time for the rest
of the test matrix.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradation and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

Apache Hadoop is one of the most popular open-source big-data
frameworks. TeraSort is the most widely used benchmark for Apache
Hadoop.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.15
* Patched: 5.15 + MGLRU

Swap configurations:
* Off
* On

Concurrency conditions: average # of tasks per CPU
* Low: 1/2
* Medium: 1
* High: 2

Cluster mode: local (12 concurrent jobs)
Dataset size: 100 million records from TeraGen

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~20

Procedure
=========
The latest MGLRU patchset for the 5.15 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/2

Baseline and patched 5.15 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
teragen 100000000 /mnt/data/raw
e2image <backup /mnt/data>

<for each kernel>
    grub-set-default <baseline, patched>
    <for each swap configuration>
        <swapoff, swapon>
        <update run_terasort.sh>
        <for each concurrency condition>
            <update run_terasort.sh>
            <for each data point>
                e2image <restore /mnt/data>
                reboot
                run_terasort.sh
                <collect wall time>

Hardware
========
Memory (GB): 256
CPU (total #): 48
NVMe SSD (GB): 2048

OS
==
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.10
DISTRIB_CODENAME=impish
DISTRIB_DESCRIPTION="Ubuntu 21.10"

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/swap.img         partition     67108860      0        -2

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /proc/sys/vm/swappiness
1

Apache Hadoop
=============
$ hadoop version
Hadoop 3.3.1
Source code repository https://github.com/apache/hadoop.git -r
a3b9c37a397ad4188041dd80621bdeefc46885f2
Compiled by ubuntu on 2021-06-15T05:13Z
Compiled with protoc 3.7.1
From source with checksum 88a4ddb2299aca054416d6b7f81ca55
This command was run using
/root/hadoop-3.3.1/share/hadoop/common/hadoop-common-3.3.1.jar

$ cat run_terasort.sh
export HADOOP_ROOT_LOGGER="WARN,DRFA"
export HADOOP_HEAPSIZE_MAX=<swapoff: 20G, swapon: 22G>

for ((i = 0; i < 12; i++))
do
    /usr/bin/time -f "%e" hadoop jar \
        hadoop-mapreduce-examples-3.3.1.jar terasort \
        -Dfile.stream-buffer-size=8388608 \
        -Dio.file.buffer.size=8388608 \
        -Dmapreduce.job.heap.memory-mb.ratio=1.0 \
        -Dmapreduce.reduce.input.buffer.percent=1.0 \
        -Dmapreduce.reduce.merge.inmem.threshold=0 \
        -Dmapreduce.task.io.sort.factor=100 \
        -Dmapreduce.task.io.sort.mb=1000 \
        -Dmapreduce.terasort.final.sync=false \
        -Dmapreduce.terasort.num.partitions=100 \
        -Dmapreduce.terasort.partitions.sample=1000000 \
        -Dmapreduce.local.map.tasks.maximum=<2, 4, 8> \
        -Dmapreduce.local.reduce.tasks.maximum=<2, 4, 8> \
        -Dmapreduce.reduce.shuffle.parallelcopies=<2, 4, 8> \
        -Dhadoop.tmp.dir=/mnt/data/tmp$i \
        /mnt/data/raw /mnt/data/sorted$i
done

wait

Results
=======
Comparing the patched with the baseline kernel, Apache Hadoop took
95% CIs [5.31, 9.69]% and [2.02, 7.86]% less wall time to finish
TeraSort, respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically significant
changes in wall time for the rest of the test matrix.

+--------------------+------------------+------------------+
| Mean wall time (s) | Swap off         | Swap on          |
| [95% CI]           |                  |                  |
+--------------------+------------------+------------------+
| Low concurrency    | 758.43 / 746.83  | 740.78 / 733.42  |
|                    | [-26.80, 3.60]   | [-18.07, 3.35]   |
+--------------------+------------------+------------------+
| Medium concurrency | 911.81 / 910.19  | 911.53 / 843.15  |
|                    | [-26.70, 23.46]  | [-88.35, -48.39] |
+--------------------+------------------+------------------+
| High concurrency   | 921.17 / 929.51  | 1042.85 / 991.33 |
|                    | [-25.50, 42.18]  | [-81.94, -21.08] |
+--------------------+------------------+------------------+
Table 1. Comparison between the baseline and the patched kernels

Comparing swap on with swap off, Apache Hadoop took 95% CIs [-3.39,
-1.27]% and [10.69, 15.73]% more wall time to finish TeraSort,
respectively, under the low- and the high-concurrency conditions,
when using the baseline kernel; 95% CIs [-9.34, -5.39]% and [2.52,
10.78]% more wall time, respectively, under the medium- and the
high-concurrency conditions, when using the patched kernel. There
were no statistically significant changes in wall time for the rest
of the test matrix.

+--------------------+------------------+------------------+
| Mean wall time (s) | Baseline kernel  | Patched kernel   |
| [95% CI]           |                  |                  |
+--------------------+------------------+------------------+
| Low concurrency    | 758.43 / 740.78  | 746.83 / 733.42  |
|                    | [-25.67, -9.64]  | [-29.80, 2.97]   |
+--------------------+------------------+------------------+
| Medium concurrency | 911.81 / 911.53  | 910.19 / 843.15  |
|                    | [-26.62, 26.06]  | [-84.98, -49.09] |
+--------------------+------------------+------------------+
| High concurrency   | 921.17 / 1042.85 | 929.51 / 991.33  |
|                    | [98.51, 144.85]  | [23.43, 100.21]  |
+--------------------+------------------+------------------+
Table 2. Comparison between swap off and on

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/hadoop/5.15

Appendix
========
$ cat raw_data_hadoop.r
v <- c(
    # baseline swapoff 2mr
    742.83, 751.91, 755.75, 757.50, 757.83, 758.16, 758.25, 763.58, 766.58, 772.00,
    # baseline swapoff 4mr
    863.25, 868.08, 886.58, 894.66, 901.16, 918.25, 940.91, 944.08, 949.66, 951.50,
    # baseline swapoff 8mr
    892.16, 895.75, 909.25, 922.58, 922.91, 922.91, 923.16, 926.00, 935.33, 961.66,
    # baseline swapon 2mr
    731.58, 732.08, 736.66, 737.75, 738.00, 738.08, 740.08, 740.33, 752.58, 760.66,
    # baseline swapon 4mr
    878.83, 886.33, 902.75, 904.83, 907.25, 918.50, 921.33, 925.50, 927.58, 942.41,
    # baseline swapon 8mr
    1016.58, 1017.33, 1019.33, 1019.50, 1026.08, 1030.50, 1065.16, 1070.50, 1075.25, 1088.33,
    # patched swapoff 2mr
    720.41, 724.58, 727.41, 732.00, 745.41, 748.00, 754.50, 767.91, 773.16, 775.00,
    # patched swapoff 4mr
    887.16, 887.50, 906.66, 907.41, 915.00, 915.58, 915.66, 916.91, 925.00, 925.08,
    # patched swapoff 8mr
    857.08, 864.41, 910.25, 918.58, 921.91, 933.75, 949.50, 966.75, 984.00, 988.91,
    # patched swapon 2mr
    719.33, 721.91, 724.41, 724.83, 725.75, 728.75, 737.83, 743.91, 749.41, 758.08,
    # patched swapon 4mr
    813.33, 819.00, 821.91, 829.33, 839.50, 846.75, 850.25, 857.00, 875.83, 878.66,
    # patched swapon 8mr
    929.41, 955.83, 961.16, 974.66, 988.75, 1004.00, 1009.08, 1019.91, 1030.58, 1040.00
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (swap in 1:2) {
    for (mr in 1:3) {
        r <- t.test(a[, mr, swap, 1], a[, mr, swap, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("swap%d mr%d: no significance", swap, mr)
        } else {
            s <- sprintf("swap%d mr%d: [%.2f, %.2f]%%", swap, mr, -p[2], -p[1])
        }
        print(s)
    }
}

# swapoff vs swapon
for (kern in 1:2) {
    for (mr in 1:3) {
        r <- t.test(a[, mr, 1, kern], a[, mr, 2, kern])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d mr%d: no significance", kern, mr)
        } else {
            s <- sprintf("kern%d mr%d: [%.2f, %.2f]%%", kern, mr, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_hadoop.r

        Welch Two Sample t-test

data:  a[, mr, swap, 1] and a[, mr, swap, 2]
t = 1.6677, df = 11.658, p-value = 0.122
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.604753 26.806753
sample estimates:
mean of x mean of y
  758.439   746.838

[1] "swap1 mr1: no significance"

        Welch Two Sample t-test

data:  a[, mr, swap, 1] and a[, mr, swap, 2]
t = 0.14071, df = 11.797, p-value = 0.8905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -23.4695  26.7035
sample estimates:
mean of x mean of y
  911.813   910.196

[1] "swap1 mr2: no significance"

        Welch Two Sample t-test

data:  a[, mr, swap, 1] and a[, mr, swap, 2]
t = -0.53558, df = 12.32, p-value = 0.6018
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -42.18602  25.50002
sample estimates:
mean of x mean of y
  921.171   929.514

[1] "swap1 mr3: no significance"

        Welch Two Sample t-test

data:  a[, mr, swap, 1] and a[, mr, swap, 2]
t = 1.4568, df = 15.95, p-value = 0.1646
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.352318 18.070318
sample estimates:
mean of x mean of y
  740.780   733.421

[1] "swap2 mr1: no significance"

        Welch Two Sample t-test

data:  a[, mr, swap, 1] and a[, mr, swap, 2]
t = 7.204, df = 17.538, p-value = 1.229e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 48.39677 88.35323
sample estimates:
mean of x mean of y
  911.531   843.156

[1] "swap2 mr2: [-9.69, -5.31]%"

        Welch Two Sample t-test

data:  a[, mr, swap, 1] and a[, mr, swap, 2]
t = 3.5698, df = 17.125, p-value = 0.002336
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 21.08655 81.94945
sample estimates:
mean of x mean of y
 1042.856   991.338

[1] "swap2 mr3: [-7.86, -2.02]%"

        Welch Two Sample t-test

data:  a[, mr, 1, kern] and a[, mr, 2, kern]
t = 4.6319, df = 17.718, p-value = 0.0002153
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  9.640197 25.677803
sample estimates:
mean of x mean of y
  758.439   740.780

[1] "kern1 mr1: [-3.39, -1.27]%"

        Welch Two Sample t-test

data:  a[, mr, 1, kern] and a[, mr, 2, kern]
t = 0.0229, df = 14.372, p-value = 0.982
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -26.06533  26.62933
sample estimates:
mean of x mean of y
  911.813   911.531

[1] "kern1 mr2: no significance"

        Welch Two Sample t-test

data:  a[, mr, 1, kern] and a[, mr, 2, kern]
t = -11.129, df = 16.051, p-value = 5.874e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -144.8574  -98.5126
sample estimates:
mean of x mean of y
  921.171  1042.856

[1] "kern1 mr3: [10.69, 15.73]%"

        Welch Two Sample t-test

data:  a[, mr, 1, kern] and a[, mr, 2, kern]
t = 1.7413, df = 15.343, p-value = 0.1016
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.974529 29.808529
sample estimates:
mean of x mean of y
  746.838   733.421

[1] "kern2 mr1: no significance"

        Welch Two Sample t-test

data:  a[, mr, 1, kern] and a[, mr, 2, kern]
t = 7.9839, df = 14.571, p-value = 1.073e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 49.09637 84.98363
sample estimates:
mean of x mean of y
  910.196   843.156

[1] "kern2 mr2: [-9.34, -5.39]%"

        Welch Two Sample t-test

data:  a[, mr, 1, kern] and a[, mr, 2, kern]
t = -3.3962, df = 17.1, p-value = 0.003413
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -100.21425  -23.43375
sample estimates:
mean of x mean of y
  929.514   991.338

[1] "kern2 mr3: [2.52, 10.78]%"

bot@edi.works Dec. 18, 2021, 7:10 a.m. UTC | #4

Kernel / PostgreSQL benchmark with MGLRU

TLDR
====
With the MGLRU, PostgreSQL achieved 95% CI [1.75, 6.42]% more
transactions per minute (TPM) under the high-concurrency conditions,
when swap was off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more
TPM, respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically significant
changes in TPM for the rest of the test matrix.

Background
==========
Memory overcommit can increase utilization and, if carried out
properly, can also increase throughput. The challenges are to improve
working set estimation and to optimize page reclaim. The risks are
performance degradation and OOM kills. Short of overcoming the
challenges, the only way to reduce the risks is to underutilize
memory.

PostgreSQL is one of the most popular open-source RDBMSs. HammerDB is
the leading open-source benchmarking software derived from the TPC
specifications. OLTP is the most important use case for RDBMSs.

Matrix
======
Kernels: version [+ patchset]
* Baseline: 5.15
* Patched: 5.15 + MGLRU

Swap configurations:
* Off
* On (vm.swappiness=1)

Concurrency conditions: average # of users per CPU
* Low: ~4
* Medium: ~8
* High: ~12

Total configurations: 12
Data points per configuration: 10
Total run duration (minutes) per data point: ~50

Procedure
=========
The latest MGLRU patchset for the 5.15 kernel is available at
git fetch https://linux-mm.googlesource.com/page-reclaim \
    refs/changes/30/1430/2

The baseline and the patched 5.15 kernel images are available at
https://drive.google.com/drive/folders/1eMkQleAFGkP2vzM_JyRA21oKE0ESHBqp

<install and configure OS>
hammerdbcli auto prep_tpcc.tcl
systemctl stop postgresql
e2image <backup /mnt/data>

<for each kernel>
    grub-set-default <baseline, patched>
    <for each swap configuration>
        <swapoff, swapon>
        <update /etc/postgresql/13/main/postgresql.conf>
        <for each concurrency condition>
            <update run_tpcc.tcl>
            <for each data point>
                systemctl stop postgresql
                e2image <restore /mnt/data>
                reboot
                hammerdbcli auto run_tpcc.tcl
                <collect TPM>

Hardware
========
Memory (GB): 256
CPU (total #): 48
NVMe SSD (GB): 2048

OS
==
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.10
DISTRIB_CODENAME=impish
DISTRIB_DESCRIPTION="Ubuntu 21.10"

$ cat /proc/swaps
Filename          Type          Size          Used     Priority
/swap.img         partition     67108860      0        -2

$ cat /proc/sys/vm/overcommit_memory
1

$ cat /proc/sys/vm/swappiness
1

$ cat /sys/fs/cgroup/user.slice/memory.min
4294967296

PostgreSQL
==========
$ pg_config --version
PostgreSQL 13.5 (Ubuntu 13.5-0ubuntu0.21.10.1)

$ cat /etc/postgresql/13/main/postgresql.conf
<existing parameters>

data_directory = '/mnt/data'
max_connections = 1000
shared_buffers = <swapoff: 120GB, swapon: 150GB>
temp_buffers = 1GB
work_mem = 1GB
maintenance_work_mem = 1GB
logical_decoding_work_mem = 1GB
bgwriter_delay = 1000ms
bgwriter_lru_maxpages = 100000
bgwriter_lru_multiplier = 1.0
bgwriter_flush_after = 0
effective_io_concurrency = 1000
wal_compression = on
wal_writer_delay = 1000ms
wal_writer_flush_after = 128MB
max_wal_size = 100GB
min_wal_size = 10GB
checkpoint_flush_after = 0
effective_cache_size = 150GB

<existing parameters>

HammerDB
========
$ hammerdbcli -h
HammerDB CLI v4.3
Copyright (C) 2003-2021 Steve Shaw
Type "help" for a list of commands
Usage: hammerdbcli [ auto [ script_to_autoload.tcl  ] ]

$ cat prep_tpcc.tcl
dbset db pg
diset connection pg_host /var/run/postgresql
diset tpcc pg_count_ware 2400
diset tpcc pg_num_vu 48
buildschema
waittocomplete
quit

$ cat run_tpcc.tcl
dbset db pg
diset connection pg_host /var/run/postgresql
diset tpcc pg_total_iterations 20000000
diset tpcc pg_driver timed
diset tpcc pg_rampup 30
diset tpcc pg_duration 10
diset tpcc pg_allwarehouse true
vuset logtotemp 1
vuset unique 1
loadscript
vuset vu <200, 400, 600>
vucreate
vurun
runtimer 3000
vudestroy

Results
=======
Comparing the patched with the baseline kernel, PostgreSQL achieved
95% CI [1.75, 6.42]% more TPM under the high-concurrency conditions,
when swap was off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more
TPM, respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically significant
changes in TPM for the rest of the test matrix.

+--------------------+--------------------+--------------------+
| Mean TPM [95% CI]  | Swap off           | Swap on            |
+--------------------+--------------------+--------------------+
| Low concurrency    | 466430 / 467521    | 475060 / 475047    |
|                    | [-6931, 9112]      | [-7431, 7405]      |
+--------------------+--------------------+--------------------+
| Medium concurrency | 453871 / 459592    | 388245 / 449409    |
|                    | [-774, 12216]      | [49755, 72572]     |
+--------------------+--------------------+--------------------+
| High concurrency   | 443014 / 461112    | 157106 / 211752    |
|                    | [7771, 28423]      | [35664, 73627]     |
+--------------------+--------------------+--------------------+
Table 1. Comparison between the baseline and the patched kernels

Comparing swap on with swap off, PostgreSQL achieved 95% CIs [0.46,
3.24]%, [-16.91, -12.01]% and [-68.64, -60.43]% more TPM,
respectively, under the low-, the medium- and the high-concurrency
conditions, when using the baseline kernel; 95% CIs [-3.76, -0.67]%
and [-56.70, -51.46]% more TPM, respectively, under the medium- and
the high-concurrency conditions, when using the patched kernel. There
were no statistically significant changes in TPM for the rest of the
test matrix.

+--------------------+--------------------+--------------------+
| Mean TPM [95% CI]  | Baseline kernel    | Patched kernel     |
+--------------------+--------------------+--------------------+
| Low concurrency    | 466430 / 475060    | 467521 / 475047    |
|                    | [2160, 15100]      | [-1204, 16256]     |
+--------------------+--------------------+--------------------+
| Medium concurrency | 453871 / 388245    | 459592 / 449409    |
|                    | [-76757, -54494]   | [-17292, -3073]    |
+--------------------+--------------------+--------------------+
| High concurrency   | 443014 / 157106    | 461112 / 211752    |
|                    | [-304097, -267718] | [-261442, -237275] |
+--------------------+--------------------+--------------------+
Table 2. Comparison between swap off and swap on

Metrics collected during each run are available at
https://github.com/ediworks/KernelPerf/tree/master/mglru/postgres/5.15

Appendix
========
$ cat raw_data_postgres.r
v <- c(
    # baseline swapoff 200vu
    462379, 462998, 463363, 464949, 465605, 466977, 467290, 468658, 469682, 472404,
    # baseline swapoff 400vu
    446111, 446305, 447339, 448043, 450604, 452160, 453846, 461309, 465101, 467893,
    # baseline swapoff 600vu
    434061, 435645, 435974, 436026, 436581, 439138, 442121, 445990, 454687, 469926,
    # baseline swapon 200vu
    466546, 467298, 467882, 469185, 472114, 473868, 475217, 481319, 483246, 493931,
    # baseline swapon 400vu
    367605, 371855, 373991, 380763, 388456, 389768, 395270, 403536, 404457, 406749,
    # baseline swapon 600vu
    123036, 127174, 131863, 150724, 155572, 158938, 170892, 179302, 183783, 189785,
    # patched swapoff 200vu
    456088, 457197, 457341, 458069, 459630, 472291, 474782, 475727, 478015, 486071,
    # patched swapoff 400vu
    452681, 453758, 455800, 457675, 458812, 459304, 460897, 461252, 465269, 470475,
    # patched swapoff 600vu
    448009, 452465, 453655, 454333, 456111, 456304, 465371, 471431, 475092, 478351,
    # patched swapon 200vu
    465540, 468681, 471682, 473134, 473148, 474015, 475734, 476691, 481974, 489873,
    # patched swapon 400vu
    436300, 440202, 441955, 445214, 445817, 452176, 452379, 456931, 457724, 465393,
    # patched swapon 600vu
    195315, 197186, 199332, 199667, 209630, 211162, 214787, 222783, 230000, 237667
)

a <- array(v, dim = c(10, 3, 2, 2))

# baseline vs patched
for (swap in 1:2) {
    for (vu in 1:3) {
        r <- t.test(a[, vu, swap, 1], a[, vu, swap, 2])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("swap%d vu%d: no significance", swap, vu)
        } else {
            s <- sprintf("swap%d vu%d: [%.2f, %.2f]%%", swap, vu, -p[2], -p[1])
        }
        print(s)
    }
}

# swapoff vs swapon
for (kern in 1:2) {
    for (vu in 1:3) {
        r <- t.test(a[, vu, 1, kern], a[, vu, 2, kern])
        print(r)

        p <- r$conf.int * 100 / r$estimate[1]
        if ((p[1] > 0 && p[2] < 0) || (p[1] < 0 && p[2] > 0)) {
            s <- sprintf("kern%d vu%d: no significance", kern, vu)
        } else {
            s <- sprintf("kern%d vu%d: [%.2f, %.2f]%%", kern, vu, -p[2], -p[1])
        }
        print(s)
    }
}

$ R -q -s -f raw_data_postgres.r

        Welch Two Sample t-test

data:  a[, vu, swap, 1] and a[, vu, swap, 2]
t = -0.3009, df = 10.521, p-value = 0.7694
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9112.559  6931.359
sample estimates:
mean of x mean of y
 466430.5  467521.1

[1] "swap1 vu1: no significance"

        Welch Two Sample t-test

data:  a[, vu, swap, 1] and a[, vu, swap, 2]
t = -1.8711, df = 15.599, p-value = 0.08021
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12216.64    774.24
sample estimates:
mean of x mean of y
 453871.1  459592.3

[1] "swap1 vu2: no significance"

        Welch Two Sample t-test

data:  a[, vu, swap, 1] and a[, vu, swap, 2]
t = -3.6832, df = 17.919, p-value = 0.001712
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -28423.515  -7771.085
sample estimates:
mean of x mean of y
 443014.9  461112.2

[1] "swap1 vu3: [1.75, 6.42]%"

        Welch Two Sample t-test

data:  a[, vu, swap, 1] and a[, vu, swap, 2]
t = 0.0038109, df = 17.001, p-value = 0.997
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7405.094  7431.894
sample estimates:
mean of x mean of y
 475060.6  475047.2

[1] "swap2 vu1: no significance"

        Welch Two Sample t-test

data:  a[, vu, swap, 1] and a[, vu, swap, 2]
t = -11.413, df = 15.222, p-value = 7.301e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -72572.5 -49755.7
sample estimates:
mean of x mean of y
 388245.0  449409.1

[1] "swap2 vu2: [12.82, 18.69]%"

        Welch Two Sample t-test

data:  a[, vu, swap, 1] and a[, vu, swap, 2]
t = -6.1414, df = 14.853, p-value = 1.97e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -73627.83 -35664.17
sample estimates:
mean of x mean of y
 157106.9  211752.9

[1] "swap2 vu3: [22.70, 46.86]%"

        Welch Two Sample t-test

data:  a[, vu, 1, kern] and a[, vu, 2, kern]
t = -2.9241, df = 11.372, p-value = 0.0134
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -15100.107  -2160.093
sample estimates:
mean of x mean of y
 466430.5  475060.6

[1] "kern1 vu1: [0.46, 3.24]%"

        Welch Two Sample t-test

data:  a[, vu, 1, kern] and a[, vu, 2, kern]
t = 12.629, df = 14.192, p-value = 4.129e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 54494.92 76757.28
sample estimates:
mean of x mean of y
 453871.1  388245.0

[1] "kern1 vu2: [-16.91, -12.01]%"

        Welch Two Sample t-test

data:  a[, vu, 1, kern] and a[, vu, 2, kern]
t = 34.005, df = 12.822, p-value = 5.981e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 267718.5 304097.5
sample estimates:
mean of x mean of y
 443014.9  157106.9

[1] "kern1 vu3: [-68.64, -60.43]%"

        Welch Two Sample t-test

data:  a[, vu, 1, kern] and a[, vu, 2, kern]
t = -1.8367, df = 15.057, p-value = 0.08607
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -16256.986   1204.786
sample estimates:
mean of x mean of y
 467521.1  475047.2

[1] "kern2 vu1: no significance"

        Welch Two Sample t-test

data:  a[, vu, 1, kern] and a[, vu, 2, kern]
t = 3.061, df = 14.554, p-value = 0.008153
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3073.49 17292.91
sample estimates:
mean of x mean of y
 459592.3  449409.1

[1] "kern2 vu2: [-3.76, -0.67]%"

        Welch Two Sample t-test

data:  a[, vu, 1, kern] and a[, vu, 2, kern]
t = 43.656, df = 16.424, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 237275.9 261442.7
sample estimates:
mean of x mean of y
 461112.2  211752.9

[1] "kern2 vu3: [-56.70, -51.46]%"