[00/11] mm: fix page aging across multiple cgroups

Message ID: 20190603210746.15800-1-hannes@cmpxchg.org

Message

Johannes Weiner June 3, 2019, 9:07 p.m. UTC
When applications are put into unconfigured cgroups for memory
accounting purposes, the cgrouping itself should not change the
behavior of the page reclaim code. We expect the VM to reclaim the
coldest pages in the system. But right now the VM can reclaim hot
pages in one cgroup while there is eligible cold cache in others.

This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing. That is the part
that is supposed to protect hot cache data from one-off streaming IO.

The recursive cgroup reclaim scheme will scan and rotate the physical
LRU lists of each eligible cgroup at the same rate in a round-robin
fashion, thereby establishing a relative order among the pages of all
those cgroups. However, the inactive/active balancing decisions are
made locally within each cgroup, so when a cgroup is running low on
cold pages, its hot pages will get reclaimed - even when sibling
cgroups have plenty of cold cache eligible in the same reclaim run.

For example:

   [root@ham ~]# head -n1 /proc/meminfo 
   MemTotal:        1016336 kB

   [root@ham ~]# ./reclaimtest2.sh 
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.269s
   user    0m0.051s
   sys     0m4.182s
   Hot pages cached: 134/12800 workingset-a

The streaming IO in B, which doesn't benefit from caching at all,
pushes out most of the workingset in A.

Solution

This series fixes the problem by elevating inactive/active balancing
decisions to the toplevel of the reclaim run. This is either a cgroup
that hit its limit, or straight-up global reclaim if there is physical
memory pressure. From there, it takes a recursive view of the cgroup
subtree to decide whether page deactivation is necessary.

In the test above, the VM will then recognize that cgroup B has plenty
of eligible cold cache, and that the hot pages in A can be spared:

   [root@ham ~]# ./reclaimtest2.sh 
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.244s
   user    0m0.064s
   sys     0m4.177s
   Hot pages cached: 12800/12800 workingset-a

Implementation

Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to
the active list, and the occurrence of refaults.
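
As a rough way to watch these two signals from userspace, not part of
the series itself: the per-cgroup file LRU sizes and the refault
counters can be read directly, assuming the /cgroup mount point and
the workingset-a group from the test script shared later in this
thread (exact counter names differ between kernel versions and between
cgroup v1 and v2):

   # inactive/active file LRU sizes of the cgroup
   grep active_file /cgroup/workingset-a/memory.stat
   # refault activity, here the system-wide view
   grep workingset /proc/vmstat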

After some cleanups and preparations, this patch series first moves
refault detection to the reclaim root, then enforces the minimum
inactive size based on a recursive view of the cgroup tree's LRUs.

History

Note that this actually never worked correctly in Linux cgroups. In
the past it worked for global reclaim and leaf limit reclaim only (we
used to have two physical LRU linkages per page), but it never worked
for intermediate limit reclaim over multiple leaf cgroups.

We're noticing this now because 1) we're putting everything into
cgroups for accounting, not just the things we want to control, and 2)
we're moving away from leaf limits that invoke reclaim on individual
cgroups, toward large tree reclaim that is triggered by high-level
limits or physical memory pressure and shaped by local protections
such as memory.low and memory.min instead.

Requirements

These changes are based on the fast recursive memcg stats merged in
5.2-rc1. The patches are against v5.2-rc2-mmots-2019-05-29-20-56-12
plus the page cache fix in https://lkml.org/lkml/2019/5/24/813.

 include/linux/memcontrol.h |  37 +--
 include/linux/mmzone.h     |  30 +-
 include/linux/swap.h       |   2 +-
 mm/memcontrol.c            |   6 +-
 mm/page_alloc.c            |   2 +-
 mm/vmscan.c                | 667 ++++++++++++++++++++++---------------------
 mm/workingset.c            |  74 +++--
 7 files changed, 437 insertions(+), 381 deletions(-)

Comments

Shakeel Butt Nov. 7, 2019, 2:50 a.m. UTC | #1
On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> When applications are put into unconfigured cgroups for memory
> accounting purposes, the cgrouping itself should not change the
> behavior of the page reclaim code. We expect the VM to reclaim the
> coldest pages in the system. But right now the VM can reclaim hot
> pages in one cgroup while there is eligible cold cache in others.
>
> This is because one part of the reclaim algorithm isn't truly cgroup
> hierarchy aware: the inactive/active list balancing. That is the part
> that is supposed to protect hot cache data from one-off streaming IO.
>
> The recursive cgroup reclaim scheme will scan and rotate the physical
> LRU lists of each eligible cgroup at the same rate in a round-robin
> fashion, thereby establishing a relative order among the pages of all
> those cgroups. However, the inactive/active balancing decisions are
> made locally within each cgroup, so when a cgroup is running low on
> cold pages, its hot pages will get reclaimed - even when sibling
> cgroups have plenty of cold cache eligible in the same reclaim run.
>
> For example:
>
>    [root@ham ~]# head -n1 /proc/meminfo
>    MemTotal:        1016336 kB
>
>    [root@ham ~]# ./reclaimtest2.sh
>    Establishing 50M active files in cgroup A...
>    Hot pages cached: 12800/12800 workingset-a
>    Linearly scanning through 18G of file data in cgroup B:
>    real    0m4.269s
>    user    0m0.051s
>    sys     0m4.182s
>    Hot pages cached: 134/12800 workingset-a
>

Can you share reclaimtest2.sh as well? Maybe a selftest to
monitor/test future changes.


Johannes Weiner Nov. 7, 2019, 5:45 p.m. UTC | #2
On Wed, Nov 06, 2019 at 06:50:25PM -0800, Shakeel Butt wrote:
> On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [...]
> 
> Can you share reclaimtest2.sh as well? Maybe a selftest to
> monitor/test future changes.

I wish it were more portable, but it really only does what it says in
the log output, in a pretty hacky way, with all parameters hard-coded
to my test environment:

---

#!/bin/bash

# this should protect workingset-a from workingset-b

set -e
#set -x

echo Establishing 50M active files in cgroup A...
rmdir /cgroup/workingset-a 2>/dev/null || true
mkdir /cgroup/workingset-a
echo $$ > /cgroup/workingset-a/cgroup.procs
rm -f workingset-a
dd of=workingset-a bs=1M count=0 seek=50 2>/dev/null >/dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
echo -n "Hot pages cached: "
./mincore workingset-a

echo -n Linearly scanning through 2G of file data in cgroup B:
rmdir /cgroup/workingset-b 2>/dev/null || true
mkdir /cgroup/workingset-b
echo $$ > /cgroup/workingset-b/cgroup.procs
rm -f workingset-b
dd of=workingset-b bs=1M count=0 seek=2048 2>/dev/null >/dev/null
time (
  cat workingset-b > /dev/null
  cat workingset-b > /dev/null
  cat workingset-b > /dev/null
  cat workingset-b > /dev/null
  cat workingset-b > /dev/null
  cat workingset-b > /dev/null
  cat workingset-b > /dev/null
  cat workingset-b > /dev/null )
echo -n "Hot pages cached: "
./mincore workingset-a
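
A note on ./mincore: it is a small local helper that is not shared
here; judging by the "Hot pages cached: X/Y" lines it reports how many
of the file's pages are resident in the page cache, presumably via
mincore(2). A rough stand-in is fincore(1) from util-linux, which
prints a file's resident page count and size (just not in the same
X/Y format):

   # approximate the ./mincore check; prints RES/PAGES/SIZE/FILE columns
   fincore workingset-a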