Message ID: 20190603210746.15800-1-hannes@cmpxchg.org
Series:     mm: fix page aging across multiple cgroups
On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> When applications are put into unconfigured cgroups for memory
> accounting purposes, the cgrouping itself should not change the
> behavior of the page reclaim code. We expect the VM to reclaim the
> coldest pages in the system. But right now the VM can reclaim hot
> pages in one cgroup while there is eligible cold cache in others.
>
> This is because one part of the reclaim algorithm isn't truly cgroup
> hierarchy aware: the inactive/active list balancing. That is the part
> that is supposed to protect hot cache data from one-off streaming IO.
>
> The recursive cgroup reclaim scheme will scan and rotate the physical
> LRU lists of each eligible cgroup at the same rate in a round-robin
> fashion, thereby establishing a relative order among the pages of all
> those cgroups. However, the inactive/active balancing decisions are
> made locally within each cgroup, so when a cgroup is running low on
> cold pages, its hot pages will get reclaimed - even when sibling
> cgroups have plenty of cold cache eligible in the same reclaim run.
>
> For example:
>
> [root@ham ~]# head -n1 /proc/meminfo
> MemTotal: 1016336 kB
>
> [root@ham ~]# ./reclaimtest2.sh
> Establishing 50M active files in cgroup A...
> Hot pages cached: 12800/12800 workingset-a
> Linearly scanning through 18G of file data in cgroup B:
>
> real 0m4.269s
> user 0m0.051s
> sys 0m4.182s
> Hot pages cached: 134/12800 workingset-a
>

Can you share reclaimtest2.sh as well? Maybe a selftest to monitor/test
future changes.

> The streaming IO in B, which doesn't benefit from caching at all,
> pushes out most of the workingset in A.
>
> Solution
>
> This series fixes the problem by elevating inactive/active balancing
> decisions to the toplevel of the reclaim run. This is either a cgroup
> that hit its limit, or straight-up global reclaim if there is physical
> memory pressure. From there, it takes a recursive view of the cgroup
> subtree to decide whether page deactivation is necessary.
>
> In the test above, the VM will then recognize that cgroup B has plenty
> of eligible cold cache, and that the hot pages in A can be spared:
>
> [root@ham ~]# ./reclaimtest2.sh
> Establishing 50M active files in cgroup A...
> Hot pages cached: 12800/12800 workingset-a
> Linearly scanning through 18G of file data in cgroup B:
>
> real 0m4.244s
> user 0m0.064s
> sys 0m4.177s
> Hot pages cached: 12800/12800 workingset-a
>
> Implementation
>
> Whether active pages can be deactivated or not is influenced by two
> factors: the inactive list dropping below a minimum size relative to
> the active list, and the occurrence of refaults.
>
> After some cleanups and preparations, this patch series first moves
> refault detection to the reclaim root, then enforces the minimum
> inactive size based on a recursive view of the cgroup tree's LRUs.
>
> History
>
> Note that this actually never worked correctly in Linux cgroups. In
> the past it worked for global reclaim and leaf limit reclaim only (we
> used to have two physical LRU linkages per page), but it never worked
> for intermediate limit reclaim over multiple leaf cgroups.
>
> We're noticing this now because 1) we're putting everything into
> cgroups for accounting, not just the things we want to control and 2)
> we're moving away from leaf limits that invoke reclaim on individual
> cgroups, toward large tree reclaim, triggered by high-level limits or
> physical memory pressure, that is influenced by local protections such
> as memory.low and memory.min instead.
>
> Requirements
>
> These changes are based on the fast recursive memcg stats merged in
> 5.2-rc1. The patches are against v5.2-rc2-mmots-2019-05-29-20-56-12
> plus the page cache fix in https://lkml.org/lkml/2019/5/24/813.
>
>  include/linux/memcontrol.h |  37 +--
>  include/linux/mmzone.h     |  30 +-
>  include/linux/swap.h       |   2 +-
>  mm/memcontrol.c            |   6 +-
>  mm/page_alloc.c            |   2 +-
>  mm/vmscan.c                | 667 ++++++++++++++++++++++---------------------
>  mm/workingset.c            |  74 +++--
>  7 files changed, 437 insertions(+), 381 deletions(-)
>
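The inactive/active balancing described under "Implementation" above can be
illustrated with a small toy model. The C sketch below is not the mm/vmscan.c
code: all struct and function names are made up for illustration, and the real
kernel check uses a memory-size-dependent inactive:active ratio rather than
the plain 1:1 comparison used here. It only contrasts a decision made per
cgroup with one made over the whole subtree of the reclaim root:

#include <stddef.h>
#include <stdbool.h>

/* Toy model only; names and layout are illustrative, not kernel code. */
struct toy_cgroup {
	unsigned long inactive_file;	/* pages on the inactive file LRU */
	unsigned long active_file;	/* pages on the active file LRU */
	struct toy_cgroup *children;
	size_t nr_children;
};

/*
 * Old behavior: each cgroup looks only at its own two lists, so a cgroup
 * that has run out of cold pages deactivates (and eventually reclaims) its
 * own hot pages even if a sibling still has plenty of cold cache.
 */
bool inactive_is_low_local(const struct toy_cgroup *cg)
{
	return cg->inactive_file < cg->active_file;
}

/*
 * New behavior as described in the cover letter: sum the lists over the
 * entire subtree of the reclaim root and make one decision for all of it.
 */
void subtree_lru_sizes(const struct toy_cgroup *cg,
		       unsigned long *inactive, unsigned long *active)
{
	*inactive += cg->inactive_file;
	*active += cg->active_file;
	for (size_t i = 0; i < cg->nr_children; i++)
		subtree_lru_sizes(&cg->children[i], inactive, active);
}

bool inactive_is_low_recursive(const struct toy_cgroup *root)
{
	unsigned long inactive = 0, active = 0;

	subtree_lru_sizes(root, &inactive, &active);
	return inactive < active;
}

With the subtree view, a cgroup whose own inactive list has run dry no longer
forces deactivation of its hot pages as long as other cgroups under the same
reclaim root still hold plenty of cold cache.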
On Wed, Nov 06, 2019 at 06:50:25PM -0800, Shakeel Butt wrote:
> On Mon, Jun 3, 2019 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > When applications are put into unconfigured cgroups for memory
> > accounting purposes, the cgrouping itself should not change the
> > behavior of the page reclaim code. We expect the VM to reclaim the
> > coldest pages in the system. But right now the VM can reclaim hot
> > pages in one cgroup while there is eligible cold cache in others.
> >
> > This is because one part of the reclaim algorithm isn't truly cgroup
> > hierarchy aware: the inactive/active list balancing. That is the part
> > that is supposed to protect hot cache data from one-off streaming IO.
> >
> > The recursive cgroup reclaim scheme will scan and rotate the physical
> > LRU lists of each eligible cgroup at the same rate in a round-robin
> > fashion, thereby establishing a relative order among the pages of all
> > those cgroups. However, the inactive/active balancing decisions are
> > made locally within each cgroup, so when a cgroup is running low on
> > cold pages, its hot pages will get reclaimed - even when sibling
> > cgroups have plenty of cold cache eligible in the same reclaim run.
> >
> > For example:
> >
> > [root@ham ~]# head -n1 /proc/meminfo
> > MemTotal: 1016336 kB
> >
> > [root@ham ~]# ./reclaimtest2.sh
> > Establishing 50M active files in cgroup A...
> > Hot pages cached: 12800/12800 workingset-a
> > Linearly scanning through 18G of file data in cgroup B:
> >
> > real 0m4.269s
> > user 0m0.051s
> > sys 0m4.182s
> > Hot pages cached: 134/12800 workingset-a
> >
>
> Can you share reclaimtest2.sh as well? Maybe a selftest to
> monitor/test future changes.

I wish it were more portable, but it really only does what it says in
the log output, in a pretty hacky way, with all parameters hard-coded
to my test environment:

---
#!/bin/bash

# This should protect workingset-a from workingset-b.

set -e
#set -x

echo "Establishing 50M active files in cgroup A..."

rmdir /cgroup/workingset-a 2>/dev/null || true
mkdir /cgroup/workingset-a
echo $$ > /cgroup/workingset-a/cgroup.procs

rm -f workingset-a
dd of=workingset-a bs=1M count=0 seek=50 2>/dev/null >/dev/null

cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null
cat workingset-a > /dev/null

echo -n "Hot pages cached: "
./mincore workingset-a

echo -n "Linearly scanning through 2G of file data in cgroup B: "

rmdir /cgroup/workingset-b 2>/dev/null || true
mkdir /cgroup/workingset-b
echo $$ > /cgroup/workingset-b/cgroup.procs

rm -f workingset-b
dd of=workingset-b bs=1M count=0 seek=2048 2>/dev/null >/dev/null

time (
	cat workingset-b > /dev/null
	cat workingset-b > /dev/null
	cat workingset-b > /dev/null
	cat workingset-b > /dev/null
	cat workingset-b > /dev/null
	cat workingset-b > /dev/null
	cat workingset-b > /dev/null
	cat workingset-b > /dev/null
)

echo -n "Hot pages cached: "
./mincore workingset-a
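The ./mincore helper invoked by the script is not included in the thread. As
a stand-in (an assumption about what the original tool does, not a copy of
it), a small C program can mmap the file and count resident pages with
mincore(2), printing the same "resident/total filename" style of output seen
in the log above:

/* Hypothetical stand-in for the ./mincore helper used by reclaimtest2.sh. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	if (fstat(fd, &st) < 0) {
		perror("fstat");
		return 1;
	}

	long page_size = sysconf(_SC_PAGESIZE);
	size_t pages = (st.st_size + page_size - 1) / page_size;

	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* One byte per page; bit 0 set means the page is resident in memory. */
	unsigned char *vec = malloc(pages);
	if (!vec || mincore(map, st.st_size, vec) < 0) {
		perror("mincore");
		return 1;
	}

	size_t resident = 0;
	for (size_t i = 0; i < pages; i++)
		if (vec[i] & 1)
			resident++;

	/* e.g. "12800/12800 workingset-a", matching the log output above. */
	printf("%zu/%zu %s\n", resident, pages, argv[1]);
	return 0;
}

Built with something like "cc -O2 -o mincore mincore.c" and placed next to
the script, such a tool would report 12800/12800 for the fully cached 50M
file (12800 pages of 4k).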