[v2] vfs: keep inodes with page cache off the inode shrinker LRU

From: Johannes Weiner <hannes@cmpxchg.org>

Changes since v1:
- retain inode-driven cache reclaim for CONFIG_HIGHMEM
- fix a compile bug in fs/dax.c (mapping->inode -> mapping->host)

I'm going to start wider-scale testing with this updated version.

---
From 6b76483fd64fd47e04eddbb2025e3d6a3b4aaec0 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 4 Feb 2020 18:12:48 -0500
Subject: [PATCH v2] vfs: keep inodes with page cache off the inode shrinker LRU

The VFS inode shrinker is currently allowed to reclaim inodes with
populated page cache. As a result it can drop gigabytes of hot and
active page cache on the floor without consulting the VM.

The reason for this goes back to highmem: page cache in the highmem
zones can pin struct inode objects in small lowmem zones and get the
whole system into trouble. As a result, the inode shrinker zaps the
page cache to free up the lowmem struct inodes.

Details: https://marc.info/?l=git-commits-head&m=103646757213266&w=2

But the cost of doing this isn't justifiable on the !CONFIG_HIGHMEM
systems that are more prevalent nowadays.

Consider for example how the VM would cache a source tree, such as the
Linux git tree. As large parts of the checked out files and the object
database are accessed repeatedly, the page cache holding this data
gets moved to the active list, where it's fully (and indefinitely)
insulated from one-off cache moving through the inactive list. But due
to the way users interact with the tree, no ongoing open file
descriptors into the source tree are maintained, and the inodes end up
on the shrinker LRU. A larger burst of one-off cache (find, updatedb,
etc.) can now drive the shrinkers to drop first the dentries and then
the inodes - inodes that contain the most valuable data currently held
by the page cache - while there is plenty of one-off cache that could
be reclaimed instead.

This may have been less of a concern when the VM itself didn't have
real workingset protection, and one-off cache would push out active
cache over time anyway. But we've come a long way since, and the inode
shrinker is now actively in conflict with the VM's caching strategy.

Previous proposals

As this keeps causing problems for people, there have been several
attempts to address this.

One recent attempt was to make the inode shrinker simply skip over
inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
inodes with many attached pages").

However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
don't reclaim inodes with many attached pages"") because it caused
excessive pressure buildup on the VFS objects: inodes that sit on the
shrinker LRU are attracting reclaim pressure away from the page cache
and toward the VFS. If we then permanently exempt sizable portions of
this pool from actually getting reclaimed when looked at, this
pressure accumulates as deferred work (a mechanism for *temporarily*
unreclaimable objects) until it causes mayhem in the VFS cache pools.
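
The accumulation described above can be illustrated with a toy model.
This is only a sketch under stated assumptions, not the kernel's
shrinker code: struct toy_lru, toy_scan_round() and the divisor
parameter are hypothetical names. It assumes that scan pressure is
proportional to everything sitting on the LRU, while only part of the
LRU can actually be freed.

```c
#include <assert.h>

/* Hypothetical toy model of an LRU with permanently skipped objects. */
struct toy_lru {
	unsigned long reclaimable;	/* objects the shrinker may free */
	unsigned long exempt;		/* objects it always skips */
	unsigned long deferred;		/* unmet scan work carried over */
};

/*
 * One reclaim round. The scan target is a share of the whole LRU
 * (exempt objects included) plus any backlog from earlier rounds.
 * Whatever cannot be satisfied is deferred to the next round.
 * Returns the number of objects freed in this round.
 */
unsigned long toy_scan_round(struct toy_lru *lru, unsigned long divisor)
{
	unsigned long on_lru, target, freed;

	assert(divisor > 0);

	on_lru = lru->reclaimable + lru->exempt;
	target = on_lru / divisor + lru->deferred;
	freed = target < lru->reclaimable ? target : lru->reclaimable;

	lru->reclaimable -= freed;
	lru->deferred = target - freed;
	return freed;
}
```

With 100 reclaimable and 900 exempt objects and a divisor of 4, the
backlog in this model grows to 150, 375 and then 600 over three
rounds, even though nothing is left to free after the first one.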

In the bug quoted in 69056ee6a8a3 in particular, the excessive
pressure drove the XFS shrinker into dirty objects, where it caused
synchronous, IO-bound stalls, even as there was plenty of clean page
cache that should have been reclaimed instead.

Another variant of this problem was recently observed, where the
kernel violates cgroups' memory.low protection settings and reclaims
page cache way beyond the configured thresholds. It was followed by a
proposal of a modified form of the reverted commit above, which
implements memory.low-sensitive shrinker skipping over populated
inodes on the LRU [1]. However, this proposal continues to run the
risk of attracting disproportionate reclaim pressure to a pool of
still-used inodes, while not addressing the more generic reclaim
inversion problem outside of a very specific cgroup application.

[1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/

Solution

To fix the reclaim inversion in the shrinker, without reintroducing
the problems associated with shrinker LRU rotations, this patch keeps
populated inodes off the LRUs entirely on !CONFIG_HIGHMEM systems.

Currently, inodes are kept off the shrinker LRU as long as they have
an elevated i_count, indicating an active user. Unfortunately, the
page cache cannot simply hold an i_count reference, because unlink()
*should* result in the inode being dropped and its cache invalidated.

Instead, this patch makes iput_final() consult the state of the page
cache and punt the LRU linking to the VM if the inode is still
populated; the VM in turn checks the inode state when it depopulates
the page cache, and adds the inode to the LRU if necessary.
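
The handoff between the two layers can be sketched in user space as
follows. This is a simplified illustration, not the patch itself:
struct toy_inode, toy_iput() and toy_cache_emptied() are hypothetical
stand-ins, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for struct inode. */
struct toy_inode {
	int  i_count;	/* active references */
	bool has_pages;	/* page cache still populated? */
	bool on_lru;	/* linked on the shrinker LRU? */
};

/*
 * VFS side: drop a reference. When the last one goes away, link the
 * inode to the LRU only if its page cache is already empty. Otherwise
 * leave the linking to the VM.
 */
void toy_iput(struct toy_inode *inode)
{
	assert(inode->i_count > 0);

	if (--inode->i_count > 0)
		return;
	if (inode->has_pages)
		return;		/* punt: the VM links the inode later */
	inode->on_lru = true;
}

/*
 * VM side: the last cache entry is gone. If nobody is using the
 * inode anymore, it becomes a candidate for the shrinker now.
 */
void toy_cache_emptied(struct toy_inode *inode)
{
	inode->has_pages = false;
	if (inode->i_count == 0)
		inode->on_lru = true;
}
```

In this model an unused but populated inode stays off the LRU until
toy_cache_emptied() runs, which is the ordering the paragraph above
describes.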

This is not unlike what we do for dirty inodes, which are moved off
the LRU permanently until writeback completion puts them back on (iff
still unused). We can reuse the same code - inode_add_lru() - here.

This is also not unlike page reclaim, where the lower VM layer has to
negotiate state with the higher VFS layer. Follow existing precedent
and handle the inversion as much as possible on the VM side:

- introduce an I_PAGES flag that the VM maintains under the i_lock, so
  that any inode code holding that lock can check the page cache state
  without having to lock and inspect the struct address_space

- introduce inode_pages_set() and inode_pages_clear() to maintain the
  inode LRU state from the VM side, then update all cache mutators to
  use them when populating the first cache entry or clearing the last
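
As a rough user-space illustration of the two bullets above, the
sketch below shows cache mutators calling set/clear helpers only on
the empty-to-populated and populated-to-empty transitions. All names
(struct toy_mapping, TOY_I_PAGES and its bit value, the toy_*
functions) are hypothetical, and the i_lock that the real flag is
maintained under is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_I_PAGES 0x1u	/* hypothetical bit value */

/* Hypothetical merge of the inode and address_space state we need. */
struct toy_mapping {
	unsigned long nr_entries;	/* cache entries currently held */
	unsigned int  i_state;		/* carries TOY_I_PAGES */
	int           i_count;		/* active inode references */
	bool          on_lru;		/* linked on the shrinker LRU? */
};

/* First entry added: flag the inode and take it off the LRU. */
static void toy_inode_pages_set(struct toy_mapping *m)
{
	m->i_state |= TOY_I_PAGES;
	m->on_lru = false;
}

/* Last entry removed: unflag the inode, add it to the LRU if unused. */
static void toy_inode_pages_clear(struct toy_mapping *m)
{
	m->i_state &= ~TOY_I_PAGES;
	if (m->i_count == 0)
		m->on_lru = true;
}

/* Cache mutators only call the helpers on the 0 <-> 1 transitions. */
void toy_cache_add(struct toy_mapping *m)
{
	if (m->nr_entries++ == 0)
		toy_inode_pages_set(m);
}

void toy_cache_remove(struct toy_mapping *m)
{
	assert(m->nr_entries > 0);

	if (--m->nr_entries == 0)
		toy_inode_pages_clear(m);
}
```

Inode-side code in this model only needs to test TOY_I_PAGES in
i_state. It never has to look at nr_entries, which mirrors the
motivation given for the flag.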

With this, the concept of "inodesteal" - where the inode shrinker
drops page cache - is a thing of the past on !CONFIG_HIGHMEM systems.
The VM is in charge of the page cache, the inode shrinker is in charge
of freeing struct inode.

Footnotes

- For debuggability, add vmstat counters that track the number of
  times a new cache entry pulls a previously unused inode off the LRU
  (pginoderescue), as well as how many times existing cache deferred
  an LRU addition.

- Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
  cache. Not doing so has always been a bit strange, but since most
  people drop cache and metadata cache together, the inode shrinker
  would have taken care of them before - no more, so do it VM-side.
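
For anyone who wants to watch the counters from the first footnote
during testing, a small helper along these lines may be useful. It is
only a sketch: vmstat_lookup() is a hypothetical name, and it assumes
the counter shows up in /proc/vmstat under the name given above
(pginoderescue) using the usual "name value" line format.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text laid out like /proc/vmstat, that is one
 * "name value" pair per line. Returns 0 and stores the value on
 * success, or -1 if the name is absent or its value does not parse.
 */
int vmstat_lookup(const char *text, const char *name,
		  unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Require a full-name match followed by a space. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%llu", value) == 1 ? 0 : -1;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

To use it on a live system, read /proc/vmstat into a NUL-terminated
buffer and pass that buffer as text. A return value of -1 then simply
means the running kernel does not export the counter.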

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c                |   2 +-
 fs/dax.c                      |  14 +++++
 fs/drop_caches.c              |   2 +-
 fs/inode.c                    | 112 +++++++++++++++++++++++++++++++---
 fs/internal.h                 |   2 +-
 include/linux/fs.h            |  17 ++++++
 include/linux/pagemap.h       |   2 +-
 include/linux/vm_event_item.h |   3 +-
 mm/filemap.c                  |  39 +++++++++---
 mm/huge_memory.c              |   3 +-
 mm/truncate.c                 |  34 ++++++++---
 mm/vmscan.c                   |   6 +-
 mm/vmstat.c                   |   4 +-
 mm/workingset.c               |   4 ++
 14 files changed, 209 insertions(+), 35 deletions(-)

Message-ID: <20200213183459.GB216470@cmpxchg.org>
In-Reply-To: <20200211175507.178100-1-hannes@cmpxchg.org>
Date: Thu, 13 Feb 2020 13:34:59 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>, Yafang Shao <laoar.shao@gmail.com>,
    Michal Hocko <mhocko@suse.com>, Roman Gushchin <guro@fb.com>,
    Andrew Morton <akpm@linux-foundation.org>,
    Linus Torvalds <torvalds@linux-foundation.org>,
    Al Viro <viro@zeniv.linux.org.uk>, kernel-team@fb.com
