vfs: keep inodes with page cache off the inode shrinker LRU

From: Johannes Weiner <hannes@cmpxchg.org>

The VFS inode shrinker is currently allowed to reclaim inodes with
populated page cache. As a result it can drop gigabytes of hot and
active page cache on the floor without consulting the VM (recorded as
"inodesteal" events in /proc/vmstat).

This causes real problems in practice. Consider for example how the VM
would cache a source tree, such as the Linux git tree. As large parts
of the checked out files and the object database are accessed
repeatedly, the page cache holding this data gets moved to the active
list, where it's fully (and indefinitely) insulated from one-off cache
moving through the inactive list.

However, due to the way users interact with the tree, no ongoing open
file descriptors into the source tree are maintained, and the inodes
end up on the "unused inode" shrinker LRU. A larger burst of one-off
cache (find, updatedb, etc.) can now drive the VFS shrinkers to drop
first the dentries and then the inodes - inodes that contain the most
valuable data currently held by the page cache - while there is plenty
of one-off cache that could be reclaimed instead.

This doesn't make sense. The inodes aren't really "unused" as long as
the VM deems it worthwhile to hold on to their page cache. And the
shrinker can't possibly guess what is and isn't valuable to the VM
based on recent inode reference information alone (we could delete
several thousand lines of reclaim code if it could).

History

This behavior of invalidating page cache from the inode shrinker goes
back to even before the git import of the kernel tree. It may have
been less noticeable when the VM itself didn't have real workingset
protection, and floods of one-off cache would push out any active
cache over time anyway. But the VM has come a long way since then and
the inode shrinker is now actively subverting its caching strategy.

As this keeps causing problems for people, there have been several
attempts to address this.

One recent attempt was to make the inode shrinker simply skip over
inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
inodes with many attached pages").

However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
don't reclaim inodes with many attached pages"") because it caused
severe reclaim performance problems: Inodes that sit on the shrinker
LRU are attracting reclaim pressure away from the page cache and
toward the VFS. If we then permanently exempt sizable portions of this
pool from actually getting reclaimed when looked at, this pressure
accumulates as deferred shrinker work (a mechanism for *temporarily*
unreclaimable objects) until it causes mayhem in the VFS cache pools.

In the bug quoted in 69056ee6a8a3 in particular, the excessive
pressure drove the XFS shrinker into dirty objects, where it caused
synchronous, IO-bound stalls, even as there was plenty of clean page
cache that should have been reclaimed instead.

Another variant of this problem was recently observed, where the
kernel violates cgroups' memory.low protection settings and reclaims
page cache way beyond the configured thresholds. It was followed by a
proposal of a modified form of the reverted commit above, which
implements memory.low-sensitive shrinker skipping over populated
inodes on the LRU [1]. However, this proposal continues to run the
risk of attracting disproportionate reclaim pressure to a pool of
still-used inodes, while not addressing the more generic reclaim
inversion problem outside of a very specific cgroup application.

[1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/

Solution

To fix the reclaim inversion described in the beginning, without
reintroducing the problems associated with shrinker LRU rotations,
this patch keeps populated inodes off the LRUs entirely.

Currently, inodes are kept off the shrinker LRU as long as they have
an elevated i_count, indicating an active user. Unfortunately, the
page cache cannot simply hold an i_count reference, because unlink()
*should* result in the inode being dropped and its cache invalidated.

Instead, this patch makes iput_final() consult the state of the page
cache and punt the LRU linking to the VM if the inode is still
populated; the VM in turn checks the inode state when it depopulates
the page cache, and adds the inode to the LRU if necessary.
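
As a rough illustration of this handoff, here is a minimal user-space
sketch. It is a toy model, not the kernel code: struct toy_inode,
toy_iput() and toy_remove_page() are hypothetical names, and locking
is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model; all names here are illustrative assumptions. */
struct toy_inode {
	int  refcount;	/* stands in for i_count */
	int  nr_pages;	/* stands in for the populated page cache */
	bool on_lru;	/* linked on the "unused inode" shrinker LRU? */
};

/* Drop a reference; on the last one, link only inodes without cache. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->refcount > 0);
	if (--inode->refcount > 0)
		return;
	if (inode->nr_pages == 0)
		inode->on_lru = true;
	/* Otherwise punt: the VM side links the inode later. */
}

/* The VM drops one page; on the last one it re-checks the inode. */
static void toy_remove_page(struct toy_inode *inode)
{
	assert(inode->nr_pages > 0);
	if (--inode->nr_pages == 0 && inode->refcount == 0)
		inode->on_lru = true;
}
```

In this model an inode with refcount 1 and two pages stays off the LRU
after toy_iput() and is linked only once the second toy_remove_page()
call empties its cache.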

This is not unlike what we do for dirty inodes, which are moved off
the LRU permanently until writeback completion puts them back on (iff
still unused). We can reuse the same code, inode_add_lru(), here.

This is also not unlike page reclaim, where the lower VM layer has to
negotiate state with the higher VFS layer. Follow existing precedent
and handle the inversion as much as possible on the VM side:

- introduce an I_PAGES flag that the VM maintains under the i_lock, so
  that any inode code holding that lock can check the page cache state
  without having to lock and inspect the struct address_space

- introduce inode_pages_set() and inode_pages_clear() to maintain the
  inode LRU state from the VM side, then update all cache mutators to
  use them when populating the first cache entry or clearing the last
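
The two points above might be modeled in user space roughly as
follows. This is a sketch under stated assumptions, not the patch
itself: TOY_I_PAGES, struct toy_inode and the toy_* helpers are
hypothetical stand-ins, a pthread mutex plays the role of i_lock, and
the rescued field only mimics the idea of a rescue counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define TOY_I_PAGES 0x1u	/* hypothetical bit, mirrors the I_PAGES idea */

struct toy_inode {
	pthread_mutex_t lock;	/* stands in for i_lock */
	unsigned int state;	/* flag word protected by the lock */
	int refcount;		/* stands in for i_count */
	bool on_lru;		/* linked on the shrinker LRU? */
	unsigned long rescued;	/* times new cache pulled us off the LRU */
};

static void toy_inode_init(struct toy_inode *inode, int refcount)
{
	pthread_mutex_init(&inode->lock, NULL);
	inode->state = 0;
	inode->refcount = refcount;
	inode->on_lru = false;
	inode->rescued = 0;
}

/* VM side: called when the first cache entry is populated. */
static void toy_inode_pages_set(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->lock);
	assert(!(inode->state & TOY_I_PAGES));
	inode->state |= TOY_I_PAGES;
	if (inode->on_lru) {
		inode->on_lru = false;
		inode->rescued++;
	}
	pthread_mutex_unlock(&inode->lock);
}

/* VM side: called when the last cache entry is cleared. */
static void toy_inode_pages_clear(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->lock);
	assert(inode->state & TOY_I_PAGES);
	inode->state &= ~TOY_I_PAGES;
	if (inode->refcount == 0)
		inode->on_lru = true;
	pthread_mutex_unlock(&inode->lock);
}

/* Inode side: check the cache state without touching the mapping. */
static bool toy_inode_lru_eligible(struct toy_inode *inode)
{
	bool ok;

	pthread_mutex_lock(&inode->lock);
	ok = inode->refcount == 0 && !(inode->state & TOY_I_PAGES);
	pthread_mutex_unlock(&inode->lock);
	return ok;
}
```

The point of the flag in this model is that toy_inode_lru_eligible()
needs only the inode lock; it never has to lock or walk a separate
mapping structure to learn whether cache is present.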

With this, the concept of "inodesteal" - where the inode shrinker
drops page cache - is a thing of the past. The VM is in charge of the
page cache, the inode shrinker is in charge of freeing struct inode.

Footnotes

- For debuggability, add vmstat counters that track the number of
  times a new cache entry pulls a previously unused inode off the LRU
  (pginoderescue), as well as how many times existing cache deferred
  an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
  for backwards compatibility, but they'll just show 0 now.

- Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
  cache. Not doing so has always been a bit strange, but since most
  people drop page cache and metadata cache together, the inode
  shrinker would have taken care of them before - no more, so do it
  VM-side.
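
To watch the counters from the first footnote, something like the
following user-space helper could be used. It is only a sketch:
vmstat_lookup() is a hypothetical name, and it assumes the usual
"name value" line format of /proc/vmstat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" pair in a vmstat-formatted stream.
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
static int vmstat_lookup(FILE *fp, const char *name,
			 unsigned long long *value)
{
	char key[64];
	unsigned long long v;

	rewind(fp);
	while (fscanf(fp, "%63s %llu", key, &v) == 2) {
		if (strcmp(key, name) == 0) {
			*value = v;
			return 0;
		}
	}
	return -1;
}
```

A caller would fopen("/proc/vmstat", "r") and look up "pginodesteal",
"kswapd_inodesteal" or "pginoderescue"; the last one is only present
on a kernel carrying this patch, so a -1 return is expected elsewhere.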

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c                |   2 +-
 fs/dax.c                      |  14 +++++
 fs/drop_caches.c              |   2 +-
 fs/inode.c                    | 106 ++++++++++++++++++++++++++--------
 fs/internal.h                 |   2 +-
 include/linux/fs.h            |  12 ++++
 include/linux/pagemap.h       |   2 +-
 include/linux/vm_event_item.h |   3 +-
 mm/filemap.c                  |  39 ++++++++++---
 mm/huge_memory.c              |   3 +-
 mm/truncate.c                 |  34 ++++++++---
 mm/vmscan.c                   |   6 +-
 mm/vmstat.c                   |   4 +-
 mm/workingset.c               |   4 ++
 14 files changed, 183 insertions(+), 50 deletions(-)

Message-Id: <20200211175507.178100-1-hannes@cmpxchg.org>
Date: Tue, 11 Feb 2020 12:55:07 -0500
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Cc: Dave Chinner <david@fromorbit.com>, Yafang Shao <laoar.shao@gmail.com>,
    Michal Hocko <mhocko@suse.com>, Roman Gushchin <guro@fb.com>,
    Andrew Morton <akpm@linux-foundation.org>,
    Linus Torvalds <torvalds@linux-foundation.org>,
    Al Viro <viro@zeniv.linux.org.uk>, kernel-team@fb.com