
vfs: keep inodes with page cache off the inode shrinker LRU

Message ID 20200211175507.178100-1-hannes@cmpxchg.org (mailing list archive)
State New, archived
Series vfs: keep inodes with page cache off the inode shrinker LRU

Commit Message

Johannes Weiner Feb. 11, 2020, 5:55 p.m. UTC
The VFS inode shrinker is currently allowed to reclaim inodes with
populated page cache. As a result it can drop gigabytes of hot and
active page cache on the floor without consulting the VM (recorded as
"inodesteal" events in /proc/vmstat).

This causes real problems in practice. Consider for example how the VM
would cache a source tree, such as the Linux git tree. As large parts
of the checked out files and the object database are accessed
repeatedly, the page cache holding this data gets moved to the active
list, where it's fully (and indefinitely) insulated from one-off cache
moving through the inactive list.

However, due to the way users interact with the tree, no ongoing open
file descriptors into the source tree are maintained, and the inodes
end up on the "unused inode" shrinker LRU. A larger burst of one-off
cache (find, updatedb, etc.) can now drive the VFS shrinkers to drop
first the dentries and then the inodes - inodes that contain the most
valuable data currently held by the page cache - while there is plenty
of one-off cache that could be reclaimed instead.

This doesn't make sense. The inodes aren't really "unused" as long as
the VM deems it worthwhile to hold on to their page cache. And the
shrinker can't possibly guess what is and isn't valuable to the VM
based on recent inode reference information alone (we could delete
several thousand lines of reclaim code if it could).

History

This behavior of invalidating page cache from the inode shrinker goes
back to even before the git import of the kernel tree. It may have
been less noticeable when the VM itself didn't have real workingset
protection, and floods of one-off cache would push out any active
cache over time anyway. But the VM has come a long way since then and
the inode shrinker is now actively subverting its caching strategy.

As this keeps causing problems for people, there have been several
attempts to address this.

One recent attempt was to make the inode shrinker simply skip over
inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
inodes with many attached pages").

However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
don't reclaim inodes with many attached pages"") because it caused
severe reclaim performance problems: Inodes that sit on the shrinker
LRU are attracting reclaim pressure away from the page cache and
toward the VFS. If we then permanently exempt sizable portions of this
pool from actually getting reclaimed when looked at, this pressure
accumulates as deferred shrinker work (a mechanism for *temporarily*
unreclaimable objects) until it causes mayhem in the VFS cache pools.

In the bug quoted in 69056ee6a8a3 in particular, the excessive
pressure drove the XFS shrinker into dirty objects, where it caused
synchronous, IO-bound stalls, even as there was plenty of clean page
cache that should have been reclaimed instead.

Another variant of this problem was recently observed, where the
kernel violates cgroups' memory.low protection settings and reclaims
page cache way beyond the configured thresholds. It was followed by a
proposal of a modified form of the reverted commit above, that
implements memory.low-sensitive shrinker skipping over populated
inodes on the LRU [1]. However, this proposal continues to run the
risk of attracting disproportionate reclaim pressure to a pool of
still-used inodes, while not addressing the more generic reclaim
inversion problem outside of a very specific cgroup application.

[1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/

Solution

To fix the reclaim inversion described in the beginning, without
reintroducing the problems associated with shrinker LRU rotations,
this patch keeps populated inodes off the LRUs entirely.

Currently, inodes are kept off the shrinker LRU as long as they have
an elevated i_count, indicating an active user. Unfortunately, the
page cache cannot simply hold an i_count reference, because unlink()
*should* result in the inode being dropped and its cache invalidated.

Instead, this patch makes iput_final() consult the state of the page
cache and punt the LRU linking to the VM if the inode is still
populated; the VM in turn checks the inode state when it depopulates
the page cache, and adds the inode to the LRU if necessary.
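This handoff can be sketched as a toy model in plain C. Only the names I_PAGES, inode_add_lru(), and inode_pages_clear() follow the patch; struct inode_model and its fields are illustrative stand-ins for the kernel structures, not real definitions:

```c
#include <assert.h>

/* Toy model of the iput_final() handoff: populated inodes stay off
 * the shrinker LRU, and the VM links them once the cache goes away.
 * struct inode_model is an illustrative stand-in, not the kernel's
 * struct inode. */

enum { I_PAGES = 1 << 0, I_LRU = 1 << 1 };

struct inode_model {
	int i_count;       /* active references */
	unsigned flags;    /* I_PAGES / I_LRU state bits */
};

/* Link on the shrinker LRU only if truly unused: no references
 * and no page cache. */
static void inode_add_lru(struct inode_model *inode)
{
	if (inode->i_count == 0 && !(inode->flags & I_PAGES))
		inode->flags |= I_LRU;
}

/* Last reference dropped: punt the LRU linking to the VM if the
 * page cache is still populated. */
static void iput_final(struct inode_model *inode)
{
	inode->i_count = 0;
	inode_add_lru(inode);  /* no-op while I_PAGES is set */
}

/* VM side: the last page cache entry was removed from the mapping. */
static void inode_pages_clear(struct inode_model *inode)
{
	inode->flags &= ~I_PAGES;
	inode_add_lru(inode);  /* now eligible, if still unused */
}
```

The point of the model is the ordering: an unused-but-populated inode is invisible to the shrinker until the VM itself decides the cache is gone.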

This is not unlike what we do for dirty inodes, which are moved off
the LRU permanently until writeback completion puts them back on (iff
still unused). We can reuse the same code -- inode_add_lru() -- here.

This is also not unlike page reclaim, where the lower VM layer has to
negotiate state with the higher VFS layer. Follow existing precedent
and handle the inversion as much as possible on the VM side:

- introduce an I_PAGES flag that the VM maintains under the i_lock, so
  that any inode code holding that lock can check the page cache state
  without having to lock and inspect the struct address_space

- introduce inode_pages_set() and inode_pages_clear() to maintain the
  inode LRU state from the VM side, then update all cache mutators to
  use them when populating the first cache entry or clearing the last
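The mutator-side bookkeeping can be sketched as follows; the hooks fire only on the empty-to-populated and populated-to-empty transitions. struct mapping_model and add_entry()/delete_entry() are hypothetical stand-ins for struct address_space and the real cache mutators the patch updates:

```c
#include <assert.h>

/* Sketch of the cache-mutator bookkeeping: inode_pages_set() runs
 * when the first entry is inserted into an empty mapping, and
 * inode_pages_clear() when the last one is removed. All structures
 * here are simplified stand-ins for illustration. */

struct mapping_model {
	unsigned long nrpages;        /* regular page cache entries */
	unsigned long nrexceptional;  /* shadow/DAX entries */
	int host_has_pages;           /* mirrors I_PAGES on the host inode */
};

static int mapping_empty(const struct mapping_model *m)
{
	return m->nrpages == 0 && m->nrexceptional == 0;
}

static void add_entry(struct mapping_model *m)
{
	if (mapping_empty(m))
		m->host_has_pages = 1;  /* inode_pages_set(): first entry */
	m->nrpages++;
}

static void delete_entry(struct mapping_model *m)
{
	m->nrpages--;
	if (mapping_empty(m))
		m->host_has_pages = 0;  /* inode_pages_clear(): last entry */
}
```

Because only the boundary transitions matter, steady-state cache insertions and removals stay on the fast path with no extra work beyond the existing counter updates.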

With this, the concept of "inodesteal" - where the inode shrinker
drops page cache - is a thing of the past. The VM is in charge of the
page cache, the inode shrinker is in charge of freeing struct inode.

Footnotes

- For debuggability, add vmstat counters that track the number of
  times a new cache entry pulls a previously unused inode off the LRU
  (pginoderescue), as well as how many times existing cache deferred
  an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
  for backwards compatibility, but they'll just show 0 now.

- Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
  cache. Not doing so has always been a bit strange, but since most
  people drop cache and metadata cache together, the inode shrinker
  would have taken care of them before - no more, so do it VM-side.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c                |   2 +-
 fs/dax.c                      |  14 +++++
 fs/drop_caches.c              |   2 +-
 fs/inode.c                    | 106 ++++++++++++++++++++++++++--------
 fs/internal.h                 |   2 +-
 include/linux/fs.h            |  12 ++++
 include/linux/pagemap.h       |   2 +-
 include/linux/vm_event_item.h |   3 +-
 mm/filemap.c                  |  39 ++++++++++---
 mm/huge_memory.c              |   3 +-
 mm/truncate.c                 |  34 ++++++++---
 mm/vmscan.c                   |   6 +-
 mm/vmstat.c                   |   4 +-
 mm/workingset.c               |   4 ++
 14 files changed, 183 insertions(+), 50 deletions(-)

Comments

Johannes Weiner Feb. 11, 2020, 6:20 p.m. UTC | #1
On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
> don't reclaim inodes with many attached pages"") because it caused
> severe reclaim performance problems: Inodes that sit on the shrinker
> LRU are attracting reclaim pressure away from the page cache and
> toward the VFS. If we then permanently exempt sizable portions of this
> pool from actually getting reclaimed when looked at, this pressure
> accumulates as deferred shrinker work (a mechanism for *temporarily*
> unreclaimable objects) until it causes mayhem in the VFS cache pools.
> 
> In the bug quoted in 69056ee6a8a3 in particular, the excessive
> pressure drove the XFS shrinker into dirty objects, where it caused
> synchronous, IO-bound stalls, even as there was plenty of clean page
> cache that should have been reclaimed instead.

A note on testing: the patch behaves much better on my machine and the
inode shrinker doesn't drop hot page cache anymore, without noticeable
downsides so far.

However, I tried to reproduce the xfs case that caused the
69056ee6a8a3 revert and haven't managed to do so yet on 5.5 plus the
reverted patch. I cannot provoke higher inode sync stalls in the xfs
shrinker regardless of shrinker strategy. Maybe something else changed
since 4.19 and it's less of a concern now.

Nonetheless, I'm interested in opinions on the premise of this
patch. And Yafang is working on his memcg-specific fix for this issue,
so I wanted to put this proposal on the table sooner than later.

Thanks
Rik van Riel Feb. 11, 2020, 7:05 p.m. UTC | #2
On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> The VFS inode shrinker is currently allowed to reclaim inodes with
> populated page cache. As a result it can drop gigabytes of hot and
> active page cache on the floor without consulting the VM (recorded as
> "inodesteal" events in /proc/vmstat).
> 
> This causes real problems in practice. Consider for example how the
> VM
> would cache a source tree, such as the Linux git tree. As large parts
> of the checked out files and the object database are accessed
> repeatedly, the page cache holding this data gets moved to the active
> list, where it's fully (and indefinitely) insulated from one-off
> cache
> moving through the inactive list.

> This behavior of invalidating page cache from the inode shrinker goes
> back to even before the git import of the kernel tree. It may have
> been less noticeable when the VM itself didn't have real workingset
> protection, and floods of one-off cache would push out any active
> cache over time anyway. But the VM has come a long way since then and
> the inode shrinker is now actively subverting its caching strategy.

Two things come to mind when looking at this:
- highmem
- NUMA

IIRC one of the reasons reclaim is done in this way is
because a page cache page in one area of memory (highmem,
or a NUMA node) can end up pinning inode slab memory in
another memory area (normal zone, other NUMA node).

I do not know how much of a concern that still is nowadays,
but it seemed something worth bringing up.
Johannes Weiner Feb. 11, 2020, 7:31 p.m. UTC | #3
On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote:
> On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> > The VFS inode shrinker is currently allowed to reclaim inodes with
> > populated page cache. As a result it can drop gigabytes of hot and
> > active page cache on the floor without consulting the VM (recorded as
> > "inodesteal" events in /proc/vmstat).
> > 
> > This causes real problems in practice. Consider for example how the
> > VM
> > would cache a source tree, such as the Linux git tree. As large parts
> > of the checked out files and the object database are accessed
> > repeatedly, the page cache holding this data gets moved to the active
> > list, where it's fully (and indefinitely) insulated from one-off
> > cache
> > moving through the inactive list.
> 
> > This behavior of invalidating page cache from the inode shrinker goes
> > back to even before the git import of the kernel tree. It may have
> > been less noticeable when the VM itself didn't have real workingset
> > protection, and floods of one-off cache would push out any active
> > cache over time anyway. But the VM has come a long way since then and
> > the inode shrinker is now actively subverting its caching strategy.
> 
> Two things come to mind when looking at this:
> - highmem
> - NUMA
> 
> IIRC one of the reasons reclaim is done in this way is
> because a page cache page in one area of memory (highmem,
> or a NUMA node) can end up pinning inode slab memory in
> another memory area (normal zone, other NUMA node).

That's a good point, highmem does ring a bell now that you mention it.

If we still care, I think this could be solved by doing something
similar to what we do with buffer_heads_over_limit: allow a lowmem
allocation to reclaim page cache inside the highmem zone if the bhs
(or inodes in this case) have accumulated excessively.
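A minimal sketch of what such an escape hatch might look like, assuming a hypothetical inodes_over_limit() threshold analogous to buffer_heads_over_limit -- every name below is invented for illustration and exists in no kernel in this form:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical buffer_heads_over_limit-style check: a lowmem-bound
 * allocation may scan highmem page cache only once lowmem inode
 * objects have accumulated excessively. Illustrative names only. */

static unsigned long inode_slab_pages;            /* lowmem inode usage */
static const unsigned long inode_slab_limit = 1024;

static bool inodes_over_limit(void)
{
	return inode_slab_pages > inode_slab_limit;
}

/* A highmem-zone scan is always fine for allocations that can use
 * highmem; lowmem-bound allocations get it only past the limit. */
static bool may_scan_highmem_cache(bool gfp_is_lowmem_bound)
{
	return !gfp_is_lowmem_bound || inodes_over_limit();
}
```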

AFAICS, we haven't done anything similar for NUMA, so it might not be
much of a problem there. I could imagine this is in part because NUMA
nodes tend to be more balanced in size, and the ratio between cache
memory and inode/bh memory means that these objects won't turn into a
significant externality. Whereas with extreme highmem:lowmem ratios,
they can.
Andrew Morton Feb. 11, 2020, 11:44 p.m. UTC | #4
On Tue, 11 Feb 2020 14:31:01 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote:
> > On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > populated page cache. As a result it can drop gigabytes of hot and
> > > active page cache on the floor without consulting the VM (recorded as
> > > "inodesteal" events in /proc/vmstat).
> > > 
> > > This causes real problems in practice. Consider for example how the
> > > VM
> > > would cache a source tree, such as the Linux git tree. As large parts
> > > of the checked out files and the object database are accessed
> > > repeatedly, the page cache holding this data gets moved to the active
> > > list, where it's fully (and indefinitely) insulated from one-off
> > > cache
> > > moving through the inactive list.
> > 
> > > This behavior of invalidating page cache from the inode shrinker goes
> > > back to even before the git import of the kernel tree. It may have
> > > been less noticeable when the VM itself didn't have real workingset
> > > protection, and floods of one-off cache would push out any active
> > > cache over time anyway. But the VM has come a long way since then and
> > > the inode shrinker is now actively subverting its caching strategy.
> > 
> > Two things come to mind when looking at this:
> > - highmem
> > - NUMA
> > 
> > IIRC one of the reasons reclaim is done in this way is
> > because a page cache page in one area of memory (highmem,
> > or a NUMA node) can end up pinning inode slab memory in
> > another memory area (normal zone, other NUMA node).
> 
> That's a good point, highmem does ring a bell now that you mention it.

Yup, that's why this mechanism exists.  Here:

https://marc.info/?l=git-commits-head&m=103646757213266&w=2

> If we still care, I think this could be solved by doing something
> similar to what we do with buffer_heads_over_limit: allow a lowmem
> allocation to reclaim page cache inside the highmem zone if the bhs
> (or inodes in this case) have accumulated excessively.

Well, reclaiming highmem pagecache at random would be a painful way to
reclaim lowmem inodes.  Better to pick an inode, then shoot down all its
pagecache.  Perhaps we could take its pagecache's aging into account.

Testing this will be a challenge, but the issue was real - a 7GB
highmem machine isn't crazy and I expect the inode has become larger
since those days.

> AFAICS, we haven't done anything similar for NUMA, so it might not be
> much of a problem there. I could imagine this is in part because NUMA
> nodes tend to be more balanced in size, and the ratio between cache
> memory and inode/bh memory means that these objects won't turn into a
> significant externality. Whereas with extreme highmem:lowmem ratios,
> they can.
Linus Torvalds Feb. 12, 2020, 12:28 a.m. UTC | #5
On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Testing this will be a challenge, but the issue was real - a 7GB
> highmem machine isn't crazy and I expect the inode has become larger
> since those days.

Hmm. I would say that in the intervening years a 7GB highmem machine has
indeed become crazy.

It used to be something we kind of supported.

But we really should consider HIGHMEM to be something that is on the
deprecation list. In this day and age, there is no excuse for running
a 32-bit kernel with lots of physical memory.

And if you really want to do that, and have some legacy hardware with
a legacy use case, maybe you should be using a legacy kernel.

I'd personally be perfectly happy to start removing HIGHMEM support again.

                   Linus
Andrew Morton Feb. 12, 2020, 12:47 a.m. UTC | #6
On Tue, 11 Feb 2020 16:28:39 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.
> 
> It used to be something we kind of supported.
> 
> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.
> 

That would be nice.

What's the situation with highmem on ARM?
Linus Torvalds Feb. 12, 2020, 1:03 a.m. UTC | #7
On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> What's the situation with highmem on ARM?

Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
needed it, and I was ranting at some people for repeating all the
mistakes Intel did.

But arm64 doesn't need it, and while 32-bit arm is obviously still
selling, I think that in many ways the switch-over to 64-bit has been
quicker on ARM than it was on x86. Partly because it happened later
(so all the 64-bit teething pains were dealt with), but largely
because everybody ended up actively discouraging 32-bit on the Android
side.

There were a couple of unfortunate early 32-bit arm server attempts,
but they were - predictably - complete garbage and nobody bought them.
They don't exist any more.

So at least my gut feel is that the arm people don't have any big
reason to push for maintaining HIGHMEM support either.

But I'm adding a couple of arm people and the arm list just in case
they have some input.

[ Obvious background for newly added people: we're talking about
making CONFIG_HIGHMEM a deprecated feature and saying that if you want
to run with lots of memory on a 32-bit kernel, you're doing legacy
stuff and can use a legacy kernel ]

              Linus
Matthew Wilcox Feb. 12, 2020, 3:58 a.m. UTC | #8
On Tue, Feb 11, 2020 at 04:28:39PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.
> 
> It used to be something we kind of supported.
> 
> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.

Do we have a use case where people want to run modern 32-bit guest kernels
with more than 896MB of memory to support some horrendous legacy app?
Or is our 32-bit compat story good enough that nobody actually does this?
(Contrariwise, maybe those people are also fine with running a ten year
old kernel because a `uname -r` starting with '3' breaks said app)
Michal Hocko Feb. 12, 2020, 8:09 a.m. UTC | #9
On Tue 11-02-20 16:28:39, Linus Torvalds wrote:
> On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.

Absolutely agreed.

> It used to be something we kind of supported.

And it's been a few years since we have been actively discouraging people
from using 32b kernels with a lot of memory. There are bug reports
popping out from time to time but I do not remember any case where using
64b kernel would be a no-go. So my strong suspicion is that people
simply keep their kernels on 32b without a good reason because it tends
to work most of the time until they hit one of the lowmem problems and
they move over to 64b.

> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.

I wouldn't be opposed at all.
Russell King (Oracle) Feb. 12, 2020, 8:50 a.m. UTC | #10
On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > What's the situation with highmem on ARM?
> 
> Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
> needed it, and I was ranting at some people for repeating all the
> mistakes Intel did.
> 
> But arm64 doesn't need it, and while 32-bit arm is obviously still
> selling, I think that in many ways the switch-over to 64-bit has been
> quicker on ARM than it was on x86. Partly because it happened later
> (so all the 64-bit teething pains were dealt with), but largely
> because everybody ended up actively discouraging 32-bit on the Android
> side.
> 
> There were a couple of unfortunate early 32-bit arm server attempts,
> but they were - predictably - complete garbage and nobody bought them.
> They don't exist any more.
> 
> So at least my gut feel is that the arm people don't have any big
> reason to push for maintaining HIGHMEM support either.
> 
> But I'm adding a couple of arm people and the arm list just in case
> they have some input.
> 
> [ Obvious background for newly added people: we're talking about
> making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> to run with lots of memory on a 32-bit kernel, you're doing legacy
> stuff and can use a legacy kernel ]

Well, the recent 32-bit ARM systems generally have more than 1G
of memory, so make use of highmem as a rule.  You're probably
talking about crippling support for any 32-bit ARM system produced
in the last 8 to 10 years.
Yafang Shao Feb. 12, 2020, 12:25 p.m. UTC | #11
On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The VFS inode shrinker is currently allowed to reclaim inodes with
> populated page cache. As a result it can drop gigabytes of hot and
> active page cache on the floor without consulting the VM (recorded as
> "inodesteal" events in /proc/vmstat).
>
> This causes real problems in practice. Consider for example how the VM
> would cache a source tree, such as the Linux git tree. As large parts
> of the checked out files and the object database are accessed
> repeatedly, the page cache holding this data gets moved to the active
> list, where it's fully (and indefinitely) insulated from one-off cache
> moving through the inactive list.
>
> However, due to the way users interact with the tree, no ongoing open
> file descriptors into the source tree are maintained, and the inodes
> end up on the "unused inode" shrinker LRU. A larger burst of one-off
> cache (find, updatedb, etc.) can now drive the VFS shrinkers to drop
> first the dentries and then the inodes - inodes that contain the most
> valuable data currently held by the page cache - while there is plenty
> of one-off cache that could be reclaimed instead.
>
> This doesn't make sense. The inodes aren't really "unused" as long as
> the VM deems it worthwhile to hold on to their page cache. And the
> shrinker can't possibly guess what is and isn't valuable to the VM
> based on recent inode reference information alone (we could delete
> several thousand lines of reclaim code if it could).
>
> History
>
> This behavior of invalidating page cache from the inode shrinker goes
> back to even before the git import of the kernel tree. It may have
> been less noticeable when the VM itself didn't have real workingset
> protection, and floods of one-off cache would push out any active
> cache over time anyway. But the VM has come a long way since then and
> the inode shrinker is now actively subverting its caching strategy.
>
> As this keeps causing problems for people, there have been several
> attempts to address this.
>
> One recent attempt was to make the inode shrinker simply skip over
> inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
> inodes with many attached pages").
>
> However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
> don't reclaim inodes with many attached pages"") because it caused
> severe reclaim performance problems: Inodes that sit on the shrinker
> LRU are attracting reclaim pressure away from the page cache and
> toward the VFS. If we then permanently exempt sizable portions of this
> pool from actually getting reclaimed when looked at, this pressure
> accumulates as deferred shrinker work (a mechanism for *temporarily*
> unreclaimable objects) until it causes mayhem in the VFS cache pools.
>
> In the bug quoted in 69056ee6a8a3 in particular, the excessive
> pressure drove the XFS shrinker into dirty objects, where it caused
> synchronous, IO-bound stalls, even as there was plenty of clean page
> cache that should have been reclaimed instead.
>
> Another variant of this problem was recently observed, where the
> kernel violates cgroups' memory.low protection settings and reclaims
> page cache way beyond the configured thresholds. It was followed by a
> proposal of a modified form of the reverted commit above, that
> implements memory.low-sensitive shrinker skipping over populated
> inodes on the LRU [1]. However, this proposal continues to run the
> risk of attracting disproportionate reclaim pressure to a pool of
> still-used inodes,

Hi Johannes,

If you really think that is a risk, what about the below additional patch
to fix it?

diff --git a/fs/inode.c b/fs/inode.c
index 80dddbc..61862d9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
                goto out;

        cgroup_size = mem_cgroup_size(memcg);
-       if (inode->i_data.nrpages + protection >= cgroup_size)
+       if (inode->i_data.nrpages)
                reclaimable = false;

 out:

With this additional patch, we skip all inodes in this memcg until all
its page cache pages are reclaimed.

> while not addressing the more generic reclaim
> inversion problem outside of a very specific cgroup application.
>

But I have a different understanding.
This method works like a knob. If you really care about your
workingset (data), you should turn it on (i.e. by using memcg
protection to protect them), while if you don't care about your
workingset (data) then you'd better turn it off. That would be more
flexible.
Regarding your case in the commit log, why not protect your Linux git
tree with memcg protection?

> [1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/
>
> Solution
>
> To fix the reclaim inversion described in the beginning, without
> reintroducing the problems associated with shrinker LRU rotations,
> this patch keeps populated inodes off the LRUs entirely.
>
> Currently, inodes are kept off the shrinker LRU as long as they have
> an elevated i_count, indicating an active user. Unfortunately, the
> page cache cannot simply hold an i_count reference, because unlink()
> *should* result in the inode being dropped and its cache invalidated.
>
> Instead, this patch makes iput_final() consult the state of the page
> cache and punt the LRU linking to the VM if the inode is still
> populated; the VM in turn checks the inode state when it depopulates
> the page cache, and adds the inode to the LRU if necessary.
>
> This is not unlike what we do for dirty inodes, which are moved off
> the LRU permanently until writeback completion puts them back on (iff
> still unused). We can reuse the same code -- inode_add_lru() -- here.
>
> This is also not unlike page reclaim, where the lower VM layer has to
> negotiate state with the higher VFS layer. Follow existing precedent
> and handle the inversion as much as possible on the VM side:
>
> - introduce an I_PAGES flag that the VM maintains under the i_lock, so
>   that any inode code holding that lock can check the page cache state
>   without having to lock and inspect the struct address_space
>
> - introduce inode_pages_set() and inode_pages_clear() to maintain the
>   inode LRU state from the VM side, then update all cache mutators to
>   use them when populating the first cache entry or clearing the last
>
> With this, the concept of "inodesteal" - where the inode shrinker
> drops page cache - is a thing of the past. The VM is in charge of the
> page cache, the inode shrinker is in charge of freeing struct inode.
>
> Footnotes
>
> - For debuggability, add vmstat counters that track the number of
>   times a new cache entry pulls a previously unused inode off the LRU
>   (pginoderescue), as well as how many times existing cache deferred
>   an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
>   for backwards compatibility, but they'll just show 0 now.
>
> - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
>   cache. Not doing so has always been a bit strange, but since most
>   people drop cache and metadata cache together, the inode shrinker
>   would have taken care of them before - no more, so do it VM-side.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/block_dev.c                |   2 +-
>  fs/dax.c                      |  14 +++++
>  fs/drop_caches.c              |   2 +-
>  fs/inode.c                    | 106 ++++++++++++++++++++++++++--------
>  fs/internal.h                 |   2 +-
>  include/linux/fs.h            |  12 ++++
>  include/linux/pagemap.h       |   2 +-
>  include/linux/vm_event_item.h |   3 +-
>  mm/filemap.c                  |  39 ++++++++++---
>  mm/huge_memory.c              |   3 +-
>  mm/truncate.c                 |  34 ++++++++---
>  mm/vmscan.c                   |   6 +-
>  mm/vmstat.c                   |   4 +-
>  mm/workingset.c               |   4 ++
>  14 files changed, 183 insertions(+), 50 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 69bf2fb6f7cd..46f67147ad20 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev)
>  {
>         struct address_space *mapping = bdev->bd_inode->i_mapping;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 return;
>
>         invalidate_bh_lrus();
> diff --git a/fs/dax.c b/fs/dax.c
> index 35da144375a0..7f3ce4612272 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  {
>         unsigned long index = xas->xa_index;
>         bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
> +       int populated;
>         void *entry;
>
>  retry:
> +       populated = 0;
>         xas_lock_irq(xas);
>         entry = get_unlocked_entry(xas, order);
>
> @@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
>                 xas_store(xas, NULL);   /* undo the PMD join */
>                 dax_wake_entry(xas, entry, true);
>                 mapping->nrexceptional--;
> +               if (mapping_empty(mapping))
> +                       populated = -1;
>                 entry = NULL;
>                 xas_set(xas, index);
>         }
> @@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas,
>                 dax_lock_entry(xas, entry);
>                 if (xas_error(xas))
>                         goto out_unlock;
> +               if (mapping_empty(mapping))
> +                       populated++;
>                 mapping->nrexceptional++;
>         }
>
>  out_unlock:
>         xas_unlock_irq(xas);
> +       if (populated == -1)
> +               inode_pages_clear(mapping->host);
> +       else if (populated == 1)
> +               inode_pages_set(mapping->host);
>         if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
>                 goto retry;
>         if (xas->xa_node == XA_ERROR(-ENOMEM))
> @@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>                                           pgoff_t index, bool trunc)
>  {
>         XA_STATE(xas, &mapping->i_pages, index);
> +       bool empty = false;
>         int ret = 0;
>         void *entry;
>
> @@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>         dax_disassociate_entry(entry, mapping, trunc);
>         xas_store(&xas, NULL);
>         mapping->nrexceptional--;
> +       empty = mapping_empty(mapping);
>         ret = 1;
>  out:
>         put_unlocked_entry(&xas, entry);
>         xas_unlock_irq(&xas);
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>         return ret;
>  }
>
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index dc1a1d5d825b..a5e9e9053474 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>                  * we need to reschedule to avoid softlockups.
>                  */
>                 if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -                   (inode->i_mapping->nrpages == 0 && !need_resched())) {
> +                   (mapping_empty(inode->i_mapping) && !need_resched())) {
>                         spin_unlock(&inode->i_lock);
>                         continue;
>                 }
> diff --git a/fs/inode.c b/fs/inode.c
> index 7d57068b6b7a..575b780fa9bb 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -430,25 +430,87 @@ static void inode_lru_list_add(struct inode *inode)
>                 inode->i_state |= I_REFERENCED;
>  }
>
> +static void inode_lru_list_del(struct inode *inode)
> +{
> +       if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> +               this_cpu_dec(nr_unused);
> +}
> +
>  /*
>   * Add inode to LRU if needed (inode is unused and clean).
>   *
>   * Needs inode->i_lock held.
>   */
> -void inode_add_lru(struct inode *inode)
> +bool inode_add_lru(struct inode *inode)
>  {
> -       if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
> -                               I_FREEING | I_WILL_FREE)) &&
> -           !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
> -               inode_lru_list_add(inode);
> +       if (inode->i_state &
> +           (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
> +               return false;
> +       if (atomic_read(&inode->i_count))
> +               return false;
> +       if (!(inode->i_sb->s_flags & SB_ACTIVE))
> +               return false;
> +       inode_lru_list_add(inode);
> +       return true;
>  }
>
> +/**
> + * inode_pages_set - mark the inode as holding page cache
> + * @inode: the inode whose first cache page was just added
> + *
> + * Tell the VFS that this inode has populated page cache and must not
> + * be reclaimed by the inode shrinker.
> + *
> + * The caller must hold the page lock of the just-added page: by
> + * pinning the page, the page cache cannot become depopulated, and we
> + * can safely set I_PAGES without a race check under the i_pages lock.
> + *
> + * This function acquires the i_lock.
> + */
> +void inode_pages_set(struct inode *inode)
> +{
> +       spin_lock(&inode->i_lock);
> +       if (!(inode->i_state & I_PAGES)) {
> +               inode->i_state |= I_PAGES;
> +               if (!list_empty(&inode->i_lru)) {
> +                       count_vm_event(PGINODERESCUE);
> +                       inode_lru_list_del(inode);
> +               }
> +       }
> +       spin_unlock(&inode->i_lock);
> +}
>
> -static void inode_lru_list_del(struct inode *inode)
> +/**
> + * inode_pages_clear - mark the inode as not holding page cache
> + * @inode: the inode whose last cache page was just removed
> + *
> + * Tell the VFS that the inode no longer holds page cache and that its
> + * lifetime is to be handed over to the inode shrinker LRU.
> + *
> + * This function acquires the i_lock and the i_pages lock.
> + */
> +void inode_pages_clear(struct inode *inode)
>  {
> +       struct address_space *mapping = &inode->i_data;
> +       bool add_to_lru = false;
> +       unsigned long flags;
>
> -       if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> -               this_cpu_dec(nr_unused);
> +       spin_lock(&inode->i_lock);
> +
> +       xa_lock_irqsave(&mapping->i_pages, flags);
> +       if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
> +               inode->i_state &= ~I_PAGES;
> +               add_to_lru = true;
> +       }
> +       xa_unlock_irqrestore(&mapping->i_pages, flags);
> +
> +       if (add_to_lru) {
> +               WARN_ON_ONCE(!list_empty(&inode->i_lru));
> +               if (inode_add_lru(inode))
> +                       __count_vm_event(PGINODEDELAYED);
> +       }
> +
> +       spin_unlock(&inode->i_lock);
>  }
>
>  /**
> @@ -742,6 +804,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>         if (!spin_trylock(&inode->i_lock))
>                 return LRU_SKIP;
>
> +       WARN_ON_ONCE(inode->i_state & I_PAGES);
> +
>         /*
>          * Referenced or dirty inodes are still in use. Give them another pass
>          * through the LRU as we cannot reclaim them now.
> @@ -761,23 +825,17 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>                 return LRU_ROTATE;
>         }
>
> -       if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> -               __iget(inode);
> +       /*
> +        * Populated inodes shouldn't be on the shrinker LRU, but they
> +        * can be briefly visible when a new page is added to an inode
> +        * that was already linked but inode_pages_set() hasn't run
> +        * yet to move them off.
> +        */
> +       if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> +               list_lru_isolate(lru, &inode->i_lru);
>                 spin_unlock(&inode->i_lock);
> -               spin_unlock(lru_lock);
> -               if (remove_inode_buffers(inode)) {
> -                       unsigned long reap;
> -                       reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
> -                       if (current_is_kswapd())
> -                               __count_vm_events(KSWAPD_INODESTEAL, reap);
> -                       else
> -                               __count_vm_events(PGINODESTEAL, reap);
> -                       if (current->reclaim_state)
> -                               current->reclaim_state->reclaimed_slab += reap;
> -               }
> -               iput(inode);
> -               spin_lock(lru_lock);
> -               return LRU_RETRY;
> +               this_cpu_dec(nr_unused);
> +               return LRU_REMOVED;
>         }
>
>         WARN_ON(inode->i_state & I_NEW);
> diff --git a/fs/internal.h b/fs/internal.h
> index f3f280b952a3..4a9dc77e817b 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -139,7 +139,7 @@ extern int vfs_open(const struct path *, struct file *);
>   * inode.c
>   */
>  extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
> -extern void inode_add_lru(struct inode *inode);
> +extern bool inode_add_lru(struct inode *inode);
>  extern int dentry_needs_remove_privs(struct dentry *dentry);
>
>  /*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..a98d9dee39f4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -585,6 +585,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
>         atomic_inc(&mapping->i_mmap_writable);
>  }
>
> +static inline bool mapping_empty(struct address_space *mapping)
> +{
> +       return mapping->nrpages + mapping->nrexceptional == 0;
> +}
> +
>  /*
>   * Use sequence counter to get consistent i_size on 32-bit processors.
>   */
> @@ -2150,6 +2155,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>   *
>   * I_CREATING          New object's inode in the middle of setting up.
>   *
> + * I_PAGES             Inode is holding page cache that needs to get reclaimed
> + *                     first before the inode can go onto the shrinker LRU.
> + *
>   * Q: What is the difference between I_WILL_FREE and I_FREEING?
>   */
>  #define I_DIRTY_SYNC           (1 << 0)
> @@ -2172,6 +2180,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>  #define I_WB_SWITCH            (1 << 13)
>  #define I_OVL_INUSE            (1 << 14)
>  #define I_CREATING             (1 << 15)
> +#define I_PAGES                        (1 << 16)
>
>  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> @@ -3097,6 +3106,9 @@ static inline void remove_inode_hash(struct inode *inode)
>                 __remove_inode_hash(inode);
>  }
>
> +extern void inode_pages_set(struct inode *inode);
> +extern void inode_pages_clear(struct inode *inode);
> +
>  extern void inode_sb_list_add(struct inode *inode);
>
>  #ifdef CONFIG_BLOCK
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..ae4d90bd0873 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -609,7 +609,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>                                 pgoff_t index, gfp_t gfp_mask);
>  extern void delete_from_page_cache(struct page *page);
> -extern void __delete_from_page_cache(struct page *page, void *shadow);
> +extern bool __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
>  void delete_from_page_cache_batch(struct address_space *mapping,
>                                   struct pagevec *pvec);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 47a3441cf4c4..f31026ccf590 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
>  #endif
> -               PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
> +               SLABS_SCANNED,
> +               PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
>                 KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>                 PAGEOUTRUN, PGROTATED,
>                 DROP_PAGECACHE, DROP_SLAB,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1784478270e1..fcc24b3b3540 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -116,8 +116,8 @@
>   *   ->tasklist_lock            (memory_failure, collect_procs_ao)
>   */
>
> -static void page_cache_delete(struct address_space *mapping,
> -                                  struct page *page, void *shadow)
> +static bool __must_check page_cache_delete(struct address_space *mapping,
> +                                          struct page *page, void *shadow)
>  {
>         XA_STATE(xas, &mapping->i_pages, page->index);
>         unsigned int nr = 1;
> @@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping,
>                 smp_wmb();
>         }
>         mapping->nrpages -= nr;
> +
> +       return mapping_empty(mapping);
>  }
>
>  static void unaccount_page_cache_page(struct address_space *mapping,
> @@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping,
>   * Delete a page from the page cache and free it. Caller has to make
>   * sure the page is locked and that nobody else uses it - or that usage
>   * is safe.  The caller must hold the i_pages lock.
> + *
> + * If this returns true, the caller must call inode_pages_clear()
> + * after dropping the i_pages lock.
>   */
> -void __delete_from_page_cache(struct page *page, void *shadow)
> +bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
>  {
>         struct address_space *mapping = page->mapping;
>
>         trace_mm_filemap_delete_from_page_cache(page);
>
>         unaccount_page_cache_page(mapping, page);
> -       page_cache_delete(mapping, page, shadow);
> +       return page_cache_delete(mapping, page, shadow);
>  }
>
>  static void page_cache_free_page(struct address_space *mapping,
> @@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
>         unsigned long flags;
> +       bool empty;
>
>         BUG_ON(!PageLocked(page));
>         xa_lock_irqsave(&mapping->i_pages, flags);
> -       __delete_from_page_cache(page, NULL);
> +       empty = __delete_from_page_cache(page, NULL);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         page_cache_free_page(mapping, page);
>  }
>  EXPORT_SYMBOL(delete_from_page_cache);
> @@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache);
>   *
>   * The function expects the i_pages lock to be held.
>   */
> -static void page_cache_delete_batch(struct address_space *mapping,
> -                            struct pagevec *pvec)
> +static bool __must_check page_cache_delete_batch(struct address_space *mapping,
> +                                                struct pagevec *pvec)
>  {
>         XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
>         int total_pages = 0;
> @@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping,
>                 total_pages++;
>         }
>         mapping->nrpages -= total_pages;
> +
> +       return mapping_empty(mapping);
>  }
>
>  void delete_from_page_cache_batch(struct address_space *mapping,
>                                   struct pagevec *pvec)
>  {
>         int i;
> +       bool empty;
>         unsigned long flags;
>
>         if (!pagevec_count(pvec))
> @@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping,
>
>                 unaccount_page_cache_page(mapping, pvec->pages[i]);
>         }
> -       page_cache_delete_batch(mapping, pvec);
> +       empty = page_cache_delete_batch(mapping, pvec);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         for (i = 0; i < pagevec_count(pvec); i++)
>                 page_cache_free_page(mapping, pvec->pages[i]);
>  }
> @@ -831,9 +846,10 @@ static int __add_to_page_cache_locked(struct page *page,
>                                       void **shadowp)
>  {
>         XA_STATE(xas, &mapping->i_pages, offset);
> +       int error;
>         int huge = PageHuge(page);
>         struct mem_cgroup *memcg;
> -       int error;
> +       bool populated = false;
>         void *old;
>
>         VM_BUG_ON_PAGE(!PageLocked(page), page);
> @@ -860,6 +876,7 @@ static int __add_to_page_cache_locked(struct page *page,
>                 if (xas_error(&xas))
>                         goto unlock;
>
> +               populated = mapping_empty(mapping);
>                 if (xa_is_value(old)) {
>                         mapping->nrexceptional--;
>                         if (shadowp)
> @@ -880,6 +897,10 @@ static int __add_to_page_cache_locked(struct page *page,
>         if (!huge)
>                 mem_cgroup_commit_charge(page, memcg, false, false);
>         trace_mm_filemap_add_to_page_cache(page);
> +
> +       if (populated)
> +               inode_pages_set(mapping->host);
> +
>         return 0;
>  error:
>         page->mapping = NULL;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b08b199f9a11..8b3e33a52d93 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2535,7 +2535,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                 /* Some pages can be beyond i_size: drop them from page cache */
>                 if (head[i].index >= end) {
>                         ClearPageDirty(head + i);
> -                       __delete_from_page_cache(head + i, NULL);
> +                       /* We know we're not removing the last page */
> +                       (void)__delete_from_page_cache(head + i, NULL);
>                         if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
>                                 shmem_uncharge(head->mapping->host, 1);
>                         put_page(head + i);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index dd9ebc1da356..8fb6c2f762bc 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -31,24 +31,31 @@
>   * itself locked.  These unlocked entries need verification under the tree
>   * lock.
>   */
> -static inline void __clear_shadow_entry(struct address_space *mapping,
> -                               pgoff_t index, void *entry)
> +static bool __must_check __clear_shadow_entry(struct address_space *mapping,
> +                                             pgoff_t index, void *entry)
>  {
>         XA_STATE(xas, &mapping->i_pages, index);
>
>         xas_set_update(&xas, workingset_update_node);
>         if (xas_load(&xas) != entry)
> -               return;
> +               return false;
>         xas_store(&xas, NULL);
>         mapping->nrexceptional--;
> +
> +       return mapping_empty(mapping);
>  }
>
>  static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
>                                void *entry)
>  {
> +       bool empty;
> +
>         xa_lock_irq(&mapping->i_pages);
> -       __clear_shadow_entry(mapping, index, entry);
> +       empty = __clear_shadow_entry(mapping, index, entry);
>         xa_unlock_irq(&mapping->i_pages);
> +
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>  }
>
>  /*
> @@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>                                 pgoff_t end)
>  {
>         int i, j;
> -       bool dax, lock;
> +       bool dax, lock, empty = false;
>
>         /* Handled by shmem itself */
>         if (shmem_mapping(mapping))
> @@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>                         continue;
>                 }
>
> -               __clear_shadow_entry(mapping, index, page);
> +               if (__clear_shadow_entry(mapping, index, page))
> +                       empty = true;
>         }
>
>         if (lock)
>                 xa_unlock_irq(&mapping->i_pages);
> +
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         pvec->nr = j;
>  }
>
> @@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>         pgoff_t         index;
>         int             i;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 goto out;
>
>         /* Offsets within partial pages */
> @@ -636,6 +648,7 @@ static int
>  invalidate_complete_page2(struct address_space *mapping, struct page *page)
>  {
>         unsigned long flags;
> +       bool empty;
>
>         if (page->mapping != mapping)
>                 return 0;
> @@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
>                 goto failed;
>
>         BUG_ON(page_has_private(page));
> -       __delete_from_page_cache(page, NULL);
> +       empty = __delete_from_page_cache(page, NULL);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         if (mapping->a_ops->freepage)
>                 mapping->a_ops->freepage(page);
>
> @@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>         int ret2 = 0;
>         int did_range_unmap = 0;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 goto out;
>
>         pagevec_init(&pvec);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c05eb9efec07..c82e9831003f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>         } else {
>                 void (*freepage)(struct page *);
>                 void *shadow = NULL;
> +               bool empty;
>
>                 freepage = mapping->a_ops->freepage;
>                 /*
> @@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>                 if (reclaimed && page_is_file_cache(page) &&
>                     !mapping_exiting(mapping) && !dax_mapping(mapping))
>                         shadow = workingset_eviction(page, target_memcg);
> -               __delete_from_page_cache(page, shadow);
> +               empty = __delete_from_page_cache(page, shadow);
>                 xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +               if (empty)
> +                       inode_pages_clear(mapping->host);
> +
>                 if (freepage != NULL)
>                         freepage(page);
>         }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 78d53378db99..ae802253f71c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1203,9 +1203,11 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_NUMA
>         "zone_reclaim_failed",
>  #endif
> -       "pginodesteal",
>         "slabs_scanned",
> +       "pginodesteal",
>         "kswapd_inodesteal",
> +       "pginoderescue",
> +       "pginodedelayed",
>         "kswapd_low_wmark_hit_quickly",
>         "kswapd_high_wmark_hit_quickly",
>         "pageoutrun",
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 474186b76ced..7ce9c74ebfdb 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>         struct xa_node *node = container_of(item, struct xa_node, private_list);
>         XA_STATE(xas, node->array, 0);
>         struct address_space *mapping;
> +       bool empty = false;
>         int ret;
>
>         /*
> @@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>         if (WARN_ON_ONCE(node->count != node->nr_values))
>                 goto out_invalid;
>         mapping->nrexceptional -= node->nr_values;
> +       empty = mapping_empty(mapping);
>         xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
>         xas.xa_offset = node->offset;
>         xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
> @@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>
>  out_invalid:
>         xa_unlock_irq(&mapping->i_pages);
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>         ret = LRU_REMOVED_RETRY;
>  out:
>         cond_resched();
> --
> 2.24.1
>
>
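The central handoff in the patch above (detect, while still under the i_pages lock, that the last page or shadow entry went away, then call the notification only after dropping that lock, since inode_pages_clear() must take i_lock itself) can be sketched as a toy user-space model; the names below are illustrative, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the patch's two-phase pattern. In the kernel, the
 * deletion helper runs under the i_pages spinlock and only *reports*
 * emptiness; the caller acts on that report after unlocking. */
struct toy_mapping {
	unsigned long nrpages;		/* regular cache pages */
	unsigned long nrexceptional;	/* shadow/DAX entries */
};

static bool toy_mapping_empty(const struct toy_mapping *m)
{
	return m->nrpages + m->nrexceptional == 0;
}

/* Called with the (imaginary) i_pages lock held; returns true when
 * the caller must notify the inode after dropping the lock. */
static bool toy_delete_page(struct toy_mapping *m)
{
	m->nrpages--;
	return toy_mapping_empty(m);
}
```

A caller would hold the lock across toy_delete_page(), drop it, and only then act on a true return, mirroring how delete_from_page_cache() defers inode_pages_clear() in the patch.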
Johannes Weiner Feb. 12, 2020, 4:35 p.m. UTC | #12
On Tue, Feb 11, 2020 at 03:44:38PM -0800, Andrew Morton wrote:
> On Tue, 11 Feb 2020 14:31:01 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote:
> > > On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote:
> > > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > > populated page cache. As a result it can drop gigabytes of hot and
> > > > active page cache on the floor without consulting the VM (recorded as
> > > > "inodesteal" events in /proc/vmstat).
> > > > 
> > > > This causes real problems in practice. Consider for example how the
> > > > VM
> > > > would cache a source tree, such as the Linux git tree. As large parts
> > > > of the checked out files and the object database are accessed
> > > > repeatedly, the page cache holding this data gets moved to the active
> > > > list, where it's fully (and indefinitely) insulated from one-off
> > > > cache
> > > > moving through the inactive list.
> > > 
> > > > This behavior of invalidating page cache from the inode shrinker goes
> > > > back to even before the git import of the kernel tree. It may have
> > > > been less noticeable when the VM itself didn't have real workingset
> > > > protection, and floods of one-off cache would push out any active
> > > > cache over time anyway. But the VM has come a long way since then and
> > > > the inode shrinker is now actively subverting its caching strategy.
> > > 
> > > Two things come to mind when looking at this:
> > > - highmem
> > > - NUMA
> > > 
> > > IIRC one of the reasons reclaim is done in this way is
> > > because a page cache page in one area of memory (highmem,
> > > or a NUMA node) can end up pinning inode slab memory in
> > > another memory area (normal zone, other NUMA node).
> > 
> > That's a good point, highmem does ring a bell now that you mention it.
> 
> Yup, that's why this mechanism exists.  Here:
> 
> https://marc.info/?l=git-commits-head&m=103646757213266&w=2

Ah, thanks for digging that up, I did not know that.

> > If we still care, I think this could be solved by doing something
> > similar to what we do with buffer_heads_over_limit: allow a lowmem
> > allocation to reclaim page cache inside the highmem zone if the bhs
> > (or inodes in this case) have accumulated excessively.
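
[A minimal sketch of what such a buffer_heads_over_limit-style trigger could look like; the function and both parameter names are invented for illustration and are not existing kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold check, modeled on buffer_heads_over_limit:
 * once inodes pinned only by their page cache accumulate past a
 * tunable limit, a lowmem allocation would be allowed to reclaim the
 * highmem cache pinning them. */
static bool pinned_inodes_over_limit(unsigned long nr_cache_pinned_inodes,
				     unsigned long pinned_inode_limit)
{
	return nr_cache_pinned_inodes > pinned_inode_limit;
}
```

Reclaim would consult a check of this shape roughly the way vmscan consults buffer_heads_over_limit today, dropping cache-pinned inodes only when lowmem pressure actually warrants it.]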
> 
> Well, reclaiming highmem pagecache at random would be a painful way to
> reclaim lowmem inodes.  Better to pick an inode then shoot down all its
> pagecache.  Perhaps we could take its pagecache's aging into account.

That reminds me of trying to correlate inode pages in reclaim to batch
the cache tree lock, slab page objects in the shrinker to free whole
pages etc. We never managed to actually do that. :-)

> Testing this will be a challenge, but the issue was real - a 7GB
> highmem machine isn't crazy and I expect the inode has become larger
> since those days.

Since the cache purging code was written for highmem scenarios, how
about making it specific to CONFIG_HIGHMEM at least?

That way we improve the situation for the more common setups, without
regressing highmem configurations. And if somebody wanted to improve
the CONFIG_HIGHMEM behavior as well, they could still do so.

Something like the below delta on top of my patch?

---
 fs/inode.c         | 44 ++++++++++++++++++++++++++++++++++++++++----
 include/linux/fs.h |  5 +++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 575b780fa9bb..45b2abd4fef6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -454,6 +454,18 @@ bool inode_add_lru(struct inode *inode)
 	return true;
 }
 
+/*
+ * Usually, inodes become reclaimable when they are no longer
+ * referenced and their page cache has been reclaimed. The following
+ * API allows the VM to communicate cache population state to the VFS.
+ *
+ * However, on CONFIG_HIGHMEM we can't wait for the page cache to go
+ * away: cache pages allocated in a large highmem zone could pin
+ * struct inode memory allocated in relatively small lowmem zones. So
+ * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime.
+ */
+
+#ifndef CONFIG_HIGHMEM
 /**
  * inode_pages_set - mark the inode as holding page cache
  * @inode: the inode whose first cache page was just added
@@ -512,6 +524,7 @@ void inode_pages_clear(struct inode *inode)
 
 	spin_unlock(&inode->i_lock);
 }
+#endif /* !CONFIG_HIGHMEM */
 
 /**
  * inode_sb_list_add - add inode to the superblock list of inodes
@@ -826,16 +839,39 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	}
 
 	/*
-	 * Populated inodes shouldn't be on the shrinker LRU, but they
-	 * can be briefly visible when a new page is added to an inode
-	 * that was already linked but inode_pages_set() hasn't run
-	 * yet to move them off.
+	 * Usually, populated inodes shouldn't be on the shrinker LRU,
+	 * but they can be briefly visible when a new page is added to
+	 * an inode that was already linked but inode_pages_set()
+	 * hasn't run yet to move them off.
+	 *
+	 * The other exception is on HIGHMEM systems: highmem cache
+	 * can pin lowmem struct inodes, and we might be in dire
+	 * straits in the lower zones. Purge cache to free the inode.
 	 */
 	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
+#ifdef CONFIG_HIGHMEM
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(lru_lock);
+		if (remove_inode_buffers(inode)) {
+			unsigned long reap;
+			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
+			if (current_is_kswapd())
+				__count_vm_events(KSWAPD_INODESTEAL, reap);
+			else
+				__count_vm_events(PGINODESTEAL, reap);
+			if (current->reclaim_state)
+				current->reclaim_state->reclaimed_slab += reap;
+		}
+		iput(inode);
+		spin_lock(lru_lock);
+		return LRU_RETRY;
+#else
 		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
 		this_cpu_dec(nr_unused);
 		return LRU_REMOVED;
+#endif
 	}
 
 	WARN_ON(inode->i_state & I_NEW);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a98d9dee39f4..abdb3fd3432f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3106,8 +3106,13 @@ static inline void remove_inode_hash(struct inode *inode)
 		__remove_inode_hash(inode);
 }
 
+#ifndef CONFIG_HIGHMEM
 extern void inode_pages_set(struct inode *inode);
 extern void inode_pages_clear(struct inode *inode);
+#else
+static inline void inode_pages_set(struct inode *inode) {}
+static inline void inode_pages_clear(struct inode *inode) {}
+#endif
 
 extern void inode_sb_list_add(struct inode *inode);
Johannes Weiner Feb. 12, 2020, 4:42 p.m. UTC | #13
On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Another variant of this problem was recently observed, where the
> > kernel violates cgroups' memory.low protection settings and reclaims
> > page cache way beyond the configured thresholds. It was followed by a
> > proposal of a modified form of the reverted commit above, that
> > implements memory.low-sensitive shrinker skipping over populated
> > inodes on the LRU [1]. However, this proposal continues to run the
> > risk of attracting disproportionate reclaim pressure to a pool of
> > still-used inodes,
> 
> Hi Johannes,
> 
> If you really think that is a risk, what about the additional patch
> below to fix this risk?
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 80dddbc..61862d9 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
>                 goto out;
> 
>         cgroup_size = mem_cgroup_size(memcg);
> -       if (inode->i_data.nrpages + protection >= cgroup_size)
> +       if (inode->i_data.nrpages)
>                 reclaimable = false;
> 
>  out:
> 
> With this additional patch, we skip all inodes in this memcg until all
> its page cache pages are reclaimed.

Well that's something we've tried and had to revert because it caused
issues in slab reclaim. See the History part of my changelog.

> > while not addressing the more generic reclaim
> > inversion problem outside of a very specific cgroup application.
> >
> 
> But I have a different understanding.  This method works like a
> knob. If you really care about your workingset (data), you should
> turn it on (i.e. by using memcg protection to protect them), while
> if you don't care about your workingset (data) then you'd better
> turn it off. That would be more flexible.  Regarding your case in the
> commit log, why not protect your linux git tree with memcg
> protection?

I can't imagine a scenario where I *wouldn't* care about my
workingset, though. Why should it be opt-in, not the default?
Andrew Morton Feb. 12, 2020, 6:26 p.m. UTC | #14
On Wed, 12 Feb 2020 11:35:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Since the cache purging code was written for highmem scenarios, how
> about making it specific to CONFIG_HIGHMEM at least?

Why do I have memories of suggesting this a couple of weeks ago ;)

> That way we improve the situation for the more common setups, without
> regressing highmem configurations. And if somebody wanted to improve
> the CONFIG_HIGHMEM behavior as well, they could still do so.
> 
> Something like the below delta on top of my patch?

Does it need to be that complicated?  What's wrong with

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -761,6 +761,10 @@ static enum lru_status inode_lru_isolate
 		return LRU_ROTATE;
 	}
 
+#ifdef CONFIG_HIGHMEM
+	/*
+	 * lengthy blah
+	 */
 	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
@@ -779,6 +783,7 @@ static enum lru_status inode_lru_isolate
 		spin_lock(lru_lock);
 		return LRU_RETRY;
 	}
+#endif
 
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
Johannes Weiner Feb. 12, 2020, 6:52 p.m. UTC | #15
On Wed, Feb 12, 2020 at 10:26:45AM -0800, Andrew Morton wrote:
> On Wed, 12 Feb 2020 11:35:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Since the cache purging code was written for highmem scenarios, how
> > about making it specific to CONFIG_HIGHMEM at least?
> 
> Why do I have memories of suggesting this a couple of weeks ago ;)

Sorry, you did. I went back and found your email now. It completely
slipped my mind after that thread went off into another direction.

> > That way we improve the situation for the more common setups, without
> > regressing highmem configurations. And if somebody wanted to improve
> > the CONFIG_HIGHMEM behavior as well, they could still do so.
> > 
> > Something like the below delta on top of my patch?
> 
> Does it need to be that complicated?  What's wrong with
> 
> --- a/fs/inode.c~a
> +++ a/fs/inode.c
> @@ -761,6 +761,10 @@ static enum lru_status inode_lru_isolate
>  		return LRU_ROTATE;
>  	}
>  
> +#ifdef CONFIG_HIGHMEM
> +	/*
> +	 * lengthy blah
> +	 */
>  	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> @@ -779,6 +783,7 @@ static enum lru_status inode_lru_isolate
>  		spin_lock(lru_lock);
>  		return LRU_RETRY;
>  	}
> +#endif

Pages can show up here even under !CONFIG_HIGHMEM. Because of the lock
order to maintain LRU state (i_lock -> xa_lock), when the page cache
inserts new pages it doesn't unlink the inode from the LRU atomically,
and the shrinker might get here before inode_pages_set(). In that case
we need the shrinker to punt the inode off the LRU (the #else branch).

>  	WARN_ON(inode->i_state & I_NEW);
>  	inode->i_state |= I_FREEING;
> _
> 
> Whatever we do will need plenty of testing.  It wouldn't surprise me
> if there are people who unknowingly benefit from this code on
> 64-bit machines.

If we agree this is the way to go, I can put the patch into our tree
and gather data from the Facebook fleet before we merge it.
Yafang Shao Feb. 13, 2020, 1:47 a.m. UTC | #16
On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > Another variant of this problem was recently observed, where the
> > > kernel violates cgroups' memory.low protection settings and reclaims
> > > page cache way beyond the configured thresholds. It was followed by a
> > > proposal of a modified form of the reverted commit above, that
> > > implements memory.low-sensitive shrinker skipping over populated
> > > inodes on the LRU [1]. However, this proposal continues to run the
> > > risk of attracting disproportionate reclaim pressure to a pool of
> > > still-used inodes,
> >
> > Hi Johannes,
> >
> > If you really think that is a risk, what about the below additional patch
> > to fix this risk?
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 80dddbc..61862d9 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
> >                 goto out;
> >
> >         cgroup_size = mem_cgroup_size(memcg);
> > -       if (inode->i_data.nrpages + protection >= cgroup_size)
> > +       if (inode->i_data.nrpages)
> >                 reclaimable = false;
> >
> >  out:
> >
> > With this additional patch, we skip all inodes in this memcg until all
> > its page cache pages are reclaimed.
>
> Well that's something we've tried and had to revert because it caused
> issues in slab reclaim. See the History part of my changelog.
>

You misunderstood it.
The reverted patch skips all inodes in the system, while this patch
only works when you turn on memcg.{min, low} protection.
IOW, it is not the default behavior; it only takes effect when you want
it to, and only affects your targeted memcg rather than the whole system.

> > > while not addressing the more generic reclaim
> > > inversion problem outside of a very specific cgroup application.
> > >
> >
> > But I have a different understanding.  This method works like a
> > knob. If you really care about your workingset (data), you should
> > turn it on (i.e. by using memcg protection to protect them), while
> > if you don't care about your workingset (data) then you'd better
> > turn it off. That would be more flexible.  Regarding your case in the
> > commit log, why not protect your linux git tree with memcg
> > protection?
>
> I can't imagine a scenario where I *wouldn't* care about my
> workingset, though. Why should it be opt-in, not the default?

Because the default behavior has caused the XFS performance hit.
(I haven't checked your patch carefully, so I don't know whether your
patch fixes it yet.)


Thanks

Yafang
Lucas Stach Feb. 13, 2020, 9:50 a.m. UTC | #17
On Mi, 2020-02-12 at 08:50 +0000, Russell King - ARM Linux admin wrote:
> On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> > On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > What's the situation with highmem on ARM?
> > 
> > Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
> > needed it, and I was ranting at some people for repeating all the
> > mistakes Intel did.
> > 
> > But arm64 doesn't need it, and while 32-bit arm is obviously still
> > selling, I think that in many ways the switch-over to 64-bit has been
> > quicker on ARM than it was on x86. Partly because it happened later
> > (so all the 64-bit teething pains were dealt with), but largely
> > because everybody ended up actively discouraging 32-bit on the Android
> > side.
> > 
> > There were a couple of unfortunate early 32-bit arm server attempts,
> > but they were - predictably - complete garbage and nobody bought them.
> > They don't exist any more.
> > 
> > So at least my gut feel is that the arm people don't have any big
> > reason to push for maintaining HIGHMEM support either.
> > 
> > But I'm adding a couple of arm people and the arm list just in case
> > they have some input.
> > 
> > [ Obvious background for newly added people: we're talking about
> > making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> > to run with lots of memory on a 32-bit kernel, you're doing legacy
> > stuff and can use a legacy kernel ]
> 
> Well, the recent 32-bit ARM systems generally have more than 1G
> of memory, so make use of highmem as a rule.  You're probably
> talking about crippling support for any 32-bit ARM system produced
> in the last 8 to 10 years.

I know of quite a few embedded ARMv7 systems that will need to be
supported for at least 10 years from now, with RAM sizes between 1GB
and even the full 4GB (well 3.75GB due to MMIO needing some address
space). Deprecating highmem would severely cripple our ability to
support those platforms with new kernels.

Regards,
Lucas
Johannes Weiner Feb. 13, 2020, 1:46 p.m. UTC | #18
On Thu, Feb 13, 2020 at 09:47:29AM +0800, Yafang Shao wrote:
> On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> > > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > Another variant of this problem was recently observed, where the
> > > > kernel violates cgroups' memory.low protection settings and reclaims
> > > > page cache way beyond the configured thresholds. It was followed by a
> > > > proposal of a modified form of the reverted commit above, that
> > > > implements memory.low-sensitive shrinker skipping over populated
> > > > inodes on the LRU [1]. However, this proposal continues to run the
> > > > risk of attracting disproportionate reclaim pressure to a pool of
> > > > still-used inodes,
> > >
> > > Hi Johannes,
> > >
> > > If you really think that is a risk, what about the below additional patch
> > > to fix this risk?
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 80dddbc..61862d9 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
> > >                 goto out;
> > >
> > >         cgroup_size = mem_cgroup_size(memcg);
> > > -       if (inode->i_data.nrpages + protection >= cgroup_size)
> > > +       if (inode->i_data.nrpages)
> > >                 reclaimable = false;
> > >
> > >  out:
> > >
> > > With this additional patch, we skip all inodes in this memcg until all
> > > its page cache pages are reclaimed.
> >
> > Well that's something we've tried and had to revert because it caused
> > issues in slab reclaim. See the History part of my changelog.
> 
> You misunderstood it.
> The reverted patch skips all inodes in the system, while this patch
> only works when you turn on memcg.{min, low} protection.
> IOW, it is not the default behavior; it only takes effect when you want
> it to, and only affects your targeted memcg rather than the whole system.

I understand perfectly well.

Keeping unreclaimable inodes on the shrinker LRU causes the shrinker
to build up excessive pressure on all VFS objects. This is a
bug. Making it cgroup-specific doesn't make it less of a bug, it just
means you only hit the bug when you use cgroup memory protection.

> > > > while not addressing the more generic reclaim
> > > > inversion problem outside of a very specific cgroup application.
> > > >
> > >
> > > But I have a different understanding.  This method works like a
> > > knob. If you really care about your workingset (data), you should
> > > turn it on (i.e. by using memcg protection to protect them), while
> > > if you don't care about your workingset (data) then you'd better
> > > turn it off. That would be more flexible.  Regarding your case in the
> > > commit log, why not protect your linux git tree with memcg
> > > protection?
> >
> > I can't imagine a scenario where I *wouldn't* care about my
> > workingset, though. Why should it be opt-in, not the default?
> 
> Because the default behavior has caused the XFS performance hit.

That means that with your proposal you cannot use cgroup memory
protection for workloads that run on xfs.

(And if I remember the bug report correctly, this wasn't just xfs. It
also caused metadata caches on other filesystems to get trashed. xfs
was just more pronounced because it does sync inode flushing from the
shrinker, adding write stalls to the mix of metadata cache misses.)

What I'm proposing is an implementation that protects hot page cache
without causing excessive shrinker pressure and rotations.
Arnd Bergmann Feb. 13, 2020, 4:52 p.m. UTC | #19
On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> > On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > What's the situation with highmem on ARM?
> >
> > Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
> > needed it, and I was ranting at some people for repeating all the
> > mistakes Intel did.
> >
> > But arm64 doesn't need it, and while 32-bit arm is obviously still
> > selling, I think that in many ways the switch-over to 64-bit has been
> > quicker on ARM than it was on x86. Partly because it happened later
> > (so all the 64-bit teething pains were dealt with), but largely
> > because everybody ended up actively discouraging 32-bit on the Android
> > side.
> >
> > There were a couple of unfortunate early 32-bit arm server attempts,
> > but they were - predictably - complete garbage and nobody bought them.
> > They don't exist any more.

I'd generally agree with that; the systems with more than 4GB tended to
be high-end systems predating the Cortex-A53/A57 that quickly got
replaced once there were actual 64-bit parts, this would include axm5516
(replaced with x86-64 cores after sale to Intel), hip04 (replaced
with arm64), or ecx-2000 (Calxeda bankruptcy).

The one 32-bit SoC that I can think of that can actually drive lots of
RAM and is still actively marketed is TI Keystone-2/AM5K2.
The embedded AM5K2 is listed as supporting up to 8GB of RAM, but
the version in the HPE ProLiant m800 server could take up to 32GB (!).

I added Santosh and Kishon to Cc, they can probably comment on how
long they think users will upgrade kernels on these. I suspect these
devices can live for a very long time in things like wireless base stations,
but it's possible that they all run on old kernels anyway by now (and are
not worried about y2038).

> > So at least my gut feel is that the arm people don't have any big
> > reason to push for maintaining HIGHMEM support either.
> >
> > But I'm adding a couple of arm people and the arm list just in case
> > they have some input.
> >
> > [ Obvious background for newly added people: we're talking about
> > making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> > to run with lots of memory on a 32-bit kernel, you're doing legacy
> > stuff and can use a legacy kernel ]
>
> Well, the recent 32-bit ARM systems generally have more than 1G
> of memory, so make use of highmem as a rule.  You're probably
> talking about crippling support for any 32-bit ARM system produced
> in the last 8 to 10 years.

What I'm observing in the newly added board support is that memory
configurations are actually going down, driven by component cost.
512MB is really cheap (~$4) these days with a single 256Mx16 DDR3
chip or two 128Mx16. Going beyond 1GB is where things get expensive
with either 4+ chips or LPDDR3/LPDDR4 memory.

For designs with 1GB, we're probably better off just using
CONFIG_VMSPLIT_3G_OPT (without LPAE) anyway, completely
avoiding highmem. That is particularly true on systems with a custom
kernel configuration.

2GB machines are less common, but are definitely important, e.g.
MT6580 based Android phones and some industrial embedded machines
that will live a long time. I've recently seen reports of odd behavior
with CONFIG_VMSPLIT_2G plus CONFIG_HIGHMEM and a 7:1
ratio of lowmem to highmem that apparently causes OOM despite lots
of lowmem being free. I suspect a lot of those workloads would still be
better off with a CONFIG_VMSPLIT_2G_OPT (1.75 GB user, 2GB
linear map). That config unfortunately has a few problems, too:
- nobody has implemented it
- it won't work with LPAE and therefore cannot support hardware
  that relies on high physical addresses for RAM or MMIO
  (those could run CONFIG_VMSPLIT_2G at the cost of wasting
  12.5% of RAM).
- any workload that requires the full 3GB of virtual address space won't
  work at all. This might be e.g. MAP_FIXED users, or build servers
  linking large binaries.
It will take a while to find out what kinds of workloads suffer the most
from a different vmsplit and what can be done to address that, but we
could start by changing the kernel defconfig and distro builds to see
who complains ;-)

I think 32-bit ARM machines with 3GB or more are getting very rare,
but some still exist:
- The Armada XP development board had a DIMM slot that could take
  large memory (possibly up to 8GB with LPAE). This never shipped as
  a commercial product, but distro build servers sometimes still run on
  this, or on the old Calxeda or Keystone server systems.
- a few early i.MX6 boards (e.g. HummingBoard) came with 4GB of
  RAM, though none of these seem to be available any more.
- High-end phones from 2013/2014 had 3GB LPDDR3 before getting
  obsoleted by 64-bit phones. Presumably none of these ever ran
  Linux-4.x or newer.
- My main laptop is a RK3288 based Chromebook with 4GB that just
  got updated to linux-4.19 by Google. Official updates apparently
  stop this summer, but it could easily run Debian later on.
- Some people run 32-bit kernels on a 64-bit Raspberry Pi 4 or on
  arm64 KVM with lots of RAM. These should probably all
  migrate to 64-bit kernels with compat user space anyway.
In theory these could also run on a VMSPLIT_4G_4G-like setup,
but I don't think anyone wants to go there. Deprecating highmem
definitely impacts any such users significantly, though staying on
an LTS kernel may be an option if there are only few of them.

           Arnd
Yafang Shao Feb. 14, 2020, 2:02 a.m. UTC | #20
On Thu, Feb 13, 2020 at 9:46 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 13, 2020 at 09:47:29AM +0800, Yafang Shao wrote:
> > On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote:
> > > > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > Another variant of this problem was recently observed, where the
> > > > > kernel violates cgroups' memory.low protection settings and reclaims
> > > > > page cache way beyond the configured thresholds. It was followed by a
> > > > > proposal of a modified form of the reverted commit above, that
> > > > > implements memory.low-sensitive shrinker skipping over populated
> > > > > inodes on the LRU [1]. However, this proposal continues to run the
> > > > > risk of attracting disproportionate reclaim pressure to a pool of
> > > > > still-used inodes,
> > > >
> > > > Hi Johannes,
> > > >
> > > > If you really think that is a risk, what about the below additional patch
> > > > to fix this risk?
> > > >
> > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > index 80dddbc..61862d9 100644
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode,
> > > >                 goto out;
> > > >
> > > >         cgroup_size = mem_cgroup_size(memcg);
> > > > -       if (inode->i_data.nrpages + protection >= cgroup_size)
> > > > +       if (inode->i_data.nrpages)
> > > >                 reclaimable = false;
> > > >
> > > >  out:
> > > >
> > > > With this additional patch, we skip all inodes in this memcg until all
> > > > its page cache pages are reclaimed.
> > >
> > > Well that's something we've tried and had to revert because it caused
> > > issues in slab reclaim. See the History part of my changelog.
> >
> > You misunderstood it.
> > The reverted patch skips all inodes in the system, while this patch
> > only works when you turn on memcg.{min, low} protection.
> > IOW, it is not the default behavior; it only takes effect when you want
> > it to, and only affects your targeted memcg rather than the whole system.
>
> I understand perfectly well.
>
> Keeping unreclaimable inodes on the shrinker LRU causes the shrinker
> to build up excessive pressure on all VFS objects. This is a
> bug. Making it cgroup-specific doesn't make it less of a bug, it just
> means you only hit the bug when you use cgroup memory protection.
>

What I mean to fix is really a cgroup-specific issue, but this issue
may be different from what you're meaning to fix.
(I will explain it below.)
Considering the excessive pressure the protected inodes may put on
the shrinker, the protected page cache pages put much more
pressure on the reclaimer. If you mean to remove the protected inodes
from the shrinker LRU, why not remove the protected page cache pages
from the page cache LRU as well? Well, what I really mean is, that
is how the memcg protection works.

> > > > > while not addressing the more generic reclaim
> > > > > inversion problem outside of a very specific cgroup application.
> > > > >
> > > >
> > > > But I have a different understanding.  This method works like a
> > > > knob. If you really care about your workingset (data), you should
> > > > turn it on (i.e. by using memcg protection to protect them), while
> > > > if you don't care about your workingset (data) then you'd better
> > > > turn it off. That would be more flexible.  Regarding your case in the
> > > > commit log, why not protect your linux git tree with memcg
> > > > protection?
> > >
> > > I can't imagine a scenario where I *wouldn't* care about my
> > > workingset, though. Why should it be opt-in, not the default?
> >
> > Because the default behavior has caused the XFS performance hit.
>
> That means that with your proposal you cannot use cgroup memory
> protection for workloads that run on xfs.
>

Well, if you set memory.min to protect your workload inside a specific
memcg, it means that you already know this memory can't be used by
workloads outside the memcg. That means the performance of the
workload outside the memcg may not be as good as before. Then you should
adjust your SLA, migrate the protected memcgs to another host, or
just kill the protected memcg.
IOW, the result is *expected*.

> (And if I remember the bug report correctly, this wasn't just xfs. It
> also caused metadata caches on other filesystems to get trashed. xfs
> was just more pronounced because it does sync inode flushing from the
> shrinker, adding write stalls to the mix of metadata cache misses.)
>
> What I'm proposing is an implementation that protects hot page cache
> without causing excessive shrinker pressure and rotations.

That's the difference between your issue and my issue.
You're trying to fix the issue around the hot page cache, but what I
want to fix may be cold page cache, and it really is an issue specific
to memcg protection.
Because the memcg protection can protect all page cache pages, even if
the page cache pages are cold and the inodes are cold (in the tail of
the list lru) as well.  That is one of the reasons why memcg protection
exists. (I know you are the author of memcg protection, but I have to
clarify what memcg protection is.)

Regarding your issue around the hot page cache pages, I have another
question. If the page cache pages are hot, why are the inodes of these
page cache pages cold (in the tail of the list lru)?  Per my
understanding, if the page cache pages are hot, their inodes
should be hot (not in the tail of the list lru) as well. That should
be how the LRU works.

Well, that doesn't mean I object to your patch.  What I really want to
clarify is that our issues are really different.

Thanks
Yafang
kernel test robot Feb. 14, 2020, 4:53 p.m. UTC | #21
Hi Johannes,

I love your patch! Yet something to improve:

[auto build test ERROR on vfs/for-next]
[also build test ERROR on linux/master linus/master v5.6-rc1 next-20200213]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-LRU/20200214-083756
base:   https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-next
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   fs/dax.c: In function 'grab_mapping_entry':
>> fs/dax.c:556:28: error: 'struct address_space' has no member named 'inode'
      inode_pages_clear(mapping->inode);
                               ^~
   fs/dax.c:558:26: error: 'struct address_space' has no member named 'inode'
      inode_pages_set(mapping->inode);
                             ^~

vim +556 fs/dax.c

   446	
   447	/*
   448	 * Find page cache entry at given index. If it is a DAX entry, return it
   449	 * with the entry locked. If the page cache doesn't contain an entry at
   450	 * that index, add a locked empty entry.
   451	 *
   452	 * When requesting an entry with size DAX_PMD, grab_mapping_entry() will
   453	 * either return that locked entry or will return VM_FAULT_FALLBACK.
   454	 * This will happen if there are any PTE entries within the PMD range
   455	 * that we are requesting.
   456	 *
   457	 * We always favor PTE entries over PMD entries. There isn't a flow where we
   458	 * evict PTE entries in order to 'upgrade' them to a PMD entry.  A PMD
   459	 * insertion will fail if it finds any PTE entries already in the tree, and a
   460	 * PTE insertion will cause an existing PMD entry to be unmapped and
   461	 * downgraded to PTE entries.  This happens for both PMD zero pages as
   462	 * well as PMD empty entries.
   463	 *
   464	 * The exception to this downgrade path is for PMD entries that have
   465	 * real storage backing them.  We will leave these real PMD entries in
   466	 * the tree, and PTE writes will simply dirty the entire PMD entry.
   467	 *
   468	 * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
   469	 * persistent memory the benefit is doubtful. We can add that later if we can
   470	 * show it helps.
   471	 *
   472	 * On error, this function does not return an ERR_PTR.  Instead it returns
   473	 * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
   474	 * overlap with xarray value entries.
   475	 */
   476	static void *grab_mapping_entry(struct xa_state *xas,
   477			struct address_space *mapping, unsigned int order)
   478	{
   479		unsigned long index = xas->xa_index;
   480		bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
   481		int populated;
   482		void *entry;
   483	
   484	retry:
   485		populated = 0;
   486		xas_lock_irq(xas);
   487		entry = get_unlocked_entry(xas, order);
   488	
   489		if (entry) {
   490			if (dax_is_conflict(entry))
   491				goto fallback;
   492			if (!xa_is_value(entry)) {
   493				xas_set_err(xas, EIO);
   494				goto out_unlock;
   495			}
   496	
   497			if (order == 0) {
   498				if (dax_is_pmd_entry(entry) &&
   499				    (dax_is_zero_entry(entry) ||
   500				     dax_is_empty_entry(entry))) {
   501					pmd_downgrade = true;
   502				}
   503			}
   504		}
   505	
   506		if (pmd_downgrade) {
   507			/*
   508			 * Make sure 'entry' remains valid while we drop
   509			 * the i_pages lock.
   510			 */
   511			dax_lock_entry(xas, entry);
   512	
   513			/*
   514			 * Besides huge zero pages the only other thing that gets
   515			 * downgraded are empty entries which don't need to be
   516			 * unmapped.
   517			 */
   518			if (dax_is_zero_entry(entry)) {
   519				xas_unlock_irq(xas);
   520				unmap_mapping_pages(mapping,
   521						xas->xa_index & ~PG_PMD_COLOUR,
   522						PG_PMD_NR, false);
   523				xas_reset(xas);
   524				xas_lock_irq(xas);
   525			}
   526	
   527			dax_disassociate_entry(entry, mapping, false);
   528			xas_store(xas, NULL);	/* undo the PMD join */
   529			dax_wake_entry(xas, entry, true);
   530			mapping->nrexceptional--;
   531			if (mapping_empty(mapping))
   532				populated = -1;
   533			entry = NULL;
   534			xas_set(xas, index);
   535		}
   536	
   537		if (entry) {
   538			dax_lock_entry(xas, entry);
   539		} else {
   540			unsigned long flags = DAX_EMPTY;
   541	
   542			if (order > 0)
   543				flags |= DAX_PMD;
   544			entry = dax_make_entry(pfn_to_pfn_t(0), flags);
   545			dax_lock_entry(xas, entry);
   546			if (xas_error(xas))
   547				goto out_unlock;
   548			if (mapping_empty(mapping))
   549				populated++;
   550			mapping->nrexceptional++;
   551		}
   552	
   553	out_unlock:
   554		xas_unlock_irq(xas);
   555		if (populated == -1)
 > 556			inode_pages_clear(mapping->inode);
   557		else if (populated == 1)
   558			inode_pages_set(mapping->inode);
   559		if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
   560			goto retry;
   561		if (xas->xa_node == XA_ERROR(-ENOMEM))
   562			return xa_mk_internal(VM_FAULT_OOM);
   563		if (xas_error(xas))
   564			return xa_mk_internal(VM_FAULT_SIGBUS);
   565		return entry;
   566	fallback:
   567		xas_unlock_irq(xas);
   568		return xa_mk_internal(VM_FAULT_FALLBACK);
   569	}
   570	

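For what it's worth, the error comes from struct address_space referring back to its owning inode through the host member (there is no inode member), so an untested sketch of the fixup to fold in would be:

```diff
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -553,6 +553,6 @@ static void *grab_mapping_entry(struct xa_state *xas,
 out_unlock:
 	xas_unlock_irq(xas);
 	if (populated == -1)
-		inode_pages_clear(mapping->inode);
+		inode_pages_clear(mapping->host);
 	else if (populated == 1)
-		inode_pages_set(mapping->inode);
+		inode_pages_set(mapping->host);
```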
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot Feb. 14, 2020, 9:30 p.m. UTC | #22
Hi Johannes,

I love your patch! Perhaps something to improve:

[auto build test WARNING on vfs/for-next]
[also build test WARNING on linux/master linus/master v5.6-rc1 next-20200214]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-LRU/20200214-083756
base:   https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-next

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>


coccinelle warnings: (new ones prefixed by >>)

>> mm/truncate.c:41:9-10: WARNING: return of 0/1 in function '__clear_shadow_entry' with return type bool

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Geert Uytterhoeven Feb. 15, 2020, 11:25 a.m. UTC | #23
Hi Arnd,

On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
> > > So at least my gut feel is that the arm people don't have any big
> > > reason to push for maintaining HIGHMEM support either.
> > >
> > > But I'm adding a couple of arm people and the arm list just in case
> > > they have some input.
> > >
> > > [ Obvious background for newly added people: we're talking about
> > > making CONFIG_HIGHMEM a deprecated feature and saying that if you want
> > > to run with lots of memory on a 32-bit kernel, you're doing legacy
> > > stuff and can use a legacy kernel ]
> >
> > Well, the recent 32-bit ARM systems generally have more than 1G
> > of memory, so make use of highmem as a rule.  You're probably
> > talking about crippling support for any 32-bit ARM system produced
> > in the last 8 to 10 years.
>
> What I'm observing in the newly added board support is that memory
> configurations are actually going down, driven by component cost.
> 512MB is really cheap (~$4) these days with a single 256Mx16 DDR3
> chip or two 128Mx16. Going beyond 1GB is where things get expensive
> with either 4+ chips or LPDDR3/LPDDR4 memory.
>
> For designs with 1GB, we're probably better off just using
> CONFIG_VMSPLIT_3G_OPT (without LPAE) anyway, completely
> avoiding highmem. That is particularly true on systems with a custom
> kernel configuration.
>
> 2GB machines are less common, but are definitely important, e.g.
> MT6580 based Android phones and some industrial embedded machines
> that will live a long time. I've recently seen reports of odd behavior
> with CONFIG_VMSPLIT_2G plus CONFIG_HIGHMEM and a 7:1
> ratio of lowmem to highmem that apparently causes OOM despite lots
> of lowmem being free. I suspect a lot of those workloads would still be
> better off with a CONFIG_VMSPLIT_2G_OPT (1.75 GB user, 2GB
> linear map). That config unfortunately has a few problems, too:
> - nobody has implemented it
> - it won't work with LPAE and therefore cannot support hardware
>   that relies on high physical addresses for RAM or MMIO
>   (those could run CONFIG_VMSPLIT_2G at the cost of wasting
>   12.5% of RAM).
> - any workload that requires the full 3GB of virtual address space won't
>   work at all. This might be e.g. MAP_FIXED users, or build servers
>   linking large binaries.
> It will take a while to find out what kinds of workloads suffer the most
> from a different vmsplit and what can be done to address that, but we
> could start by changing the kernel defconfig and distro builds to see
> who complains ;-)
>
> I think 32-bit ARM machines with 3GB or more are getting very rare,
> but some still exist:
> - The Armada XP development board had a DIMM slot that could take
>   large memory (possibly up to 8GB with LPAE). This never shipped as
>   a commercial product, but distro build servers sometimes still run on
>   this, or on the old Calxeda or Keystone server systems.
> - a few early i.MX6 boards (e.g. HummingBoard) came with 4GB of
>   RAM, though none of these seem to be available any more.
> - High-end phones from 2013/2014 had 3GB LPDDR3 before getting
>   obsoleted by 64-bit phones. Presumably none of these ever ran
>   Linux-4.x or newer.
> - My main laptop is a RK3288 based Chromebook with 4GB that just
>   got updated to linux-4.19 by Google. Official updates apparently
>   stop this summer, but it could easily run Debian later on.
> - Some people run 32-bit kernels on a 64-bit Raspberry Pi 4 or on
>   arm64 KVM with lots of RAM. These should probably all
>   migrate to 64-bit kernels with compat user space anyway.
> In theory these could also run on a VMSPLIT_4G_4G-like setup,
> but I don't think anyone wants to go there. Deprecating highmem
> definitely impacts any such users significantly, though staying on
> an LTS kernel may be an option if there are only few of them.

The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
for 1 GiB or 2 GiB configurations) in two parts, one below and one above
the 32-bit physical limit.

Gr{oetje,eeting}s,

                        Geert
Arnd Bergmann Feb. 15, 2020, 4:59 p.m. UTC | #24
On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
>
> The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> the 32-bit physical limit.

Good to know. I think there are several other chips that have dual-channel
DDR3 and thus /can/ support this configuration, but this rarely happens.
Are you aware of commercial products that use a 4GB configuration, aside from
the reference board?

For TI AM54x, there is apparently a variant of the DragonBox Pyra with 4GB,
which is said to be shipping in the near future, see
https://en.wikipedia.org/wiki/DragonBox_Pyra

     Arnd
Geert Uytterhoeven Feb. 16, 2020, 9:44 a.m. UTC | #25
Hi Arnd,

On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> <geert@linux-m68k.org> wrote:
> > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > <linux@armlinux.org.uk> wrote:
> >
> > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> > for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> > the 32-bit physical limit.
>
> Good to know. I think there are several other chips that have dual-channel
> DDR3 and thus /can/ support this configuration, but this rarely happens.
> Are you aware of commercial products that use a 4GB configuration, aside from
> the reference board?

Unfortunately I don't know.
Chris Paterson might know.

Gr{oetje,eeting}s,

                        Geert
Chris Paterson Feb. 16, 2020, 7:54 p.m. UTC | #26
Hello Arnd, Geert,

> From: Geert Uytterhoeven <geert@linux-m68k.org>
> Sent: 16 February 2020 09:45
> To: Arnd Bergmann <arnd@arndb.de>
> 
> Hi Arnd,
> 
> On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> > <geert@linux-m68k.org> wrote:
> > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de>
> wrote:
> > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > > <linux@armlinux.org.uk> wrote:
> > >
> > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> > > for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> > > the 32-bit physical limit.

Yep. One example is r8a7743-iwg20m.dtsi.

> >
> > Good to know. I think there are several other chips that have dual-channel
> > DDR3 and thus /can/ support this configuration, but this rarely happens.
> > Are you aware of commercial products that use a 4GB configuration, aside
> from
> > the reference board?

iWave Systems make a range of SOM modules using the RZ/G1 SoCs.
I believe there are options for some of these to use 4 GB, although 1 or 2 GB is used in the boards we've upstreamed support for.

There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1, but I'm not sure of the details.

Kind regards, Chris

> 
> Unfortunately I don't know.
> Chris Paterson might know.
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-
> m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
Arnd Bergmann Feb. 16, 2020, 8:38 p.m. UTC | #27
On Sun, Feb 16, 2020 at 8:54 PM Chris Paterson
<Chris.Paterson2@renesas.com> wrote:
>
> Hello Arnd, Geert,
>
> > From: Geert Uytterhoeven <geert@linux-m68k.org>
> > Sent: 16 February 2020 09:45
> > To: Arnd Bergmann <arnd@arndb.de>
> >
> > Hi Arnd,
> >
> > On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> > > <geert@linux-m68k.org> wrote:
> > > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de>
> > wrote:
> > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > > > <linux@armlinux.org.uk> wrote:
> > > >
> > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even
> > > > for 1 GiB or 2 GiB configurations) in two parts, one below and one above
> > > > the 32-bit physical limit.
>
> Yep. One example is r8a7743-iwg20m.dtsi.

This one has 2x512MB, with half above the 4GiB limit. This means it needs
LPAE to address high physical addresses (which is fine), but it does not need
highmem if one uses an appropriate CONFIG_VMSPLIT_* option.

> > > Good to know. I think there are several other chips that have dual-channel
> > > DDR3 and thus /can/ support this configuration, but this rarely happens.
> > > Are you aware of commercial products that use a 4GB configuration, aside
> > from
> > > the reference board?
>
> iWave Systems make a range of SOM modules using the RZ/G1 SoCs.
> I believe there are options for some of these to use 4 GB, although 1 or 2 GB is
> used in the boards we've upstreamed support for.
>
> There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1,
> but I'm not sure of the details.

Both iWave and Emtrion only seem to list boards with 2GB or less on their
websites today (with up to 15 year availability). My guess is that they had
the same problem as everyone else in finding the right memory chips in
the required quantities and/or long-term availability. iWave lists "By default
1GB DDR3 and 4GB eMMC only supported. Contact iWave for memory
expansion support." on some boards, but that doesn't mean they ever
shipped a 4GB configuration.

       Arnd
Pavel Machek Feb. 17, 2020, 1:31 p.m. UTC | #28
Hi!

> > Testing this will be a challenge, but the issue was real - a 7GB
> > highmem machine isn't crazy and I expect the inode has become larger
> > since those days.
> 
> Hmm. I would say that in the intervening years a 7GB highmem machine has
> indeed become crazy.
> 
> It used to be something we kind of supported.
> 
> But we really should consider HIGHMEM to be something that is on the
> deprecation list. In this day and age, there is no excuse for running
> a 32-bit kernel with lots of physical memory.
> 
> And if you really want to do that, and have some legacy hardware with
> a legacy use case, maybe you should be using a legacy kernel.
> 
> I'd personally be perfectly happy to start removing HIGHMEM support again.

A 7GB HIGHMEM machine may be unusual, but AFAICT HIGHMEM is needed for
configurations like a 1GB x86 machine, and definitely for a 3GB x86
machine.

32-bit machines with 1.5 to 4GB of RAM are still pretty common, and
dropping HIGHMEM support will limit them to 0.8GB of RAM and probably
make them unusable even for simple web browsing. I have two such
machines here, please don't break them :-).

Best regards,
									Pavel
Chris Paterson Feb. 20, 2020, 2:35 p.m. UTC | #29
Hello,

> From: Arnd Bergmann <arnd@arndb.de>
> Sent: 16 February 2020 20:38
> 
> On Sun, Feb 16, 2020 at 8:54 PM Chris Paterson
> <Chris.Paterson2@renesas.com> wrote:
> >
> > Hello Arnd, Geert,
> >
> > > From: Geert Uytterhoeven <geert@linux-m68k.org>
> > > Sent: 16 February 2020 09:45
> > > To: Arnd Bergmann <arnd@arndb.de>
> > >
> > > Hi Arnd,
> > >
> > > On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven
> > > > <geert@linux-m68k.org> wrote:
> > > > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de>
> > > wrote:
> > > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > > > > <linux@armlinux.org.uk> wrote:
> > > > >
> > > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split
> (even
> > > > > for 1 GiB or 2 GiB configurations) in two parts, one below and one
> above
> > > > > the 32-bit physical limit.
> >
> > Yep. One example is r8a7743-iwg20m.dtsi.
> 
> This one has 2x512MB, with half above the 4GiB limit. This means it needs
> LPAE to address high physical addresses (which is fine), but it does not need
> highmem if one uses an appropriate CONFIG_VMSPLIT_* option.
> 
> > > > Good to know. I think there are several other chips that have dual-
> channel
> > > > DDR3 and thus /can/ support this configuration, but this rarely happens.
> > > > Are you aware of commercial products that use a 4GB configuration,
> aside
> > > from
> > > > the reference board?
> >
> > iWave Systems make a range of SOM modules using the RZ/G1 SoCs.
> > I believe there are options for some of these to use 4 GB, although 1 or 2
> GB is
> > used in the boards we've upstreamed support for.
> >
> > There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1,
> > but I'm not sure of the details.
> 
> Both iWave and Emtrion only seem to list boards with 2GB or less on their
> websites today (with up to 15 year availability). My guess is that they had
> the same problem as everyone else in finding the right memory chips in
> the required quantities and/or long-term availability. iWave lists "By default
> 1GB DDR3 and 4GB eMMC only supported. Contact iWave for memory
> expansion support." on some boards, but that doesn't mean they ever
> shipped a 4GB configuration.

I probably should have been clearer before - I actually have a couple of iWave RZ/G1M SOM modules with 4GB DDR on them.
However I've never booted them nor do I know what the memory mapping is.

Kind regards, Chris

> 
>        Arnd
Santosh Shilimkar Feb. 26, 2020, 6:04 p.m. UTC | #30
On 2/13/20 8:52 AM, Arnd Bergmann wrote:
> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
>>
>> On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote:
>>> On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>>
>>>> What's the situation with highmem on ARM?
>>>
>>> Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever
>>> needed it, and I was ranting at some people for repeating all the
>>> mistakes Intel did.
>>>
>>> But arm64 doesn't need it, and while 32-bit arm is obviously still
>>> selling, I think that in many ways the switch-over to 64-bit has been
>>> quicker on ARM than it was on x86. Partly because it happened later
>>> (so all the 64-bit teething pains were dealt with), but largely
>>> because everybody ended up actively discouraging 32-bit on the Android
>>> side.
>>>
>>> There were a couple of unfortunate early 32-bit arm server attempts,
>>> but they were - predictably - complete garbage and nobody bought them.
>>> They don't exist any more.
> 
> I'd generally agree with that, the systems with more than 4GB tended to
> be high-end systems predating the Cortex-A53/A57 that quickly got
> replaced once there were actual 64-bit parts, this would include axm5516
> (replaced with x86-64 cores after sale to Intel), hip04 (replaced
> with arm64), or ecx-2000 (Calxeda bankruptcy).
> 
> The one 32-bit SoC that I can think of that can actually drive lots of
> RAM and is still actively marketed is TI Keystone-2/AM5K2.
> The embedded AM5K2 is listed supporting up to 8GB of RAM, but
> the version in the HPE ProLiant m800 server could take up to 32GB (!).
> 
> I added Santosh and Kishon to Cc, they can probably comment on how
> long they think users will upgrade kernels on these. I suspect these
> devices can live for a very long time in things like wireless base stations,
> but it's possible that they all run on old kernels anyway by now (and are
> not worried about y2038).
> 
>>> So at least my gut feel is that the arm people don't have any big
>>> reason to push for maintaining HIGHMEM support either.
>>>
>>> But I'm adding a couple of arm people and the arm list just in case
>>> they have some input.
The Keystone generations of SoCs have been used in different areas, and
they will be in use for a long time unless the market says otherwise.

Apart from just the split of lowmem and highmem, one of the peculiar
things about the Keystone family of SoCs is that the DDR is addressable
from two address ranges. The lower address range is actually the
non-cached range, and the higher range is the cacheable one.

So apart from LPAE, another change I needed to make back then was to
boot Linux from the lower range with bootstrap MMU tables and then
re-create the MMU tables early (mmu init), using the higher range for
the entire memory so that the L3 cache can be used.

AFAIK, all the SoCs derived from the Keystone architecture inherently
use this feature.

Regards,
Santosh
Arnd Bergmann Feb. 26, 2020, 9:01 p.m. UTC | #31
On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote:
>
> On 2/13/20 8:52 AM, Arnd Bergmann wrote:
> > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
>
> The Keystone generations of SOCs have been used in different areas and
> they will be used for long unless says otherwise.
>
> Apart from just split of lowmem and highmem, one of the peculiar thing
> with Keystome family of SOCs is the DDR is addressable from two
> addressing ranges. The lowmem address range is actually non-cached
> range and the higher range is the cacheable.

I'm aware of Keystone's special physical memory layout, but it is
actually irrelevant to the highmem discussion here, which is only about
the way we map all or part of the available physical memory into the
4GB of virtual address space.

The far more important question is how much memory any users
(in particular the subset that are going to update their kernels
several years from now) actually have installed. Keystone-II is
one of the rare 32-bit chips with fairly wide memory interfaces,
having two 72-bit (with ECC) channels rather than the usual one or
two channels of 32-bit DDR3. This means a relatively cheap 4GB
configuration using eight 256Mx16 chips is possible, or even an 8GB
one using sixteen or eighteen 512Mx8.

Do you have an estimate on how common these 4GB and 8GB
configurations are in practice outside of the TI evaluation
board?

       Arnd
Santosh Shilimkar Feb. 26, 2020, 9:11 p.m. UTC | #32
+Nishant, Tero

On 2/26/20 1:01 PM, Arnd Bergmann wrote:
> On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote:
>>
>> On 2/13/20 8:52 AM, Arnd Bergmann wrote:
>>> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
>>> <linux@armlinux.org.uk> wrote:
>>
>> The Keystone generations of SOCs have been used in different areas and
>> they will be used for long unless says otherwise.
>>
>> Apart from just split of lowmem and highmem, one of the peculiar thing
>> with Keystome family of SOCs is the DDR is addressable from two
>> addressing ranges. The lowmem address range is actually non-cached
>> range and the higher range is the cacheable.
> 
> I'm aware of Keystone's special physical memory layout, but for the
> discussion here, this is actually irrelevant for the discussion about
> highmem here, which is only about the way we map all or part of the
> available physical memory into the 4GB of virtual address space.
> 
> The far more important question is how much memory any users
> (in particular the subset that are going to update their kernels
> several years from now) actually have installed. Keystone-II is
> one of the rare 32-bit chips with fairly wide memory interfaces,
> having two 72-bit (with ECC) channels rather than the usual one
>   or two channels of 32-bit DDR3. This means a relatively cheap
> 4GB configuration using eight 256Mx16 chips is possible, or
> even a 8GB using sixteen or eighteen 512Mx8.
> 
> Do you have an estimate on how common these 4GB and 8GB
> configurations are in practice outside of the TI evaluation
> board?
> 
From my TI days, many K2 customers were going to install more than
2GB of memory. I don't remember 8GB, but 4GB was the dominant
configuration AFAIR. I'll let Nishant/Tero elaborate on the latest.

regards,
Santosh
Nishanth Menon March 6, 2020, 8:34 p.m. UTC | #33
On 13:11-20200226, santosh.shilimkar@oracle.com wrote:
> +Nishant, Tero
> 
> On 2/26/20 1:01 PM, Arnd Bergmann wrote:
> > On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote:
> > > 
> > > On 2/13/20 8:52 AM, Arnd Bergmann wrote:
> > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
> > > > <linux@armlinux.org.uk> wrote:
> > > 
> > > The Keystone generations of SOCs have been used in different areas and
> > > they will be used for long unless says otherwise.
> > > 
> > > Apart from just split of lowmem and highmem, one of the peculiar thing
> > > with Keystome family of SOCs is the DDR is addressable from two
> > > addressing ranges. The lowmem address range is actually non-cached
> > > range and the higher range is the cacheable.
> > 
> > I'm aware of Keystone's special physical memory layout, but for the
> > discussion here, this is actually irrelevant for the discussion about
> > highmem here, which is only about the way we map all or part of the
> > available physical memory into the 4GB of virtual address space.
> > 
> > The far more important question is how much memory any users
> > (in particular the subset that are going to update their kernels
> > several years from now) actually have installed. Keystone-II is
> > one of the rare 32-bit chips with fairly wide memory interfaces,
> > having two 72-bit (with ECC) channels rather than the usual one
> >   or two channels of 32-bit DDR3. This means a relatively cheap
> > 4GB configuration using eight 256Mx16 chips is possible, or
> > even a 8GB using sixteen or eighteen 512Mx8.
> > 
> > Do you have an estimate on how common these 4GB and 8GB
> > configurations are in practice outside of the TI evaluation
> > board?
> > 
> From my TI memories, many K2 customers were going to install
> more than 2G memory. Don't remember 8G, but 4G was the dominant
> one afair. Will let Nishant/Tero elaborate latest on this.
> 

Thanks for the heads-up, it took a little while to dig up the current
situation:

~a few 1000s still relevant, spread between 4GB and 8GB (confirmed that
both are present, relevant and in use).

I wish we could sunset these, but unfortunately I am told (and agree)
that we shouldn't just leave products (and these are long-term
products stuck in critical parts of our world) hanging in the air, and
migrations to newer kernels do still take place periodically (that's
the best I can say in a public forum, at least).
Santosh Shilimkar March 7, 2020, 1:08 a.m. UTC | #34
On 3/6/20 12:34 PM, Nishanth Menon wrote:
> On 13:11-20200226, santosh.shilimkar@oracle.com wrote:
>> +Nishant, Tero
>>
>> On 2/26/20 1:01 PM, Arnd Bergmann wrote:
>>> On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote:
>>>>
>>>> On 2/13/20 8:52 AM, Arnd Bergmann wrote:
>>>>> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin
>>>>> <linux@armlinux.org.uk> wrote:
>>>>
>>>> The Keystone generations of SOCs have been used in different areas and
>>>> they will be used for long unless says otherwise.
>>>>
>>>> Apart from just split of lowmem and highmem, one of the peculiar thing
>>>> with Keystome family of SOCs is the DDR is addressable from two
>>>> addressing ranges. The lowmem address range is actually non-cached
>>>> range and the higher range is the cacheable.
>>>
>>> I'm aware of Keystone's special physical memory layout, but for the
>>> discussion here, this is actually irrelevant for the discussion about
>>> highmem here, which is only about the way we map all or part of the
>>> available physical memory into the 4GB of virtual address space.
>>>
>>> The far more important question is how much memory any users
>>> (in particular the subset that are going to update their kernels
>>> several years from now) actually have installed. Keystone-II is
>>> one of the rare 32-bit chips with fairly wide memory interfaces,
>>> having two 72-bit (with ECC) channels rather than the usual one
>>>    or two channels of 32-bit DDR3. This means a relatively cheap
>>> 4GB configuration using eight 256Mx16 chips is possible, or
>>> even a 8GB using sixteen or eighteen 512Mx8.
>>>
>>> Do you have an estimate on how common these 4GB and 8GB
>>> configurations are in practice outside of the TI evaluation
>>> board?
>>>
>>  From my TI memories, many K2 customers were going to install
>> more than 2G memory. Don't remember 8G, but 4G was the dominant
>> one afair. Will let Nishant/Tero elaborate latest on this.
>>
> 
> Thanks for the headsup, it took a little to dig up the current
> situation:
> 
> ~few 1000s still relevant spread between 4G and 8G (confirmed that both
> are present, relevant and in use).
> 
> I wish we could sunset, but unfortunately, I am told(and agree)
> that we should'nt just leave products (and these are long term
> products stuck in critical parts in our world) hanging in the air, and
> migrations to newer kernel do still take place periodically (the best
> I can talk in public forum at least).
> 
Thanks Nishant !!
Arnd Bergmann March 8, 2020, 10:58 a.m. UTC | #35
On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote:
> On 13:11-20200226, santosh.shilimkar@oracle.com wrote:

>
> ~few 1000s still relevant spread between 4G and 8G (confirmed that both
> are present, relevant and in use).
>
> I wish we could sunset, but unfortunately, I am told(and agree)
> that we should'nt just leave products (and these are long term
> products stuck in critical parts in our world) hanging in the air, and
> migrations to newer kernel do still take place periodically (the best
> I can talk in public forum at least).

Thank you for the clear answer!

I agree we should certainly not break any such use cases, and for the
8GB case there is not really a good replacement (using zram/zswap
instead of highmem could work for some new workloads, but would be a
rather risky change for an upgrade on already deployed systems).

I hope it's ok to ask the same question every few years until you are
reasonably sure that the users are ready to stop upgrading kernels
beyond the following LTS kernel version. We can also do the same
thing for the other 32-bit platforms that exceed the maximum amount
of lowmem, and document which ones are known.

In the meantime, it would seem useful to increase the amount of
lowmem that can be used by default, using a combination of some
of the changes mentioned earlier

- add a VMSPLIT_2G_OPT config option for non-LPAE ARM kernels
  to handle the common i.MX6 case with 2GB of RAM without highmem

- make VMSPLIT_2G_OPT (without LPAE) or VMSPLIT_2G (with
  LPAE) the default in most ARM defconfig files as well as distros,
  and disable highmem where possible, to see what breaks.

- extend zswap to use all the available high memory for swap space
  when highmem is disabled.

- revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32)
  to see if it can be done, and what the overhead is. This is probably
  more work than the others combined, but also the most promising
  as it allows the most user address space and physical ram to be used.

       Arnd
Russell King (Oracle) March 8, 2020, 2:19 p.m. UTC | #36
On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote:
> > On 13:11-20200226, santosh.shilimkar@oracle.com wrote:
> 
> >
> > ~few 1000s still relevant spread between 4G and 8G (confirmed that both
> > are present, relevant and in use).
> >
> > I wish we could sunset, but unfortunately, I am told(and agree)
> > that we should'nt just leave products (and these are long term
> > products stuck in critical parts in our world) hanging in the air, and
> > migrations to newer kernel do still take place periodically (the best
> > I can talk in public forum at least).
> 
> Thank you for the clear answer!
> 
> I agree we should certainly not break any such use cases, and for the
> 8GB case there is not really a good replacement (using zram/zswap
> instead of highmem could work for some new workloads, but would be a
> rather risky change for an upgrade on already deployed systems).
> 
> I hope it's ok to ask the same question every few years until you are
> reasonably sure that the users are ready to stop upgrading kernels
> beyond the following LTS kernel version. We can also do the same
> thing for the other 32-bit platforms that exceed the maximum amount
> of lowmem, and document which ones are known.
> 
> In the meantime, it would seem useful to increase the amount of
> lowmem that can be used by default, using a combination of some
> of the changes mentioned earlier
> 
> - add a VMSPLIT_2G_OPT config option for non-LPAE ARM kernels
>   to handle the common i.MX6 case with 2GB of RAM without highmem
> 
> - make VMSPLIT_2G_OPT (without LPAE) or VMSPLIT_2G (with
>   LPAE) the default in most ARM defconfig files as well as distros,
>   and disable highmem where possible, to see what breaks.
> 
> - extend zswap to use all the available high memory for swap space
>   when highmem is disabled.

I don't think that's a good idea.  Running debian stable kernels on my
8GB laptop, I have problems when leaving firefox running long before
even half the 16GB of swap gets consumed - the entire machine slows
down very quickly when it starts swapping more than about 2 or so GB.
It seems the kernel has become quite bad at selecting pages to
evict.

It gets to the point where any git operation has a battle to fight
for RAM, despite not touching anything else other than git.

The behaviour is much like firefox is locking memory into core, but
that doesn't seem to be what's actually going on.  I've never really
got to the bottom of it though.

This is with 64-bit kernel and userspace.

So, I'd suggest that trading off RAM available through highmem for VM
space available through zswap is likely a bad idea if you have a
workload that requires 4GB of RAM on a 32-bit machine.
Arnd Bergmann March 9, 2020, 1:33 p.m. UTC | #37
On Sun, Mar 8, 2020 at 3:20 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
> On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> > On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote:
> > > On 13:11-20200226, santosh.shilimkar@oracle.com wrote:
>
> > - extend zswap to use all the available high memory for swap space
> >   when highmem is disabled.
>
> I don't think that's a good idea.  Running debian stable kernels on my
> 8GB laptop, I have problems when leaving firefox running long before
> even half the 16GB of swap gets consumed - the entire machine slows
> down very quickly when it starts swapping more than about 2 or so GB.
> It seems either the kernel has become quite bad at selecting pages to
> evict.
>
> It gets to the point where any git operation has a battle to fight
> for RAM, despite not touching anything else other than git.
>
> The behaviour is much like firefox is locking memory into core, but
> that doesn't seem to be what's actually going on.  I've never really
> got to the bottom of it though.
>
> This is with 64-bit kernel and userspace.

I agree there is something going wrong on your machine, but I
don't really see how that relates to my suggestion.

> So, I'd suggest that trading off RAM available through highmem for VM
> space available through zswap is likely a bad idea if you have a
> workload that requires 4GB of RAM on a 32-bit machine.

Aside from every workload being different, I was thinking of
these general observations:

- If we are looking at a future without highmem, then it's better to use
  the extra memory for something than not using it. zswap seems like
  a reasonable use.

- A lot of embedded systems are configured to have no swap at all,
  which can be for good or not-so-good reasons. Having some
  swap space available often improves things, even if it comes
  out of RAM.

- A particularly important case to optimize for is 2GB of RAM with
  LPAE enabled. With CONFIG_VMSPLIT_2G and highmem, this
  leads to the paradox -ENOMEM when 256MB of highmem are
  full while plenty of lowmem is available. With highmem disabled,
  you avoid that at the cost of losing 12% of RAM.

- With 4GB+ of RAM and CONFIG_VMSPLIT_2G or
  CONFIG_VMSPLIT_3G, using gigabytes of RAM for swap
  space would usually be worse than highmem, but once
  we have VMSPLIT_4G_4G, it's the same situation as above
  with 6% of RAM used for zswap instead of highmem.

       Arnd
Russell King (Oracle) March 9, 2020, 2:04 p.m. UTC | #38
On Mon, Mar 09, 2020 at 02:33:26PM +0100, Arnd Bergmann wrote:
> On Sun, Mar 8, 2020 at 3:20 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> > > On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote:
> > > > On 13:11-20200226, santosh.shilimkar@oracle.com wrote:
> >
> > > - extend zswap to use all the available high memory for swap space
> > >   when highmem is disabled.
> >
> > I don't think that's a good idea.  Running debian stable kernels on my
> > 8GB laptop, I have problems when leaving firefox running long before
> > even half the 16GB of swap gets consumed - the entire machine slows
> > down very quickly when it starts swapping more than about 2 or so GB.
> > It seems the kernel has become quite bad at selecting pages to
> > evict.
> >
> > It gets to the point where any git operation has a battle to fight
> > for RAM, despite not touching anything else other than git.
> >
> > The behaviour is much like firefox is locking memory into core, but
> > that doesn't seem to be what's actually going on.  I've never really
> > got to the bottom of it though.
> >
> > This is with 64-bit kernel and userspace.
> 
> I agree there is something going wrong on your machine, but I
> don't really see how that relates to my suggestion.

You are suggesting that a 4GB machine use 2GB of RAM for normal
usage via an optimised virtual space layout, and 2GB of RAM for
swap using ZSWAP, rather than having all 4GB of RAM available via
the present highmem / lowmem system.

I'm saying that is quite risky given the behaviours I'm seeing,
where modern Linux seems to be struggling to work out what pages
it should be evicting.

I think Linux performs way better when it doesn't have to use
swap.

> > So, I'd suggest that trading off RAM available through highmem for VM
> > space available through zswap is likely a bad idea if you have a
> > workload that requires 4GB of RAM on a 32-bit machine.
> 
> Aside from every workload being different, I was thinking of
> these general observations:
> 
> - If we are looking at a future without highmem, then it's better to use
>   the extra memory for something than not using it. zswap seems like
>   a reasonable use.

I think that's still a very open question, one which hasn't been
answered yet.

> - A lot of embedded systems are configured to have no swap at all,
>   which can be for good or not-so-good reasons. Having some
>   swap space available often improves things, even if it comes
>   out of RAM.

How do you come up with that assertion?  What is the evidence behind
it?

This is kind of the crux of my point in the previous email: Linux
with swap performs way worse for me - if I had 16GB of RAM in my
laptop, I bet it would perform better than my current 8GB with a
16GB swap file - where, when the swap file gets around 8GB full,
the system as a whole starts to struggle.

That's about a 50/50 split of VM space between RAM and swap.

Proposing 2GB+ of swap with 2GB of RAM would potentially place these
machines into the same situation as my laptop, and if it also results
in a loss of performance, we could end up with regression reports.

> - A particularly important case to optimize for is 2GB of RAM with
>   LPAE enabled. With CONFIG_VMSPLIT_2G and highmem, this
>   leads to the paradox -ENOMEM when 256MB of highmem are
>   full while plenty of lowmem is available. With highmem disabled,
>   you avoid that at the cost of losing 12% of RAM.

What happened to requests for memory from highmem being able to be
sourced from lowmem if highmem wasn't available?  That used to be
standard kernel behaviour.

> - With 4GB+ of RAM and CONFIG_VMSPLIT_2G or
>   CONFIG_VMSPLIT_3G, using gigabytes of RAM for swap
>   space would usually be worse than highmem, but once
>   we have VMSPLIT_4G_4G, it's the same situation as above
>   with 6% of RAM used for zswap instead of highmem.

I think the chances of that happening are very slim - I doubt there
is the will to invest the energy amongst what is left of the 32-bit
ARM community.
Arnd Bergmann March 9, 2020, 3:04 p.m. UTC | #39
On Mon, Mar 9, 2020 at 3:05 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Mon, Mar 09, 2020 at 02:33:26PM +0100, Arnd Bergmann wrote:
> > On Sun, Mar 8, 2020 at 3:20 PM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> > > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> > > > On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote:
> > > > > On 13:11-20200226, santosh.shilimkar@oracle.com wrote:
> > >
> > > > - extend zswap to use all the available high memory for swap space
> > > >   when highmem is disabled.
> > >
> > > I don't think that's a good idea.  Running debian stable kernels on my
> > > 8GB laptop, I have problems when leaving firefox running long before
> > > even half the 16GB of swap gets consumed - the entire machine slows
> > > down very quickly when it starts swapping more than about 2 or so GB.
> > > It seems the kernel has become quite bad at selecting pages to
> > > evict.
> > >
> > > It gets to the point where any git operation has a battle to fight
> > > for RAM, despite not touching anything else other than git.
> > >
> > > The behaviour is much like firefox is locking memory into core, but
> > > that doesn't seem to be what's actually going on.  I've never really
> > > got to the bottom of it though.
> > >
> > > This is with 64-bit kernel and userspace.
> >
> > I agree there is something going wrong on your machine, but I
> > don't really see how that relates to my suggestion.
>
> You are suggesting for a 4GB machine to use 2GB of RAM for normal
> usage via an optimised virtual space layout, and 2GB of RAM for
> swap using ZSWAP, rather than having 4GB of RAM available via the
> present highmem / lowmem system.

No, that would not be good. The cases where I would hope
to get improvements out of zswap are:

- 1GB of RAM with VMSPLIT_3G, when VMSPLIT_3G_OPT
  and VMSPLIT_2G don't work because of user address space
  requirements

- 2GB of RAM with VMSPLIT_2G

- 4GB of RAM if we add VMSPLIT_4G_4G
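The lowmem/highmem arithmetic behind these cases can be sketched with a
toy model (the ~256MB vmalloc reserve and the clean zone boundary are
simplifying assumptions I'm making here; real kernels differ in detail):

```python
MB = 1024 * 1024
VMALLOC_RESERVE = 256 * MB  # assumed reserve at the top of kernel space

def split_ram(user_va_gb, ram_bytes):
    """Return (lowmem, highmem) for a 32-bit kernel with the given
    user/kernel VM split, ignoring holes and per-SoC quirks."""
    kernel_va = 4 * 1024 * MB - user_va_gb * 1024 * MB
    lowmem_limit = kernel_va - VMALLOC_RESERVE  # directly mappable RAM
    lowmem = min(ram_bytes, lowmem_limit)
    highmem = ram_bytes - lowmem
    return lowmem, highmem

# 2GB of RAM with VMSPLIT_2G: ~1.75GB lowmem plus 256MB highmem --
# the configuration where the paradox -ENOMEM was reported.
low, high = split_ram(2, 2048 * MB)
```

With VMSPLIT_3G the same model gives ~768MB of lowmem, which is why 1GB
machines only need VMSPLIT_3G_OPT or VMSPLIT_2G to avoid highmem.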

> > - A lot of embedded systems are configured to have no swap at all,
> >   which can be for good or not-so-good reasons. Having some
> >   swap space available often improves things, even if it comes
> >   out of RAM.
>
> How do you come up with that assertion?  What is the evidence behind
> it?

The idea of zswap is that it's faster to compress/uncompress
data than to actually access a slow disk. So if you already have
a swap space, it gives you another performance tier in between
direct-mapped pages and the slow swap.

If you don't have a physical swap space, then reserving a little
bit of RAM for compressed swap means that rarely used pages
take up less space and you end up with more RAM available
for the workload you want to run.
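As a rough illustration of that tradeoff (a sketch only: the ~2.7:1
compression ratio below is an assumption I picked so that a 2.25GB pool
holds about 6GB of data, matching the figures mentioned in this thread;
real ratios depend entirely on the workload):

```python
def effective_memory_mb(ram_mb, zswap_pool_mb, compression_ratio):
    """Rough data capacity when part of RAM backs a compressed swap pool:
    pages in the pool occupy 1/compression_ratio of their size in RAM."""
    direct = ram_mb - zswap_pool_mb                 # uncompressed, hot data
    compressed = zswap_pool_mb * compression_ratio  # data held in the pool
    return direct + compressed

# A 4GB machine handing 2.25GB (2304MB) to zswap at an assumed 8:3 ratio:
# ~1.75GB direct plus ~6GB compressed.
capacity = effective_memory_mb(4096, 2304, 8 / 3)
```

Whether that total is worth more than 4GB of plain highmem-backed RAM is
exactly the open question: compressed pages are far slower to access than
direct-mapped ones, so the gain only materialises if the swapped data is
genuinely cold.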

> This is kind of the crux of my point in the previous email: Linux
> with swap performs way worse for me - if I had 16GB of RAM in my
> laptop, I bet it would perform better than my current 8GB with a
> 16GB swap file - where, when the swap file gets around 8GB full,
> the system as a whole starts to struggle.
>
> That's about a 50/50 split of VM space between RAM and swap.

As I said above, I agree that very few workloads would behave
better using 1.75GB RAM plus 2.25GB zswap (storing
maybe 6GB of data) compared to highmem. To deal with 4GB
systems, we probably need either highmem or VMSPLIT_4G_4G.

> > - A particularly important case to optimize for is 2GB of RAM with
> >   LPAE enabled. With CONFIG_VMSPLIT_2G and highmem, this
> >   leads to the paradox -ENOMEM when 256MB of highmem are
> >   full while plenty of lowmem is available. With highmem disabled,
> >   you avoid that at the cost of losing 12% of RAM.
>
> What happened to requests for memory from highmem being able to be
> sourced from lowmem if highmem wasn't available?  That used to be
> standard kernel behaviour.

AFAICT this is how it's supposed to work, but for some reason it
doesn't always. I don't know the details, but have heard of recent
complaints about it. I don't think it's the actual get_free_pages
failing, but rather some heuristic looking at the number of free pages.

> > - With 4GB+ of RAM and CONFIG_VMSPLIT_2G or
> >   CONFIG_VMSPLIT_3G, using gigabytes of RAM for swap
> >   space would usually be worse than highmem, but once
> >   we have VMSPLIT_4G_4G, it's the same situation as above
> >   with 6% of RAM used for zswap instead of highmem.
>
> I think the chances of that happening are very slim - I doubt there
> is the will to invest the energy amongst what is left of the 32-bit
> ARM community.

Right. But I think it makes sense to discuss what it would take
to do it anyway, and to see who would be interested in funding or
implementing VMSPLIT_4G_4G. Whether it happens or not comes
down to another tradeoff: Without it, we have to keep highmem
around for a much longer time to support systems with 4GB of
RAM along with systems that need both 2GB of physical RAM
and 3GB of user address space, while adding VMSPLIT_4G_4G
soon means we can probably kill off highmem after everybody
with 8GB of RAM or more has stopped upgrading kernels.
Unlike the 2GB case, this is something we can realistically
plan for.

What is going to be much harder, I fear, is to find someone to
implement it on MIPS32, which seems to be a decade ahead
of 32-bit ARM in its decline, and also has a small number of users
with 4GB or more; architecturally it seems harder or even
impossible to implement, depending on the type of MIPS
MMU.

        Arnd
Catalin Marinas March 9, 2020, 3:59 p.m. UTC | #40
On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32)
>   to see if it can be done, and what the overhead is. This is probably
>   more work than the others combined, but also the most promising
>   as it allows the most user address space and physical ram to be used.

A rough outline of such support (and likely to miss some corner cases):

1. Kernel runs with its own ASID and non-global page tables.

2. Trampoline code on exception entry/exit to handle the TTBR0 switching
   between user and kernel.

3. uaccess routines need to be reworked to pin the user pages in memory
   (get_user_pages()) and access them via the kernel address space.

Point 3 is probably the ugliest and it would introduce a noticeable
slowdown in certain syscalls.
Russell King (Oracle) March 9, 2020, 4:09 p.m. UTC | #41
On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote:
> On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32)
> >   to see if it can be done, and what the overhead is. This is probably
> >   more work than the others combined, but also the most promising
> >   as it allows the most user address space and physical ram to be used.
> 
> A rough outline of such support (and likely to miss some corner cases):
> 
> 1. Kernel runs with its own ASID and non-global page tables.
> 
> 2. Trampoline code on exception entry/exit to handle the TTBR0 switching
>    between user and kernel.
> 
> 3. uaccess routines need to be reworked to pin the user pages in memory
>    (get_user_pages()) and access them via the kernel address space.
> 
> Point 3 is probably the ugliest and it would introduce a noticeable
> slowdown in certain syscalls.

We also need to consider that it has implications for the single-kernel
support; a kernel doing this kind of switching would likely be horrid
for a kernel supporting v6+ with VIPT aliasing caches.  Would we be
adding a new red line between kernels supporting VIPT-aliasing caches
(present in earlier v6 implementations) and kernels using this system?
Catalin Marinas March 9, 2020, 4:57 p.m. UTC | #42
On Mon, Mar 09, 2020 at 04:09:19PM +0000, Russell King wrote:
> On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote:
> > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> > > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32)
> > >   to see if it can be done, and what the overhead is. This is probably
> > >   more work than the others combined, but also the most promising
> > >   as it allows the most user address space and physical ram to be used.
> > 
> > A rough outline of such support (and likely to miss some corner cases):
> > 
> > 1. Kernel runs with its own ASID and non-global page tables.
> > 
> > 2. Trampoline code on exception entry/exit to handle the TTBR0 switching
> >    between user and kernel.
> > 
> > 3. uaccess routines need to be reworked to pin the user pages in memory
> >    (get_user_pages()) and access them via the kernel address space.
> > 
> > Point 3 is probably the ugliest and it would introduce a noticeable
> > slowdown in certain syscalls.
> 
> We also need to consider that it has implications for the single-kernel
> support; a kernel doing this kind of switching would likely be horrid
> for a kernel supporting v6+ with VIPT aliasing caches.

Good point. I think with VIPT aliasing cache uaccess would have to flush
the cache before/after access, depending on direction.

> Would we be adding a new red line between kernels supporting
> VIPT-aliasing caches (present in earlier v6 implementations) and
> kernels using this system?

get_user_pages() should handle the flush_dcache_page() call and the
latter would deal with the aliases. But this adds heavily to the cost of
the uaccess.

Maybe some trick with temporarily locking the user page table and
copying the user pmd into a dedicated kernel pmd, then accessing the
user via this location. The fault handler would need to figure out the
real user address and I'm not sure how we deal with the page table lock
(or mmap_sem).

An alternative to the above would be to have all uaccess routines in a
trampoline which restores the user pgd but with only a couple of pmds
for mapping the kernel address temporarily. This would avoid the issue
of concurrent modification of the user page tables.

Anyway, I don't think any of the above looks better than highmem.
Arnd Bergmann March 9, 2020, 7:46 p.m. UTC | #43
On Mon, Mar 9, 2020 at 5:09 PM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
> On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote:
> > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> > > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32)
> > >   to see if it can be done, and what the overhead is. This is probably
> > >   more work than the others combined, but also the most promising
> > >   as it allows the most user address space and physical ram to be used.
> >
> > A rough outline of such support (and likely to miss some corner cases):
> >
> > 1. Kernel runs with its own ASID and non-global page tables.
> >
> > 2. Trampoline code on exception entry/exit to handle the TTBR0 switching
> >    between user and kernel.
> >
> > 3. uaccess routines need to be reworked to pin the user pages in memory
> >    (get_user_pages()) and access them via the kernel address space.
> >
> > Point 3 is probably the ugliest and it would introduce a noticeable
> > slowdown in certain syscalls.

There are probably a number of ways to do the basic design. The idea
I had (again, probably missing more corner cases than either of you
two who actually understand the details of the mmu):

- Assuming we have LPAE, run the kernel vmlinux and modules inside
  the vmalloc space, in the top 256MB or 512MB on TTBR1

- Map all the physical RAM (up to 3.75GB) into a reserved ASID
  with TTBR0

- Flip TTBR0 on kernel entry/exit, and again during user access.

This is probably more work to implement than your idea, but
I would hope this has a lower overhead on most microarchitectures
as it doesn't require pinning the pages. Depending on the
microarchitecture, I'd hope the overhead would be comparable
to that of ARM64_SW_TTBR0_PAN.

> We also need to consider that it has implications for the single-kernel
> support; a kernel doing this kind of switching would likely be horrid
> for a kernel supporting v6+ with VIPT aliasing caches.  Would we be
> adding a new red line between kernels supporting VIPT-aliasing caches
> (present in earlier v6 implementations) and kernels using this system?

I would initially do it for LPAE only, given that this is already an
incompatible config option. I don't think there are any v6 machines with
more than 1GB of RAM (the maximum for AST2500), and the only distro
that ships a v6+ multiplatform kernel is Raspbian, which in turn needs
a separate LPAE kernel for the large-memory machines anyway.

Only doing it for LPAE would still cover the vast majority of systems that
actually shipped with more than 2GB. There are a couple of exceptions,
i.e. early  Cubox i4x4, the Calxeda Highbank developer system and the
Novena Laptop, which I would guess have a limited life expectancy
(before users stop updating kernels) no longer than the 8GB
Keystone-2.

Based on that, I would hope that the ARMv7 distros can keep shipping
the two kernel images they already ship:

- The non-LPAE kernel modified to VMSPLIT_2G_OPT, not using highmem
  on anything up to 2GB, but still supporting the handful of remaining
  Cortex-A9s with 4GB using highmem until they are completely obsolete.

- The LPAE kernel modified to use a newly added VMSPLIT_4G_4G,
   with details to be worked out.

Most new systems tend to be based on Cortex-A7 with no more than 2GB,
so those could run either configuration well.  If we find the 2GB of user
address space too limiting for the non-LPAE config, or I missed some
important pre-LPAE systems with 4GB that need to be supported for longer
than other highmem systems, that can probably be added later.

    Arnd
Michal Hocko March 10, 2020, 9:16 a.m. UTC | #44
I am worried this went quite tangent to the original patch under
discussion here, but let me clarify at least one point.

On Mon 09-03-20 16:04:54, Arnd Bergmann wrote:
> On Mon, Mar 9, 2020 at 3:05 PM Russell King - ARM Linux admin
[...]
> > What happened to requests for memory from highmem being able to be
> > sourced from lowmem if highmem wasn't available?  That used to be
> > standard kernel behaviour.
> 
> AFAICT this is how it's supposed to work, but for some reason it
> doesn't always. I don't know the details, but have heard of recent
> complaints about it. I don't think it's the actual get_free_pages
> failing, but rather some heuristic looking at the number of free pages.

This is indeed the case. There are low memory reserves which are not
allowed for requests which can be satisfied from higher zones. This is
the case for many many years. Just have a look at lowmem_reserve and
their usage in __zone_watermark_ok. The layout of the reserves can be
configured by /proc/sys/vm/lowmem_reserve_ratio.
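The mechanism can be modelled in a few lines (a simplification of
setup_per_zone_lowmem_reserve() and __zone_watermark_ok(); the zone
sizes and the ratio value 32 below are illustrative, not measured from
any real system):

```python
def lowmem_reserves(managed_pages, ratios):
    """reserve[i][j]: pages zone i withholds from allocations that could
    also have been satisfied from any zone up to index j (higher index =
    'higher' zone, e.g. NORMAL below HIGHMEM)."""
    n = len(managed_pages)
    reserve = [[0] * n for _ in range(n)]
    for i in range(n):
        upper = 0
        for j in range(i + 1, n):
            upper += managed_pages[j]
            reserve[i][j] = upper // ratios[i]
    return reserve

def watermark_ok(free, watermark, reserve_ij):
    # An allocation allowed up to zone j must leave zone i's watermark
    # plus zone i's reserve against zone j untouched.
    return free > watermark + reserve_ij

# A lowmem zone of 448K pages protecting itself against allocations that
# could have gone to a 512K-page highmem zone, with ratio 32:
res = lowmem_reserves([448 * 1024, 512 * 1024], [32, 0])
```

So an allocation that could have used highmem starts failing in lowmem
well before lowmem is actually empty, which is the "plenty of lowmem
available" behaviour observed above.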

HTH
Catalin Marinas March 11, 2020, 2:29 p.m. UTC | #45
On Mon, Mar 09, 2020 at 08:46:18PM +0100, Arnd Bergmann wrote:
> On Mon, Mar 9, 2020 at 5:09 PM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> > On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote:
> > > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote:
> > > > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32)
> > > >   to see if it can be done, and what the overhead is. This is probably
> > > >   more work than the others combined, but also the most promising
> > > >   as it allows the most user address space and physical ram to be used.
> > >
> > > A rough outline of such support (and likely to miss some corner cases):
> > >
> > > 1. Kernel runs with its own ASID and non-global page tables.
> > >
> > > 2. Trampoline code on exception entry/exit to handle the TTBR0 switching
> > >    between user and kernel.
> > >
> > > 3. uaccess routines need to be reworked to pin the user pages in memory
> > >    (get_user_pages()) and access them via the kernel address space.
> > >
> > > Point 3 is probably the ugliest and it would introduce a noticeable
> > > slowdown in certain syscalls.
> 
> There are probably a number of ways to do the basic design. The idea
> I had (again, probably missing more corner cases than either of you
> two that actually understand the details of the mmu):
> 
> - Assuming we have LPAE, run the kernel vmlinux and modules inside
>   the vmalloc space, in the top 256MB or 512MB on TTBR1
> 
> - Map all the physical RAM (up to 3.75GB) into a reserved ASID
>   with TTBR0
> 
> - Flip TTBR0 on kernel entry/exit, and again during user access.
> 
> This is probably more work to implement than your idea, but
> I would hope this has a lower overhead on most microarchitectures
> as it doesn't require pinning the pages. Depending on the
> microarchitecture, I'd hope the overhead would be comparable
> to that of ARM64_SW_TTBR0_PAN.

This still doesn't solve the copy_{from,to}_user() case where both
address spaces need to be available during copy. So you either pin the
user pages in memory and access them via the kernel mapping or you
temporarily map (kmap?) the destination/source kernel address. The
overhead I'd expect to be significantly greater than ARM64_SW_TTBR0_PAN
for the uaccess routines. For user entry/exit, your suggestion is
probably comparable with SW PAN.
Arnd Bergmann March 11, 2020, 4:59 p.m. UTC | #46
On Wed, Mar 11, 2020 at 3:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote:

> > - Flip TTBR0 on kernel entry/exit, and again during user access.
> >
> > This is probably more work to implement than your idea, but
> > I would hope this has a lower overhead on most microarchitectures
> > as it doesn't require pinning the pages. Depending on the
> > microarchitecture, I'd hope the overhead would be comparable
> > to that of ARM64_SW_TTBR0_PAN.
>
> This still doesn't solve the copy_{from,to}_user() case where both
> address spaces need to be available during copy. So you either pin the
> user pages in memory and access them via the kernel mapping or you
> temporarily map (kmap?) the destination/source kernel address. The
> overhead I'd expect to be significantly greater than ARM64_SW_TTBR0_PAN
> for the uaccess routines. For user entry/exit, your suggestion is
> probably comparable with SW PAN.

Good point, that is indeed a larger overhead. The simplest implementation
I had in mind would use the code from arch/arm/lib/copy_from_user.S and
flip ttbr0 between each ldm and stm (up to 32 bytes), but I have no idea
of the cost of storing to ttbr0, so this might be even more expensive. Do you
have an estimate of how long writing to TTBR0_64 takes on Cortex-A7
and A15, respectively?

Another way might be to use a temporary buffer that is already
mapped, and add a memcpy() through L1-cache to reduce the number
of ttbr0 changes. The buffer would probably have to be on the stack,
which limits the size, but for large copies get_user_pages()+memcpy()
may end up being faster anyway.

      Arnd
Catalin Marinas March 11, 2020, 5:26 p.m. UTC | #47
On Wed, Mar 11, 2020 at 05:59:53PM +0100, Arnd Bergmann wrote:
> On Wed, Mar 11, 2020 at 3:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > > - Flip TTBR0 on kernel entry/exit, and again during user access.
> > >
> > > This is probably more work to implement than your idea, but
> > > I would hope this has a lower overhead on most microarchitectures
> > > as it doesn't require pinning the pages. Depending on the
> > > microarchitecture, I'd hope the overhead would be comparable
> > > to that of ARM64_SW_TTBR0_PAN.
> >
> > This still doesn't solve the copy_{from,to}_user() case where both
> > address spaces need to be available during copy. So you either pin the
> > user pages in memory and access them via the kernel mapping or you
> > temporarily map (kmap?) the destination/source kernel address. The
> > overhead I'd expect to be significantly greater than ARM64_SW_TTBR0_PAN
> > for the uaccess routines. For user entry/exit, your suggestion is
> > probably comparable with SW PAN.
> 
> Good point, that is indeed a larger overhead. The simplest implementation
> I had in mind would use the code from arch/arm/lib/copy_from_user.S and
> flip ttbr0 between each ldm and stm (up to 32 bytes), but I have no idea
> of the cost of storing to ttbr0, so this might be even more expensive. Do you
> have an estimate of how long writing to TTBR0_64 takes on Cortex-A7
> and A15, respectively?

I don't have numbers but it's usually not cheap since you need an ISB to
synchronise the context after TTBR0 update (basically flushing the
pipeline).

> > Another way might be to use a temporary buffer that is already
> mapped, and add a memcpy() through L1-cache to reduce the number
> of ttbr0 changes. The buffer would probably have to be on the stack,
> which limits the size, but for large copies get_user_pages()+memcpy()
> may end up being faster anyway.

IIRC, the x86 attempt from Ingo some years ago was using
get_user_pages() for uaccess. Depending on the size of the buffer, this
may be faster than copying twice.
Arnd Bergmann March 11, 2020, 10:21 p.m. UTC | #48
On Wed, Mar 11, 2020 at 6:26 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Mar 11, 2020 at 05:59:53PM +0100, Arnd Bergmann wrote:
> > On Wed, Mar 11, 2020 at 3:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > Do you have an estimate of how long writing to TTBR0_64 takes on Cortex-A7
> > and A15, respectively?
>
> I don't have numbers but it's usually not cheap since you need an ISB to
> synchronise the context after TTBR0 update (basically flushing the
> pipeline).

Ok.

> > Another way might be to use a temporary buffer that is already
> > mapped, and add a memcpy() through L1-cache to reduce the number
> > of ttbr0 changes. The buffer would probably have to be on the stack,
> > which limits the size, but for large copies get_user_pages()+memcpy()
> > may end up being faster anyway.
>
> IIRC, the x86 attempt from Ingo some years ago was using
> get_user_pages() for uaccess. Depending on the size of the buffer, this
> may be faster than copying twice.

I guess the tradeoffs for that were rather different, as x86 back
then had no ASIDs, so changing the page tables required a full
TLB flush.

         Arnd
Johannes Weiner May 12, 2020, 9:29 p.m. UTC | #49
On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> The VFS inode shrinker is currently allowed to reclaim inodes with
> populated page cache. As a result it can drop gigabytes of hot and
> active page cache on the floor without consulting the VM (recorded as
> "inodesteal" events in /proc/vmstat).

I'm sending a rebased version of this patch.

We've been running with this change in the Facebook fleet since
February with no ill side effects observed.

However, I just spent several hours chasing a mysterious reclaim
problem that turned out to be this bug again on an unpatched system.

In the scenario I was debugging, the problem wasn't that we were
losing cache, but that we were losing the non-resident information for
previously evicted cache.

I understood the file set enough to know it was thrashing like crazy,
but it didn't register as refaults to the kernel. Without detecting
the refaults, reclaim wouldn't start swapping to relieve the
struggling cache (plenty of cold anon memory around). It also meant
the IO delays of those refaults didn't contribute to memory pressure
in psi, which made userspace blind to the situation as well.

The first aspect means we can get stuck in pathological thrashing, the
second means userspace OOM detection breaks and we can leave servers
(or Android devices, for that matter) hopelessly livelocked.

New patch attached below. I hope we can get this fixed in 5.8, it's
really quite a big hole in our cache management strategy.

---
From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 11 Feb 2020 12:55:07 -0500
Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU

The VFS inode shrinker is currently allowed to reclaim cold inodes
with populated page cache. This behavior goes back to CONFIG_HIGHMEM
setups, which required the ability to drop page cache in large highmem
zones to free up struct inodes in comparatively tiny lowmem zones.

However, it has significant side effects that are hard to justify on
systems without highmem:

- It can drop gigabytes of hot and active page cache on the floor
without consulting the VM (recorded as "inodesteal" events in
/proc/vmstat). Such an "aging inversion" between unreferenced inodes
holding hot cache easily happens in practice: for example, a git tree
whose objects are accessed frequently but no open file descriptors are
maintained throughout.

- It can also prematurely drop non-resident info stored inside the
inodes, which can let massive cache thrashing go unnoticed. This in
turn means we're making the wrong decisions in page reclaim and can
get stuck in pathological thrashing. The thrashing also doesn't show
up as memory pressure in psi, causing failure in the userspace OOM
detection chain, which can leave a system (or Android device e.g.)
stranded in a livelock.

History

As this keeps causing problems for people, there have been several
attempts to address this.

One recent attempt was to make the inode shrinker simply skip over
inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
inodes with many attached pages").

However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
don't reclaim inodes with many attached pages"") because it caused
severe reclaim performance problems: Inodes that sit on the shrinker
LRU are attracting reclaim pressure away from the page cache and
toward the VFS. If we then permanently exempt sizable portions of this
pool from actually getting reclaimed when looked at, this pressure
accumulates as deferred shrinker work (a mechanism for *temporarily*
unreclaimable objects) until it causes mayhem in the VFS cache pools.

In the bug quoted in 69056ee6a8a3 in particular, the excessive
pressure drove the XFS shrinker into dirty objects, where it caused
synchronous, IO-bound stalls, even as there was plenty of clean page
cache that should have been reclaimed instead.

Another variant of this problem was recently observed, where the
kernel violates cgroups' memory.low protection settings and reclaims
page cache way beyond the configured thresholds. It was followed by a
proposal of a modified form of the reverted commit above, that
implements memory.low-sensitive shrinker skipping over populated
inodes on the LRU [1]. However, this proposal continues to run the
risk of attracting disproportionate reclaim pressure to a pool of
still-used inodes, while not addressing the more generic reclaim
inversion problem outside of a very specific cgroup application.

[1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/

Solution

This patch fixes the aging inversion described above on
!CONFIG_HIGHMEM systems, without reintroducing the problems associated
with excessive shrinker LRU rotations, by keeping populated inodes off
the shrinker LRUs entirely.

Currently, inodes are kept off the shrinker LRU as long as they have
an elevated i_count, indicating an active user. Unfortunately, the
page cache cannot simply hold an i_count reference, because unlink()
*should* result in the inode being dropped and its cache invalidated.

Instead, this patch makes iput_final() consult the state of the page
cache and punt the LRU linking to the VM if the inode is still
populated; the VM in turn checks the inode state when it depopulates
the page cache, and adds the inode to the LRU if necessary.

This is not unlike what we do for dirty inodes, which are moved off
the LRU permanently until writeback completion puts them back on (iff
still unused). We can reuse the same code -- inode_add_lru() -- here.

This is also not unlike page reclaim, where the lower VM layer has to
negotiate state with the higher VFS layer. Follow existing precedent
and handle the inversion as much as possible on the VM side:

- introduce an I_PAGES flag that the VM maintains under the i_lock, so
  that any inode code holding that lock can check the page cache state
  without having to lock and inspect the struct address_space

- introduce inode_pages_set() and inode_pages_clear() to maintain the
  inode LRU state from the VM side, then update all cache mutators to
  use them when populating the first cache entry or clearing the last

With this, the concept of "inodesteal" - where the inode shrinker
drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM
is in charge of the cache, the shrinker in charge of struct inode.

Footnotes

- For debuggability, add vmstat counters that track the number of
  times a new cache entry pulls a previously unused inode off the LRU
  (pginoderescue), as well as how many times existing cache deferred
  an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
  for backwards compatibility, but they'll just show 0 now.

- Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
  cache. Not doing so has always been a bit strange, but since most
  people drop cache and metadata cache together, the inode shrinker
  would have taken care of them before - no more, so do it VM-side.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c                |   2 +-
 fs/dax.c                      |  14 +++++
 fs/drop_caches.c              |   2 +-
 fs/inode.c                    | 112 +++++++++++++++++++++++++++++++---
 fs/internal.h                 |   2 +-
 include/linux/fs.h            |  17 ++++++
 include/linux/pagemap.h       |   2 +-
 include/linux/vm_event_item.h |   3 +-
 mm/filemap.c                  |  37 ++++++++---
 mm/huge_memory.c              |   3 +-
 mm/truncate.c                 |  34 ++++++++---
 mm/vmscan.c                   |   6 +-
 mm/vmstat.c                   |   4 +-
 mm/workingset.c               |   4 ++
 14 files changed, 208 insertions(+), 34 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4c26dcd9dba3..e0f73647c848 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..89d1245a96ce 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas,
 {
 	unsigned long index = xas->xa_index;
 	bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
+	int populated;
 	void *entry;
 
 retry:
+	populated = 0;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);
 
@@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, true);
 		mapping->nrexceptional--;
+		if (mapping_empty(mapping))
+			populated = -1;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas,
 		dax_lock_entry(xas, entry);
 		if (xas_error(xas))
 			goto out_unlock;
+		if (mapping_empty(mapping))
+			populated++;
 		mapping->nrexceptional++;
 	}
 
 out_unlock:
 	xas_unlock_irq(xas);
+	if (populated == -1)
+		inode_pages_clear(mapping->host);
+	else if (populated == 1)
+		inode_pages_set(mapping->host);
 	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
 		goto retry;
 	if (xas->xa_node == XA_ERROR(-ENOMEM))
@@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
+	bool empty = false;
 	int ret = 0;
 	void *entry;
 
@@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	dax_disassociate_entry(entry, mapping, trunc);
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+	empty = mapping_empty(mapping);
 	ret = 1;
 out:
 	put_unlocked_entry(&xas, entry);
 	xas_unlock_irq(&xas);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	return ret;
 }
 
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index f00fcc4a4f72..20e845ff43f2 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		 * we need to reschedule to avoid softlockups.
 		 */
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0 && !need_resched())) {
+		    (mapping_empty(inode->i_mapping) && !need_resched())) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
diff --git a/fs/inode.c b/fs/inode.c
index 9fcec07a9d7c..9da373244db7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -431,26 +431,101 @@ static void inode_lru_list_add(struct inode *inode)
 		inode->i_state |= I_REFERENCED;
 }
 
+static void inode_lru_list_del(struct inode *inode)
+{
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+		this_cpu_dec(nr_unused);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
  * Needs inode->i_lock held.
  */
-void inode_add_lru(struct inode *inode)
+bool inode_add_lru(struct inode *inode)
 {
-	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
-				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
-		inode_lru_list_add(inode);
+	if (inode->i_state &
+	    (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
+		return false;
+	if (atomic_read(&inode->i_count))
+		return false;
+	if (!(inode->i_sb->s_flags & SB_ACTIVE))
+		return false;
+	inode_lru_list_add(inode);
+	return true;
 }
 
+/*
+ * Usually, inodes become reclaimable when they are no longer
+ * referenced and their page cache has been reclaimed. The following
+ * API allows the VM to communicate cache population state to the VFS.
+ *
+ * However, on CONFIG_HIGHMEM we can't wait for the page cache to go
+ * away: cache pages allocated in a large highmem zone could pin
+ * struct inode memory allocated in relatively small lowmem zones. So
+ * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime.
+ */
 
-static void inode_lru_list_del(struct inode *inode)
+#ifndef CONFIG_HIGHMEM
+/**
+ * inode_pages_set - mark the inode as holding page cache
+ * @inode: the inode whose first cache page was just added
+ *
+ * Tell the VFS that this inode has populated page cache and must not
+ * be reclaimed by the inode shrinker.
+ *
+ * The caller must hold the page lock of the just-added page: by
+ * pinning the page, the page cache cannot become depopulated, and we
+ * can safely set I_PAGES without a race check under the i_pages lock.
+ *
+ * This function acquires the i_lock.
+ */
+void inode_pages_set(struct inode *inode)
 {
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_PAGES)) {
+		inode->i_state |= I_PAGES;
+		if (!list_empty(&inode->i_lru)) {
+			count_vm_event(PGINODERESCUE);
+			inode_lru_list_del(inode);
+		}
+	}
+	spin_unlock(&inode->i_lock);
+}
 
-	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
-		this_cpu_dec(nr_unused);
+/**
+ * inode_pages_clear - mark the inode as not holding page cache
+ * @inode: the inode whose last cache page was just removed
+ *
+ * Tell the VFS that the inode no longer holds page cache and that its
+ * lifetime is to be handed over to the inode shrinker LRU.
+ *
+ * This function acquires the i_lock and the i_pages lock.
+ */
+void inode_pages_clear(struct inode *inode)
+{
+	struct address_space *mapping = &inode->i_data;
+	bool add_to_lru = false;
+	unsigned long flags;
+
+	spin_lock(&inode->i_lock);
+
+	xa_lock_irqsave(&mapping->i_pages, flags);
+	if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
+		inode->i_state &= ~I_PAGES;
+		add_to_lru = true;
+	}
+	xa_unlock_irqrestore(&mapping->i_pages, flags);
+
+	if (add_to_lru) {
+		WARN_ON_ONCE(!list_empty(&inode->i_lru));
+		if (inode_add_lru(inode))
+			__count_vm_event(PGINODEDELAYED);
+	}
+
+	spin_unlock(&inode->i_lock);
 }
+#endif /* !CONFIG_HIGHMEM */
 
 /**
  * inode_sb_list_add - add inode to the superblock list of inodes
@@ -743,6 +818,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+	WARN_ON_ONCE(inode->i_state & I_PAGES);
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we cannot reclaim them now.
@@ -762,7 +839,18 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_ROTATE;
 	}
 
-	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+	/*
+	 * Usually, populated inodes shouldn't be on the shrinker LRU,
+	 * but they can be briefly visible when a new page is added to
+	 * an inode that was already linked but inode_pages_set()
+	 * hasn't run yet to move them off.
+	 *
+	 * The other exception is on HIGHMEM systems: highmem cache
+	 * can pin lowmem struct inodes, and we might be in dire
+	 * straits in the lower zones. Purge cache to free the inode.
+	 */
+	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
+#ifdef CONFIG_HIGHMEM
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);
@@ -779,6 +867,12 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		iput(inode);
 		spin_lock(lru_lock);
 		return LRU_RETRY;
+#else
+		list_lru_isolate(lru, &inode->i_lru);
+		spin_unlock(&inode->i_lock);
+		this_cpu_dec(nr_unused);
+		return LRU_REMOVED;
+#endif
 	}
 
 	WARN_ON(inode->i_state & I_NEW);
diff --git a/fs/internal.h b/fs/internal.h
index aa5d45524e87..2ca77c0afb0b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -137,7 +137,7 @@ extern int vfs_open(const struct path *, struct file *);
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
+extern bool inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e4a1f28435fe..c3137ee87cb2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -592,6 +592,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
 	atomic_inc(&mapping->i_mmap_writable);
 }
 
+static inline bool mapping_empty(struct address_space *mapping)
+{
+	return mapping->nrpages + mapping->nrexceptional == 0;
+}
+
 /*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
@@ -2165,6 +2170,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_PAGES		Inode is holding page cache that needs to get reclaimed
+ *			first before the inode can go onto the shrinker LRU.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2187,6 +2195,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_PAGES			(1 << 16)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
@@ -3124,6 +3133,14 @@ static inline void remove_inode_hash(struct inode *inode)
 		__remove_inode_hash(inode);
 }
 
+#ifndef CONFIG_HIGHMEM
+extern void inode_pages_set(struct inode *inode);
+extern void inode_pages_clear(struct inode *inode);
+#else
+static inline void inode_pages_set(struct inode *inode) {}
+static inline void inode_pages_clear(struct inode *inode) {}
+#endif
+
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c6348c50136f..c3646b79489e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -613,7 +613,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+extern bool __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ffef0f279747..de362c9cda29 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
-		PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
+		SLABS_SCANNED,
+		PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
diff --git a/mm/filemap.c b/mm/filemap.c
index 792e22e1e3c0..9cc423dcb16c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,8 +116,8 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
-static void page_cache_delete(struct address_space *mapping,
-				   struct page *page, void *shadow)
+static bool __must_check page_cache_delete(struct address_space *mapping,
+					   struct page *page, void *shadow)
 {
 	XA_STATE(xas, &mapping->i_pages, page->index);
 	unsigned int nr = 1;
@@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping,
 		smp_wmb();
 	}
 	mapping->nrpages -= nr;
+
+	return mapping_empty(mapping);
 }
 
 static void unaccount_page_cache_page(struct address_space *mapping,
@@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping,
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the i_pages lock.
+ *
+ * If this returns true, the caller must call inode_pages_clear()
+ * after dropping the i_pages lock.
  */
-void __delete_from_page_cache(struct page *page, void *shadow)
+bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 
 	unaccount_page_cache_page(mapping, page);
-	page_cache_delete(mapping, page, shadow);
+	return page_cache_delete(mapping, page, shadow);
 }
 
 static void page_cache_free_page(struct address_space *mapping,
@@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
+	bool empty;
 
 	BUG_ON(!PageLocked(page));
 	xa_lock_irqsave(&mapping->i_pages, flags);
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	page_cache_free_page(mapping, page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
@@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache);
  *
  * The function expects the i_pages lock to be held.
  */
-static void page_cache_delete_batch(struct address_space *mapping,
-			     struct pagevec *pvec)
+static bool __must_check page_cache_delete_batch(struct address_space *mapping,
+						 struct pagevec *pvec)
 {
 	XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
 	int total_pages = 0;
@@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping,
 		total_pages++;
 	}
 	mapping->nrpages -= total_pages;
+
+	return mapping_empty(mapping);
 }
 
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec)
 {
 	int i;
+	bool empty;
 	unsigned long flags;
 
 	if (!pagevec_count(pvec))
@@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping,
 
 		unaccount_page_cache_page(mapping, pvec->pages[i]);
 	}
-	page_cache_delete_batch(mapping, pvec);
+	empty = page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
 }
@@ -833,6 +848,7 @@ static int __add_to_page_cache_locked(struct page *page,
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
 	int huge = PageHuge(page);
+	bool populated = false;
 	int error;
 	void *old;
 
@@ -859,6 +875,7 @@ static int __add_to_page_cache_locked(struct page *page,
 		if (xas_error(&xas))
 			goto unlock;
 
+		populated = mapping_empty(mapping);
 		if (xa_is_value(old)) {
 			mapping->nrexceptional--;
 			if (shadowp)
@@ -879,6 +896,10 @@ static int __add_to_page_cache_locked(struct page *page,
 	}
 
 	trace_mm_filemap_add_to_page_cache(page);
+
+	if (populated)
+		inode_pages_set(mapping->host);
+
 	return 0;
 error:
 	page->mapping = NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 23763452b52a..fbe9a7297759 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2406,7 +2406,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
-			__delete_from_page_cache(head + i, NULL);
+			/* We know we're not removing the last page */
+			(void)__delete_from_page_cache(head + i, NULL);
 			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
 				shmem_uncharge(head->mapping->host, 1);
 			put_page(head + i);
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..8fb6c2f762bc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -31,24 +31,31 @@
  * itself locked.  These unlocked entries need verification under the tree
  * lock.
  */
-static inline void __clear_shadow_entry(struct address_space *mapping,
-				pgoff_t index, void *entry)
+static bool __must_check __clear_shadow_entry(struct address_space *mapping,
+					      pgoff_t index, void *entry)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 
 	xas_set_update(&xas, workingset_update_node);
 	if (xas_load(&xas) != entry)
-		return;
+		return 0;
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+
+	return mapping_empty(mapping);
 }
 
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
+	bool empty;
+
 	xa_lock_irq(&mapping->i_pages);
-	__clear_shadow_entry(mapping, index, entry);
+	empty = __clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
 }
 
 /*
@@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 				pgoff_t end)
 {
 	int i, j;
-	bool dax, lock;
+	bool dax, lock, empty = false;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
@@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 			continue;
 		}
 
-		__clear_shadow_entry(mapping, index, page);
+		if (__clear_shadow_entry(mapping, index, page))
+			empty = true;
 	}
 
 	if (lock)
 		xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	pvec->nr = j;
 }
 
@@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	pgoff_t		index;
 	int		i;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	/* Offsets within partial pages */
@@ -636,6 +648,7 @@ static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
 	unsigned long flags;
+	bool empty;
 
 	if (page->mapping != mapping)
 		return 0;
@@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
 
@@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	pagevec_init(&pvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..2ae2df605006 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
+		int empty;
 
 		freepage = mapping->a_ops->freepage;
 		/*
@@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		if (reclaimed && page_is_file_lru(page) &&
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(page, target_memcg);
-		__delete_from_page_cache(page, shadow);
+		empty = __delete_from_page_cache(page, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+		if (empty)
+			inode_pages_clear(mapping->host);
+
 		if (freepage != NULL)
 			freepage(page);
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 35219271796f..0fdbec92265a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1205,9 +1205,11 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
 #endif
-	"pginodesteal",
 	"slabs_scanned",
+	"pginodesteal",
 	"kswapd_inodesteal",
+	"pginoderescue",
+	"pginodedelayed",
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"pageoutrun",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..7ce9c74ebfdb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	struct xa_node *node = container_of(item, struct xa_node, private_list);
 	XA_STATE(xas, node->array, 0);
 	struct address_space *mapping;
+	bool empty = false;
 	int ret;
 
 	/*
@@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	if (WARN_ON_ONCE(node->count != node->nr_values))
 		goto out_invalid;
 	mapping->nrexceptional -= node->nr_values;
+	empty = mapping_empty(mapping);
 	xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
 	xas.xa_offset = node->offset;
 	xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
@@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();
Yafang Shao May 13, 2020, 1:32 a.m. UTC | #50
On Wed, May 13, 2020 at 5:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> > The VFS inode shrinker is currently allowed to reclaim inodes with
> > populated page cache. As a result it can drop gigabytes of hot and
> > active page cache on the floor without consulting the VM (recorded as
> > "inodesteal" events in /proc/vmstat).
>
> I'm sending a rebased version of this patch.
>
> We've been running with this change in the Facebook fleet since
> February with no ill side effects observed.
>
> However, I just spent several hours chasing a mysterious reclaim
> problem that turned out to be this bug again on an unpatched system.
>
> In the scenario I was debugging, the problem wasn't that we were
> losing cache, but that we were losing the non-resident information for
> previously evicted cache.
>
> I understood the file set enough to know it was thrashing like crazy,
> but it didn't register as refaults to the kernel. Without detecting
> the refaults, reclaim wouldn't start swapping to relieve the
> struggling cache (plenty of cold anon memory around). It also meant
> the IO delays of those refaults didn't contribute to memory pressure
> in psi, which made userspace blind to the situation as well.
>
> The first aspect means we can get stuck in pathological thrashing, the
> second means userspace OOM detection breaks and we can leave servers
> (or Android devices, for that matter) hopelessly livelocked.
>
> New patch attached below. I hope we can get this fixed in 5.8, it's
> really quite a big hole in our cache management strategy.
>
> ---
> From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Tue, 11 Feb 2020 12:55:07 -0500
> Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
>
> The VFS inode shrinker is currently allowed to reclaim cold inodes
> with populated page cache. This behavior goes back to CONFIG_HIGHMEM
> setups, which required the ability to drop page cache in large highem
> zones to free up struct inodes in comparatively tiny lowmem zones.
>
> However, it has significant side effects that are hard to justify on
> systems without highmem:
>
> - It can drop gigabytes of hot and active page cache on the floor
> without consulting the VM (recorded as "inodesteal" events in
> /proc/vmstat). Such an "aging inversion" between unreferenced inodes
> holding hot cache easily happens in practice: for example, a git tree
> whose objects are accessed frequently but no open file descriptors are
> maintained throughout.
>

Hi Johannes,

I think it is reasonable to keep inodes with _active_ page cache off
the inode shrinker LRU, but I'm not sure it is proper to also keep
inodes with _only_ inactive page cache off the inode list LRU. Per my
understanding, if an inode holds only inactive page cache, then
invalidating all of that inactive cache at once could save the
reclaimer time; IOW, it may improve performance in this case.

> - It can also prematurely drop non-resident info stored inside the
> inodes, which can let massive cache thrashing go unnoticed. This in
> turn means we're making the wrong decisions in page reclaim and can
> get stuck in pathological thrashing. The thrashing also doesn't show
> up as memory pressure in psi, causing failure in the userspace OOM
> detection chain, which can leave a system (or Android device e.g.)
> stranded in a livelock.
>
> History
>
> As this keeps causing problems for people, there have been several
> attempts to address this.
>
> One recent attempt was to make the inode shrinker simply skip over
> inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim
> inodes with many attached pages").
>
> However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm:
> don't reclaim inodes with many attached pages"") because it caused
> severe reclaim performance problems: Inodes that sit on the shrinker
> LRU are attracting reclaim pressure away from the page cache and
> toward the VFS. If we then permanently exempt sizable portions of this
> pool from actually getting reclaimed when looked at, this pressure
> accumulates as deferred shrinker work (a mechanism for *temporarily*
> unreclaimable objects) until it causes mayhem in the VFS cache pools.
>
> In the bug quoted in 69056ee6a8a3 in particular, the excessive
> pressure drove the XFS shrinker into dirty objects, where it caused
> synchronous, IO-bound stalls, even as there was plenty of clean page
> cache that should have been reclaimed instead.
>
> Another variant of this problem was recently observed, where the
> kernel violates cgroups' memory.low protection settings and reclaims
> page cache way beyond the configured thresholds. It was followed by a
> proposal of a modified form of the reverted commit above, that
> implements memory.low-sensitive shrinker skipping over populated
> inodes on the LRU [1]. However, this proposal continues to run the
> risk of attracting disproportionate reclaim pressure to a pool of
> still-used inodes, while not addressing the more generic reclaim
> inversion problem outside of a very specific cgroup application.
>
> [1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/
>
> Solution
>
> This patch fixes the aging inversion described above on
> !CONFIG_HIGHMEM systems, without reintroducing the problems associated
> with excessive shrinker LRU rotations, by keeping populated inodes off
> the shrinker LRUs entirely.
>
> Currently, inodes are kept off the shrinker LRU as long as they have
> an elevated i_count, indicating an active user. Unfortunately, the
> page cache cannot simply hold an i_count reference, because unlink()
> *should* result in the inode being dropped and its cache invalidated.
>
> Instead, this patch makes iput_final() consult the state of the page
> cache and punt the LRU linking to the VM if the inode is still
> populated; the VM in turn checks the inode state when it depopulates
> the page cache, and adds the inode to the LRU if necessary.
>
> This is not unlike what we do for dirty inodes, which are moved off
> the LRU permanently until writeback completion puts them back on (iff
> still unused). We can reuse the same code -- inode_add_lru() -- here.
>
> This is also not unlike page reclaim, where the lower VM layer has to
> negotiate state with the higher VFS layer. Follow existing precedent
> and handle the inversion as much as possible on the VM side:
>
> - introduce an I_PAGES flag that the VM maintains under the i_lock, so
>   that any inode code holding that lock can check the page cache state
>   without having to lock and inspect the struct address_space
>
> - introduce inode_pages_set() and inode_pages_clear() to maintain the
>   inode LRU state from the VM side, then update all cache mutators to
>   use them when populating the first cache entry or clearing the last
>
> With this, the concept of "inodesteal" - where the inode shrinker
> drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM
> is in charge of the cache, the shrinker in charge of struct inode.
>
> Footnotes
>
> - For debuggability, add vmstat counters that track the number of
>   times a new cache entry pulls a previously unused inode off the LRU
>   (pginoderescue), as well as how many times existing cache deferred
>   an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
>   for backwards compatibility, but they'll just show 0 now.
>
> - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
>   cache. Not doing so has always been a bit strange, but since most
>   people drop cache and metadata cache together, the inode shrinker
>   would have taken care of them before - no more, so do it VM-side.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/block_dev.c                |   2 +-
>  fs/dax.c                      |  14 +++++
>  fs/drop_caches.c              |   2 +-
>  fs/inode.c                    | 112 +++++++++++++++++++++++++++++++---
>  fs/internal.h                 |   2 +-
>  include/linux/fs.h            |  17 ++++++
>  include/linux/pagemap.h       |   2 +-
>  include/linux/vm_event_item.h |   3 +-
>  mm/filemap.c                  |  37 ++++++++---
>  mm/huge_memory.c              |   3 +-
>  mm/truncate.c                 |  34 ++++++++---
>  mm/vmscan.c                   |   6 +-
>  mm/vmstat.c                   |   4 +-
>  mm/workingset.c               |   4 ++
>  14 files changed, 208 insertions(+), 34 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4c26dcd9dba3..e0f73647c848 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev)
>  {
>         struct address_space *mapping = bdev->bd_inode->i_mapping;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 return;
>
>         invalidate_bh_lrus();
> diff --git a/fs/dax.c b/fs/dax.c
> index 11b16729b86f..89d1245a96ce 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas,
>  {
>         unsigned long index = xas->xa_index;
>         bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
> +       int populated;
>         void *entry;
>
>  retry:
> +       populated = 0;
>         xas_lock_irq(xas);
>         entry = get_unlocked_entry(xas, order);
>
> @@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
>                 xas_store(xas, NULL);   /* undo the PMD join */
>                 dax_wake_entry(xas, entry, true);
>                 mapping->nrexceptional--;
> +               if (mapping_empty(mapping))
> +                       populated = -1;
>                 entry = NULL;
>                 xas_set(xas, index);
>         }
> @@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas,
>                 dax_lock_entry(xas, entry);
>                 if (xas_error(xas))
>                         goto out_unlock;
> +               if (mapping_empty(mapping))
> +                       populated++;
>                 mapping->nrexceptional++;
>         }
>
>  out_unlock:
>         xas_unlock_irq(xas);
> +       if (populated == -1)
> +               inode_pages_clear(mapping->inode);
> +       else if (populated == 1)
> +               inode_pages_set(mapping->inode);
>         if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
>                 goto retry;
>         if (xas->xa_node == XA_ERROR(-ENOMEM))
> @@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>                                           pgoff_t index, bool trunc)
>  {
>         XA_STATE(xas, &mapping->i_pages, index);
> +       bool empty = false;
>         int ret = 0;
>         void *entry;
>
> @@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping,
>         dax_disassociate_entry(entry, mapping, trunc);
>         xas_store(&xas, NULL);
>         mapping->nrexceptional--;
> +       empty = mapping_empty(mapping);
>         ret = 1;
>  out:
>         put_unlocked_entry(&xas, entry);
>         xas_unlock_irq(&xas);
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>         return ret;
>  }
>
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index f00fcc4a4f72..20e845ff43f2 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>                  * we need to reschedule to avoid softlockups.
>                  */
>                 if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -                   (inode->i_mapping->nrpages == 0 && !need_resched())) {
> +                   (mapping_empty(inode->i_mapping) && !need_resched())) {
>                         spin_unlock(&inode->i_lock);
>                         continue;
>                 }
> diff --git a/fs/inode.c b/fs/inode.c
> index 9fcec07a9d7c..9da373244db7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -431,26 +431,101 @@ static void inode_lru_list_add(struct inode *inode)
>                 inode->i_state |= I_REFERENCED;
>  }
>
> +static void inode_lru_list_del(struct inode *inode)
> +{
> +       if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> +               this_cpu_dec(nr_unused);
> +}
> +
>  /*
>   * Add inode to LRU if needed (inode is unused and clean).
>   *
>   * Needs inode->i_lock held.
>   */
> -void inode_add_lru(struct inode *inode)
> +bool inode_add_lru(struct inode *inode)
>  {
> -       if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
> -                               I_FREEING | I_WILL_FREE)) &&
> -           !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
> -               inode_lru_list_add(inode);
> +       if (inode->i_state &
> +           (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
> +               return false;
> +       if (atomic_read(&inode->i_count))
> +               return false;
> +       if (!(inode->i_sb->s_flags & SB_ACTIVE))
> +               return false;
> +       inode_lru_list_add(inode);
> +       return true;
>  }
>
> +/*
> + * Usually, inodes become reclaimable when they are no longer
> + * referenced and their page cache has been reclaimed. The following
> + * API allows the VM to communicate cache population state to the VFS.
> + *
> + * However, on CONFIG_HIGHMEM we can't wait for the page cache to go
> + * away: cache pages allocated in a large highmem zone could pin
> + * struct inode memory allocated in relatively small lowmem zones. So
> + * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime.
> + */
>
> -static void inode_lru_list_del(struct inode *inode)
> +#ifndef CONFIG_HIGHMEM
> +/**
> + * inode_pages_set - mark the inode as holding page cache
> + * @inode: the inode whose first cache page was just added
> + *
> + * Tell the VFS that this inode has populated page cache and must not
> + * be reclaimed by the inode shrinker.
> + *
> + * The caller must hold the page lock of the just-added page: by
> + * pinning the page, the page cache cannot become depopulated, and we
> + * can safely set I_PAGES without a race check under the i_pages lock.
> + *
> + * This function acquires the i_lock.
> + */
> +void inode_pages_set(struct inode *inode)
>  {
> +       spin_lock(&inode->i_lock);
> +       if (!(inode->i_state & I_PAGES)) {
> +               inode->i_state |= I_PAGES;
> +               if (!list_empty(&inode->i_lru)) {
> +                       count_vm_event(PGINODERESCUE);
> +                       inode_lru_list_del(inode);
> +               }
> +       }
> +       spin_unlock(&inode->i_lock);
> +}
>
> -       if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
> -               this_cpu_dec(nr_unused);
> +/**
> + * inode_pages_clear - mark the inode as not holding page cache
> + * @inode: the inode whose last cache page was just removed
> + *
> + * Tell the VFS that the inode no longer holds page cache and that its
> + * lifetime is to be handed over to the inode shrinker LRU.
> + *
> + * This function acquires the i_lock and the i_pages lock.
> + */
> +void inode_pages_clear(struct inode *inode)
> +{
> +       struct address_space *mapping = &inode->i_data;
> +       bool add_to_lru = false;
> +       unsigned long flags;
> +
> +       spin_lock(&inode->i_lock);
> +
> +       xa_lock_irqsave(&mapping->i_pages, flags);
> +       if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
> +               inode->i_state &= ~I_PAGES;
> +               add_to_lru = true;
> +       }
> +       xa_unlock_irqrestore(&mapping->i_pages, flags);
> +
> +       if (add_to_lru) {
> +               WARN_ON_ONCE(!list_empty(&inode->i_lru));
> +               if (inode_add_lru(inode))
> +                       __count_vm_event(PGINODEDELAYED);
> +       }
> +
> +       spin_unlock(&inode->i_lock);
>  }
> +#endif /* !CONFIG_HIGHMEM */
>
>  /**
>   * inode_sb_list_add - add inode to the superblock list of inodes
> @@ -743,6 +818,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>         if (!spin_trylock(&inode->i_lock))
>                 return LRU_SKIP;
>
> +       WARN_ON_ONCE(inode->i_state & I_PAGES);
> +
>         /*
>          * Referenced or dirty inodes are still in use. Give them another pass
>          * through the LRU as we canot reclaim them now.
> @@ -762,7 +839,18 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>                 return LRU_ROTATE;
>         }
>
> -       if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> +       /*
> +        * Usually, populated inodes shouldn't be on the shrinker LRU,
> +        * but they can be briefly visible when a new page is added to
> +        * an inode that was already linked but inode_pages_set()
> +        * hasn't run yet to move them off.
> +        *
> +        * The other exception is on HIGHMEM systems: highmem cache
> +        * can pin lowmem struct inodes, and we might be in dire
> +        * straits in the lower zones. Purge cache to free the inode.
> +        */
> +       if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> +#ifdef CONFIG_HIGHMEM
>                 __iget(inode);
>                 spin_unlock(&inode->i_lock);
>                 spin_unlock(lru_lock);
> @@ -779,6 +867,12 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>                 iput(inode);
>                 spin_lock(lru_lock);
>                 return LRU_RETRY;
> +#else
> +               list_lru_isolate(lru, &inode->i_lru);
> +               spin_unlock(&inode->i_lock);
> +               this_cpu_dec(nr_unused);
> +               return LRU_REMOVED;
> +#endif
>         }
>
>         WARN_ON(inode->i_state & I_NEW);
> diff --git a/fs/internal.h b/fs/internal.h
> index aa5d45524e87..2ca77c0afb0b 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -137,7 +137,7 @@ extern int vfs_open(const struct path *, struct file *);
>   * inode.c
>   */
>  extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
> -extern void inode_add_lru(struct inode *inode);
> +extern bool inode_add_lru(struct inode *inode);
>  extern int dentry_needs_remove_privs(struct dentry *dentry);
>
>  /*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e4a1f28435fe..c3137ee87cb2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -592,6 +592,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
>         atomic_inc(&mapping->i_mmap_writable);
>  }
>
> +static inline bool mapping_empty(struct address_space *mapping)
> +{
> +       return mapping->nrpages + mapping->nrexceptional == 0;
> +}
> +
>  /*
>   * Use sequence counter to get consistent i_size on 32-bit processors.
>   */
> @@ -2165,6 +2170,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>   *
>   * I_CREATING          New object's inode in the middle of setting up.
>   *
> + * I_PAGES             Inode is holding page cache that needs to get reclaimed
> + *                     first before the inode can go onto the shrinker LRU.
> + *
>   * Q: What is the difference between I_WILL_FREE and I_FREEING?
>   */
>  #define I_DIRTY_SYNC           (1 << 0)
> @@ -2187,6 +2195,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>  #define I_WB_SWITCH            (1 << 13)
>  #define I_OVL_INUSE            (1 << 14)
>  #define I_CREATING             (1 << 15)
> +#define I_PAGES                        (1 << 16)
>
>  #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
> @@ -3124,6 +3133,14 @@ static inline void remove_inode_hash(struct inode *inode)
>                 __remove_inode_hash(inode);
>  }
>
> +#ifndef CONFIG_HIGHMEM
> +extern void inode_pages_set(struct inode *inode);
> +extern void inode_pages_clear(struct inode *inode);
> +#else
> +static inline void inode_pages_set(struct inode *inode) {}
> +static inline void inode_pages_clear(struct inode *inode) {}
> +#endif
> +
>  extern void inode_sb_list_add(struct inode *inode);
>
>  #ifdef CONFIG_BLOCK
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c6348c50136f..c3646b79489e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -613,7 +613,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>                                 pgoff_t index, gfp_t gfp_mask);
>  extern void delete_from_page_cache(struct page *page);
> -extern void __delete_from_page_cache(struct page *page, void *shadow);
> +extern bool __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
>  void delete_from_page_cache_batch(struct address_space *mapping,
>                                   struct pagevec *pvec);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index ffef0f279747..de362c9cda29 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
>  #endif
> -               PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
> +               SLABS_SCANNED,
> +               PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
>                 KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>                 PAGEOUTRUN, PGROTATED,
>                 DROP_PAGECACHE, DROP_SLAB,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 792e22e1e3c0..9cc423dcb16c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -116,8 +116,8 @@
>   *   ->tasklist_lock            (memory_failure, collect_procs_ao)
>   */
>
> -static void page_cache_delete(struct address_space *mapping,
> -                                  struct page *page, void *shadow)
> +static bool __must_check page_cache_delete(struct address_space *mapping,
> +                                          struct page *page, void *shadow)
>  {
>         XA_STATE(xas, &mapping->i_pages, page->index);
>         unsigned int nr = 1;
> @@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping,
>                 smp_wmb();
>         }
>         mapping->nrpages -= nr;
> +
> +       return mapping_empty(mapping);
>  }
>
>  static void unaccount_page_cache_page(struct address_space *mapping,
> @@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping,
>   * Delete a page from the page cache and free it. Caller has to make
>   * sure the page is locked and that nobody else uses it - or that usage
>   * is safe.  The caller must hold the i_pages lock.
> + *
> + * If this returns true, the caller must call inode_pages_clear()
> + * after dropping the i_pages lock.
>   */
> -void __delete_from_page_cache(struct page *page, void *shadow)
> +bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
>  {
>         struct address_space *mapping = page->mapping;
>
>         trace_mm_filemap_delete_from_page_cache(page);
>
>         unaccount_page_cache_page(mapping, page);
> -       page_cache_delete(mapping, page, shadow);
> +       return page_cache_delete(mapping, page, shadow);
>  }
>
>  static void page_cache_free_page(struct address_space *mapping,
> @@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
>         unsigned long flags;
> +       bool empty;
>
>         BUG_ON(!PageLocked(page));
>         xa_lock_irqsave(&mapping->i_pages, flags);
> -       __delete_from_page_cache(page, NULL);
> +       empty = __delete_from_page_cache(page, NULL);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         page_cache_free_page(mapping, page);
>  }
>  EXPORT_SYMBOL(delete_from_page_cache);
> @@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache);
>   *
>   * The function expects the i_pages lock to be held.
>   */
> -static void page_cache_delete_batch(struct address_space *mapping,
> -                            struct pagevec *pvec)
> +static bool __must_check page_cache_delete_batch(struct address_space *mapping,
> +                                                struct pagevec *pvec)
>  {
>         XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
>         int total_pages = 0;
> @@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping,
>                 total_pages++;
>         }
>         mapping->nrpages -= total_pages;
> +
> +       return mapping_empty(mapping);
>  }
>
>  void delete_from_page_cache_batch(struct address_space *mapping,
>                                   struct pagevec *pvec)
>  {
>         int i;
> +       bool empty;
>         unsigned long flags;
>
>         if (!pagevec_count(pvec))
> @@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping,
>
>                 unaccount_page_cache_page(mapping, pvec->pages[i]);
>         }
> -       page_cache_delete_batch(mapping, pvec);
> +       empty = page_cache_delete_batch(mapping, pvec);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         for (i = 0; i < pagevec_count(pvec); i++)
>                 page_cache_free_page(mapping, pvec->pages[i]);
>  }
> @@ -833,6 +848,7 @@ static int __add_to_page_cache_locked(struct page *page,
>  {
>         XA_STATE(xas, &mapping->i_pages, offset);
>         int huge = PageHuge(page);
> +       bool populated = false;
>         int error;
>         void *old;
>
> @@ -859,6 +875,7 @@ static int __add_to_page_cache_locked(struct page *page,
>                 if (xas_error(&xas))
>                         goto unlock;
>
> +               populated = mapping_empty(mapping);
>                 if (xa_is_value(old)) {
>                         mapping->nrexceptional--;
>                         if (shadowp)
> @@ -879,6 +896,10 @@ static int __add_to_page_cache_locked(struct page *page,
>         }
>
>         trace_mm_filemap_add_to_page_cache(page);
> +
> +       if (populated)
> +               inode_pages_set(mapping->host);
> +
>         return 0;
>  error:
>         page->mapping = NULL;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 23763452b52a..fbe9a7297759 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2406,7 +2406,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                 /* Some pages can be beyond i_size: drop them from page cache */
>                 if (head[i].index >= end) {
>                         ClearPageDirty(head + i);
> -                       __delete_from_page_cache(head + i, NULL);
> +                       /* We know we're not removing the last page */
> +                       (void)__delete_from_page_cache(head + i, NULL);
>                         if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
>                                 shmem_uncharge(head->mapping->host, 1);
>                         put_page(head + i);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index dd9ebc1da356..8fb6c2f762bc 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -31,24 +31,31 @@
>   * itself locked.  These unlocked entries need verification under the tree
>   * lock.
>   */
> -static inline void __clear_shadow_entry(struct address_space *mapping,
> -                               pgoff_t index, void *entry)
> +static bool __must_check __clear_shadow_entry(struct address_space *mapping,
> +                                             pgoff_t index, void *entry)
>  {
>         XA_STATE(xas, &mapping->i_pages, index);
>
>         xas_set_update(&xas, workingset_update_node);
>         if (xas_load(&xas) != entry)
> -               return;
> +               return 0;
>         xas_store(&xas, NULL);
>         mapping->nrexceptional--;
> +
> +       return mapping_empty(mapping);
>  }
>
>  static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
>                                void *entry)
>  {
> +       bool empty;
> +
>         xa_lock_irq(&mapping->i_pages);
> -       __clear_shadow_entry(mapping, index, entry);
> +       empty = __clear_shadow_entry(mapping, index, entry);
>         xa_unlock_irq(&mapping->i_pages);
> +
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>  }
>
>  /*
> @@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>                                 pgoff_t end)
>  {
>         int i, j;
> -       bool dax, lock;
> +       bool dax, lock, empty = false;
>
>         /* Handled by shmem itself */
>         if (shmem_mapping(mapping))
> @@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
>                         continue;
>                 }
>
> -               __clear_shadow_entry(mapping, index, page);
> +               if (__clear_shadow_entry(mapping, index, page))
> +                       empty = true;
>         }
>
>         if (lock)
>                 xa_unlock_irq(&mapping->i_pages);
> +
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         pvec->nr = j;
>  }
>
> @@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>         pgoff_t         index;
>         int             i;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 goto out;
>
>         /* Offsets within partial pages */
> @@ -636,6 +648,7 @@ static int
>  invalidate_complete_page2(struct address_space *mapping, struct page *page)
>  {
>         unsigned long flags;
> +       bool empty;
>
>         if (page->mapping != mapping)
>                 return 0;
> @@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
>                 goto failed;
>
>         BUG_ON(page_has_private(page));
> -       __delete_from_page_cache(page, NULL);
> +       empty = __delete_from_page_cache(page, NULL);
>         xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +       if (empty)
> +               inode_pages_clear(mapping->host);
> +
>         if (mapping->a_ops->freepage)
>                 mapping->a_ops->freepage(page);
>
> @@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>         int ret2 = 0;
>         int did_range_unmap = 0;
>
> -       if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +       if (mapping_empty(mapping))
>                 goto out;
>
>         pagevec_init(&pvec);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc555903a332..2ae2df605006 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>         } else {
>                 void (*freepage)(struct page *);
>                 void *shadow = NULL;
> +               int empty;
>
>                 freepage = mapping->a_ops->freepage;
>                 /*
> @@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>                 if (reclaimed && page_is_file_lru(page) &&
>                     !mapping_exiting(mapping) && !dax_mapping(mapping))
>                         shadow = workingset_eviction(page, target_memcg);
> -               __delete_from_page_cache(page, shadow);
> +               empty = __delete_from_page_cache(page, shadow);
>                 xa_unlock_irqrestore(&mapping->i_pages, flags);
>
> +               if (empty)
> +                       inode_pages_clear(mapping->host);
> +
>                 if (freepage != NULL)
>                         freepage(page);
>         }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 35219271796f..0fdbec92265a 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1205,9 +1205,11 @@ const char * const vmstat_text[] = {
>  #ifdef CONFIG_NUMA
>         "zone_reclaim_failed",
>  #endif
> -       "pginodesteal",
>         "slabs_scanned",
> +       "pginodesteal",
>         "kswapd_inodesteal",
> +       "pginoderescue",
> +       "pginodedelayed",
>         "kswapd_low_wmark_hit_quickly",
>         "kswapd_high_wmark_hit_quickly",
>         "pageoutrun",
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 474186b76ced..7ce9c74ebfdb 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>         struct xa_node *node = container_of(item, struct xa_node, private_list);
>         XA_STATE(xas, node->array, 0);
>         struct address_space *mapping;
> +       bool empty = false;
>         int ret;
>
>         /*
> @@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>         if (WARN_ON_ONCE(node->count != node->nr_values))
>                 goto out_invalid;
>         mapping->nrexceptional -= node->nr_values;
> +       empty = mapping_empty(mapping);
>         xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
>         xas.xa_offset = node->offset;
>         xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
> @@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>
>  out_invalid:
>         xa_unlock_irq(&mapping->i_pages);
> +       if (empty)
> +               inode_pages_clear(mapping->host);
>         ret = LRU_REMOVED_RETRY;
>  out:
>         cond_resched();
> --
> 2.26.2
>
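[Editor's note: for readers skimming the diff, the ordering contract it introduces — cache mutators detect the populated/empty transition while holding the i_pages lock, but call inode_pages_set()/inode_pages_clear() only after dropping it, because those helpers take i_lock themselves — can be sketched as a small userspace model. The types and the flag-based "lock" below are simplified stand-ins, not kernel code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures; the lock is modeled
 * as a flag so the ordering contract can be asserted at runtime. */
struct address_space {
	bool i_pages_locked;		/* models the xa_lock on i_pages */
	unsigned long nrpages;
	unsigned long nrexceptional;
	bool host_has_I_PAGES;		/* models I_PAGES on mapping->host */
};

static void i_pages_lock(struct address_space *m)
{
	assert(!m->i_pages_locked);
	m->i_pages_locked = true;
}

static void i_pages_unlock(struct address_space *m)
{
	assert(m->i_pages_locked);
	m->i_pages_locked = false;
}

static bool mapping_empty(struct address_space *m)
{
	return m->nrpages + m->nrexceptional == 0;
}

/* These take i_lock in the kernel, so they must not run under the
 * i_pages lock -- hence the bool return from the deletion paths. */
static void inode_pages_set(struct address_space *m)
{
	assert(!m->i_pages_locked);
	m->host_has_I_PAGES = true;
}

static void inode_pages_clear(struct address_space *m)
{
	assert(!m->i_pages_locked);
	m->host_has_I_PAGES = false;
}

/* Models delete_from_page_cache(): the empty transition is detected
 * inside the lock but acted upon outside it. */
static void delete_from_page_cache(struct address_space *m)
{
	bool empty;

	i_pages_lock(m);
	m->nrpages--;			/* __delete_from_page_cache() */
	empty = mapping_empty(m);
	i_pages_unlock(m);

	if (empty)
		inode_pages_clear(m);
}

/* Models __add_to_page_cache_locked(): the first entry going into an
 * empty mapping pulls the inode off the shrinker LRU afterwards. */
static void add_to_page_cache(struct address_space *m)
{
	bool populated;

	i_pages_lock(m);
	populated = mapping_empty(m);
	m->nrpages++;
	i_pages_unlock(m);

	if (populated)
		inode_pages_set(m);
}
```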
Johannes Weiner May 13, 2020, 1 p.m. UTC | #51
On Wed, May 13, 2020 at 09:32:58AM +0800, Yafang Shao wrote:
> On Wed, May 13, 2020 at 5:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > populated page cache. As a result it can drop gigabytes of hot and
> > > active page cache on the floor without consulting the VM (recorded as
> > > "inodesteal" events in /proc/vmstat).
> >
> > I'm sending a rebased version of this patch.
> >
> > We've been running with this change in the Facebook fleet since
> > February with no ill side effects observed.
> >
> > However, I just spent several hours chasing a mysterious reclaim
> > problem that turned out to be this bug again on an unpatched system.
> >
> > In the scenario I was debugging, the problem wasn't that we were
> > losing cache, but that we were losing the non-resident information for
> > previously evicted cache.
> >
> > I understood the file set enough to know it was thrashing like crazy,
> > but it didn't register as refaults to the kernel. Without detecting
> > the refaults, reclaim wouldn't start swapping to relieve the
> > struggling cache (plenty of cold anon memory around). It also meant
> > the IO delays of those refaults didn't contribute to memory pressure
> > in psi, which made userspace blind to the situation as well.
> >
> > The first aspect means we can get stuck in pathological thrashing, the
> > second means userspace OOM detection breaks and we can leave servers
> > (or Android devices, for that matter) hopelessly livelocked.
> >
> > New patch attached below. I hope we can get this fixed in 5.8, it's
> > really quite a big hole in our cache management strategy.
> >
> > ---
> > From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Tue, 11 Feb 2020 12:55:07 -0500
> > Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
> >
> > The VFS inode shrinker is currently allowed to reclaim cold inodes
> > with populated page cache. This behavior goes back to CONFIG_HIGHMEM
> > setups, which required the ability to drop page cache in large highem
> > zones to free up struct inodes in comparatively tiny lowmem zones.
> >
> > However, it has significant side effects that are hard to justify on
> > systems without highmem:
> >
> > - It can drop gigabytes of hot and active page cache on the floor
> > without consulting the VM (recorded as "inodesteal" events in
> > /proc/vmstat). Such an "aging inversion" between unreferenced inodes
> > holding hot cache easily happens in practice: for example, a git tree
> > whose objects are accessed frequently but no open file descriptors are
> > maintained throughout.
> >
> 
> Hi Johannes,
> 
> I think it is reasonable to keep inodes with _active_ page cache off
> the inode shrinker LRU, but I'm not sure it is proper to keep inodes
> with _only_ inactive page cache off the shrinker LRU as well. Per my
> understanding, if an inode holds only inactive page cache, then
> invalidating all of that cache could save the reclaimer's time; IOW,
> it may improve performance in this case.

The shrinker doesn't know whether pages are active or inactive.

There is a PageActive() flag, but that's a sampled state that's only
uptodate when page reclaim is running. All the active pages could be
stale and getting deactivated on the next scan; all the inactive pages
could have page table references that would get them activated on the
next reclaim run etc.

You'd have to duplicate aspects of page reclaim itself to be sure
you're axing the right pages.

It also wouldn't be a reliable optimization. This only happens when
there is a disconnect between the inode and the cache lifetime, which
is true for some situations but not others.
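[Editor's note: as context for the LRU handoff under discussion, the iput_final()/inode_pages_clear() handshake from the patch can be modeled in a few lines of userspace C. The types are hypothetical simplifications; the real code does all of this under i_lock and uses list_lru.]

```c
#include <assert.h>
#include <stdbool.h>

#define I_PAGES 0x1

/* Simplified stand-in for struct inode. */
struct inode {
	unsigned int i_state;
	unsigned long nrpages;
	bool on_lru;
};

/* Models inode_add_lru(): a populated inode (I_PAGES set) refuses the
 * shrinker LRU; its lifetime is deferred to the VM. */
static bool inode_add_lru(struct inode *inode)
{
	if (inode->i_state & I_PAGES)
		return false;
	inode->on_lru = true;
	return true;
}

/* Last reference dropped: only unpopulated inodes go on the LRU. */
static void iput_final(struct inode *inode)
{
	inode_add_lru(inode);
}

/* The VM removes the last cache entry and hands the inode lifetime
 * back to the VFS (counted as PGINODEDELAYED in the patch). */
static void inode_pages_clear(struct inode *inode)
{
	if ((inode->i_state & I_PAGES) && inode->nrpages == 0) {
		inode->i_state &= ~I_PAGES;
		inode_add_lru(inode);
	}
}
```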
Andrew Morton May 13, 2020, 9:15 p.m. UTC | #52
On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> 
> ...
>
> Solution
> 
> This patch fixes the aging inversion described above on
> !CONFIG_HIGHMEM systems, without reintroducing the problems associated
> with excessive shrinker LRU rotations, by keeping populated inodes off
> the shrinker LRUs entirely.
> 
> Currently, inodes are kept off the shrinker LRU as long as they have
> an elevated i_count, indicating an active user. Unfortunately, the
> page cache cannot simply hold an i_count reference, because unlink()
> *should* result in the inode being dropped and its cache invalidated.
> 
> Instead, this patch makes iput_final() consult the state of the page
> cache and punt the LRU linking to the VM if the inode is still
> populated; the VM in turn checks the inode state when it depopulates
> the page cache, and adds the inode to the LRU if necessary.
> 
> This is not unlike what we do for dirty inodes, which are moved off
> the LRU permanently until writeback completion puts them back on (iff
> still unused). We can reuse the same code -- inode_add_lru() - here.
> 
> This is also not unlike page reclaim, where the lower VM layer has to
> negotiate state with the higher VFS layer. Follow existing precedent
> and handle the inversion as much as possible on the VM side:
> 
> - introduce an I_PAGES flag that the VM maintains under the i_lock, so
>   that any inode code holding that lock can check the page cache state
>   without having to lock and inspect the struct address_space

Maintaining the same info in two places is a hassle.  Is this
optimization worthwhile?

> - introduce inode_pages_set() and inode_pages_clear() to maintain the
>   inode LRU state from the VM side, then update all cache mutators to
>   use them when populating the first cache entry or clearing the last
> 
> With this, the concept of "inodesteal" - where the inode shrinker
> drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM
> is in charge of the cache, the shrinker in charge of struct inode.

How tested is this on highmem machines?

> Footnotes
> 
> - For debuggability, add vmstat counters that track the number of
>   times a new cache entry pulls a previously unused inode off the LRU
>   (pginoderescue), as well as how many times existing cache deferred
>   an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters
>   for backwards compatibility, but they'll just show 0 now.
> 
> - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page
>   cache. Not doing so has always been a bit strange, but since most
>   people drop cache and metadata cache together, the inode shrinker
>   would have taken care of them before - no more, so do it VM-side.
> 
> ...
>
>  14 files changed, 208 insertions(+), 34 deletions(-)

Patch is surprisingly large.
Andrew Morton May 14, 2020, 2:24 a.m. UTC | #53
On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> +		inode_pages_clear(mapping->inode);
> +	else if (populated == 1)
> +		inode_pages_set(mapping->inode);

mapping->host...

I have to assume this version wasn't runtime tested, so I'll drop it.
Johannes Weiner May 14, 2020, 10:37 a.m. UTC | #54
On Wed, May 13, 2020 at 07:24:10PM -0700, Andrew Morton wrote:
> On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > +		inode_pages_clear(mapping->inode);
> > +	else if (populated == 1)
> > +		inode_pages_set(mapping->inode);
> 
> mapping->host...

I'll fix that.

> I have to assume this version wasn't runtime tested, so I'll drop it.

It was, but I don't have DAX-suitable hardware so that config option
isn't enabled anywhere. I'll try to test this with a brd.

Sorry about that.
Johannes Weiner May 14, 2020, 11:27 a.m. UTC | #55
On Wed, May 13, 2020 at 02:15:19PM -0700, Andrew Morton wrote:
> On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > 
> > ...
> >
> > Solution
> > 
> > This patch fixes the aging inversion described above on
> > !CONFIG_HIGHMEM systems, without reintroducing the problems associated
> > with excessive shrinker LRU rotations, by keeping populated inodes off
> > the shrinker LRUs entirely.
> > 
> > Currently, inodes are kept off the shrinker LRU as long as they have
> > an elevated i_count, indicating an active user. Unfortunately, the
> > page cache cannot simply hold an i_count reference, because unlink()
> > *should* result in the inode being dropped and its cache invalidated.
> > 
> > Instead, this patch makes iput_final() consult the state of the page
> > cache and punt the LRU linking to the VM if the inode is still
> > populated; the VM in turn checks the inode state when it depopulates
> > the page cache, and adds the inode to the LRU if necessary.
> > 
> > This is not unlike what we do for dirty inodes, which are moved off
> > the LRU permanently until writeback completion puts them back on (iff
> > still unused). We can reuse the same code - inode_add_lru() - here.
> > 
> > This is also not unlike page reclaim, where the lower VM layer has to
> > negotiate state with the higher VFS layer. Follow existing precedent
> > and handle the inversion as much as possible on the VM side:
> > 
> > - introduce an I_PAGES flag that the VM maintains under the i_lock, so
> >   that any inode code holding that lock can check the page cache state
> >   without having to lock and inspect the struct address_space
> 
> Maintaining the same info in two places is a hassle.  Is this
> optimization worthwhile?

Hm, maybe not. I'll try to get rid of it and test cache / LRU state
directly.

> > - introduce inode_pages_set() and inode_pages_clear() to maintain the
> >   inode LRU state from the VM side, then update all cache mutators to
> >   use them when populating the first cache entry or clearing the last
> > 
> > With this, the concept of "inodesteal" - where the inode shrinker
> > drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM
> > is in charge of the cache, the shrinker in charge of struct inode.
> 
> How tested is this on highmem machines?

I don't have a highmem machine, but my code is ifdeffed out on
CONFIG_HIGHMEM so the behavior shouldn't have changed there.
Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..46f67147ad20 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -79,7 +79,7 @@  void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/dax.c b/fs/dax.c
index 35da144375a0..7f3ce4612272 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -478,9 +478,11 @@  static void *grab_mapping_entry(struct xa_state *xas,
 {
 	unsigned long index = xas->xa_index;
 	bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */
+	int populated;
 	void *entry;
 
 retry:
+	populated = 0;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);
 
@@ -526,6 +528,8 @@  static void *grab_mapping_entry(struct xa_state *xas,
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, true);
 		mapping->nrexceptional--;
+		if (mapping_empty(mapping))
+			populated = -1;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -541,11 +545,17 @@  static void *grab_mapping_entry(struct xa_state *xas,
 		dax_lock_entry(xas, entry);
 		if (xas_error(xas))
 			goto out_unlock;
+		if (mapping_empty(mapping))
+			populated++;
 		mapping->nrexceptional++;
 	}
 
 out_unlock:
 	xas_unlock_irq(xas);
+	if (populated == -1)
+		inode_pages_clear(mapping->inode);
+	else if (populated == 1)
+		inode_pages_set(mapping->inode);
 	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
 		goto retry;
 	if (xas->xa_node == XA_ERROR(-ENOMEM))
@@ -631,6 +641,7 @@  static int __dax_invalidate_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
+	bool empty = false;
 	int ret = 0;
 	void *entry;
 
@@ -645,10 +656,13 @@  static int __dax_invalidate_entry(struct address_space *mapping,
 	dax_disassociate_entry(entry, mapping, trunc);
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+	empty = mapping_empty(mapping);
 	ret = 1;
 out:
 	put_unlocked_entry(&xas, entry);
 	xas_unlock_irq(&xas);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	return ret;
 }
 
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index dc1a1d5d825b..a5e9e9053474 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -27,7 +27,7 @@  static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		 * we need to reschedule to avoid softlockups.
 		 */
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0 && !need_resched())) {
+		    (mapping_empty(inode->i_mapping) && !need_resched())) {
 			spin_unlock(&inode->i_lock);
 			continue;
 		}
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..575b780fa9bb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -430,25 +430,87 @@  static void inode_lru_list_add(struct inode *inode)
 		inode->i_state |= I_REFERENCED;
 }
 
+static void inode_lru_list_del(struct inode *inode)
+{
+	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+		this_cpu_dec(nr_unused);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
  * Needs inode->i_lock held.
  */
-void inode_add_lru(struct inode *inode)
+bool inode_add_lru(struct inode *inode)
 {
-	if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
-				I_FREEING | I_WILL_FREE)) &&
-	    !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE)
-		inode_lru_list_add(inode);
+	if (inode->i_state &
+	    (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES))
+		return false;
+	if (atomic_read(&inode->i_count))
+		return false;
+	if (!(inode->i_sb->s_flags & SB_ACTIVE))
+		return false;
+	inode_lru_list_add(inode);
+	return true;
 }
 
+/**
+ * inode_pages_set - mark the inode as holding page cache
+ * @inode: the inode whose first cache page was just added
+ *
+ * Tell the VFS that this inode has populated page cache and must not
+ * be reclaimed by the inode shrinker.
+ *
+ * The caller must hold the page lock of the just-added page: by
+ * pinning the page, the page cache cannot become depopulated, and we
+ * can safely set I_PAGES without a race check under the i_pages lock.
+ *
+ * This function acquires the i_lock.
+ */
+void inode_pages_set(struct inode *inode)
+{
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_PAGES)) {
+		inode->i_state |= I_PAGES;
+		if (!list_empty(&inode->i_lru)) {
+			count_vm_event(PGINODERESCUE);
+			inode_lru_list_del(inode);
+		}
+	}
+	spin_unlock(&inode->i_lock);
+}
 
-static void inode_lru_list_del(struct inode *inode)
+/**
+ * inode_pages_clear - mark the inode as not holding page cache
+ * @inode: the inode whose last cache page was just removed
+ *
+ * Tell the VFS that the inode no longer holds page cache and that its
+ * lifetime is to be handed over to the inode shrinker LRU.
+ *
+ * This function acquires the i_lock and the i_pages lock.
+ */
+void inode_pages_clear(struct inode *inode)
 {
+	struct address_space *mapping = &inode->i_data;
+	bool add_to_lru = false;
+	unsigned long flags;
 
-	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
-		this_cpu_dec(nr_unused);
+	spin_lock(&inode->i_lock);
+
+	xa_lock_irqsave(&mapping->i_pages, flags);
+	if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) {
+		inode->i_state &= ~I_PAGES;
+		add_to_lru = true;
+	}
+	xa_unlock_irqrestore(&mapping->i_pages, flags);
+
+	if (add_to_lru) {
+		WARN_ON_ONCE(!list_empty(&inode->i_lru));
+		if (inode_add_lru(inode))
+			__count_vm_event(PGINODEDELAYED);
+	}
+
+	spin_unlock(&inode->i_lock);
 }
 
 /**
@@ -742,6 +804,8 @@  static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+	WARN_ON_ONCE(inode->i_state & I_PAGES);
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
@@ -761,23 +825,17 @@  static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_ROTATE;
 	}
 
-	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-		__iget(inode);
+	/*
+	 * Populated inodes shouldn't be on the shrinker LRU, but they
+	 * can be briefly visible when a new page is added to an inode
+	 * that was already linked but inode_pages_set() hasn't run
+	 * yet to move them off.
+	 */
+	if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
+		list_lru_isolate(lru, &inode->i_lru);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(lru_lock);
-		if (remove_inode_buffers(inode)) {
-			unsigned long reap;
-			reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
-			if (current_is_kswapd())
-				__count_vm_events(KSWAPD_INODESTEAL, reap);
-			else
-				__count_vm_events(PGINODESTEAL, reap);
-			if (current->reclaim_state)
-				current->reclaim_state->reclaimed_slab += reap;
-		}
-		iput(inode);
-		spin_lock(lru_lock);
-		return LRU_RETRY;
+		this_cpu_dec(nr_unused);
+		return LRU_REMOVED;
 	}
 
 	WARN_ON(inode->i_state & I_NEW);
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..4a9dc77e817b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -139,7 +139,7 @@  extern int vfs_open(const struct path *, struct file *);
  * inode.c
  */
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
-extern void inode_add_lru(struct inode *inode);
+extern bool inode_add_lru(struct inode *inode);
 extern int dentry_needs_remove_privs(struct dentry *dentry);
 
 /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..a98d9dee39f4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -585,6 +585,11 @@  static inline void mapping_allow_writable(struct address_space *mapping)
 	atomic_inc(&mapping->i_mmap_writable);
 }
 
+static inline bool mapping_empty(struct address_space *mapping)
+{
+	return mapping->nrpages + mapping->nrexceptional == 0;
+}
+
 /*
  * Use sequence counter to get consistent i_size on 32-bit processors.
  */
@@ -2150,6 +2155,9 @@  static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_PAGES		Inode is holding page cache that needs to get reclaimed
+ *			first before the inode can go onto the shrinker LRU.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2172,6 +2180,7 @@  static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_PAGES			(1 << 16)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
@@ -3097,6 +3106,9 @@  static inline void remove_inode_hash(struct inode *inode)
 		__remove_inode_hash(inode);
 }
 
+extern void inode_pages_set(struct inode *inode);
+extern void inode_pages_clear(struct inode *inode);
+
 extern void inode_sb_list_add(struct inode *inode);
 
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ccb14b6a16b5..ae4d90bd0873 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -609,7 +609,7 @@  int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+extern bool __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..f31026ccf590 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,7 +38,8 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
-		PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
+		SLABS_SCANNED,
+		PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..fcc24b3b3540 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,8 +116,8 @@ 
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
-static void page_cache_delete(struct address_space *mapping,
-				   struct page *page, void *shadow)
+static bool __must_check page_cache_delete(struct address_space *mapping,
+					   struct page *page, void *shadow)
 {
 	XA_STATE(xas, &mapping->i_pages, page->index);
 	unsigned int nr = 1;
@@ -151,6 +151,8 @@  static void page_cache_delete(struct address_space *mapping,
 		smp_wmb();
 	}
 	mapping->nrpages -= nr;
+
+	return mapping_empty(mapping);
 }
 
 static void unaccount_page_cache_page(struct address_space *mapping,
@@ -227,15 +229,18 @@  static void unaccount_page_cache_page(struct address_space *mapping,
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the i_pages lock.
+ *
+ * If this returns true, the caller must call inode_pages_clear()
+ * after dropping the i_pages lock.
  */
-void __delete_from_page_cache(struct page *page, void *shadow)
+bool __must_check __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 
 	unaccount_page_cache_page(mapping, page);
-	page_cache_delete(mapping, page, shadow);
+	return page_cache_delete(mapping, page, shadow);
 }
 
 static void page_cache_free_page(struct address_space *mapping,
@@ -267,12 +272,16 @@  void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
+	bool empty;
 
 	BUG_ON(!PageLocked(page));
 	xa_lock_irqsave(&mapping->i_pages, flags);
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	page_cache_free_page(mapping, page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
@@ -291,8 +300,8 @@  EXPORT_SYMBOL(delete_from_page_cache);
  *
  * The function expects the i_pages lock to be held.
  */
-static void page_cache_delete_batch(struct address_space *mapping,
-			     struct pagevec *pvec)
+static bool __must_check page_cache_delete_batch(struct address_space *mapping,
+						 struct pagevec *pvec)
 {
 	XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index);
 	int total_pages = 0;
@@ -337,12 +346,15 @@  static void page_cache_delete_batch(struct address_space *mapping,
 		total_pages++;
 	}
 	mapping->nrpages -= total_pages;
+
+	return mapping_empty(mapping);
 }
 
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec)
 {
 	int i;
+	bool empty;
 	unsigned long flags;
 
 	if (!pagevec_count(pvec))
@@ -354,9 +366,12 @@  void delete_from_page_cache_batch(struct address_space *mapping,
 
 		unaccount_page_cache_page(mapping, pvec->pages[i]);
 	}
-	page_cache_delete_batch(mapping, pvec);
+	empty = page_cache_delete_batch(mapping, pvec);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	for (i = 0; i < pagevec_count(pvec); i++)
 		page_cache_free_page(mapping, pvec->pages[i]);
 }
@@ -831,9 +846,10 @@  static int __add_to_page_cache_locked(struct page *page,
 				      void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
+	int error;
 	int huge = PageHuge(page);
 	struct mem_cgroup *memcg;
-	int error;
+	bool populated = false;
 	void *old;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -860,6 +876,7 @@  static int __add_to_page_cache_locked(struct page *page,
 		if (xas_error(&xas))
 			goto unlock;
 
+		populated = mapping_empty(mapping);
 		if (xa_is_value(old)) {
 			mapping->nrexceptional--;
 			if (shadowp)
@@ -880,6 +897,10 @@  static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
+
+	if (populated)
+		inode_pages_set(mapping->host);
+
 	return 0;
 error:
 	page->mapping = NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..8b3e33a52d93 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2535,7 +2535,8 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
-			__delete_from_page_cache(head + i, NULL);
+			/* We know we're not removing the last page */
+			(void)__delete_from_page_cache(head + i, NULL);
 			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
 				shmem_uncharge(head->mapping->host, 1);
 			put_page(head + i);
diff --git a/mm/truncate.c b/mm/truncate.c
index dd9ebc1da356..8fb6c2f762bc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -31,24 +31,31 @@ 
  * itself locked.  These unlocked entries need verification under the tree
  * lock.
  */
-static inline void __clear_shadow_entry(struct address_space *mapping,
-				pgoff_t index, void *entry)
+static bool __must_check __clear_shadow_entry(struct address_space *mapping,
+					      pgoff_t index, void *entry)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 
 	xas_set_update(&xas, workingset_update_node);
 	if (xas_load(&xas) != entry)
-		return;
+		return 0;
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
+
+	return mapping_empty(mapping);
 }
 
 static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
 			       void *entry)
 {
+	bool empty;
+
 	xa_lock_irq(&mapping->i_pages);
-	__clear_shadow_entry(mapping, index, entry);
+	empty = __clear_shadow_entry(mapping, index, entry);
 	xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
 }
 
 /*
@@ -61,7 +68,7 @@  static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 				pgoff_t end)
 {
 	int i, j;
-	bool dax, lock;
+	bool dax, lock, empty = false;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
@@ -96,11 +103,16 @@  static void truncate_exceptional_pvec_entries(struct address_space *mapping,
 			continue;
 		}
 
-		__clear_shadow_entry(mapping, index, page);
+		if (__clear_shadow_entry(mapping, index, page))
+			empty = true;
 	}
 
 	if (lock)
 		xa_unlock_irq(&mapping->i_pages);
+
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	pvec->nr = j;
 }
 
@@ -300,7 +312,7 @@  void truncate_inode_pages_range(struct address_space *mapping,
 	pgoff_t		index;
 	int		i;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	/* Offsets within partial pages */
@@ -636,6 +648,7 @@  static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
 	unsigned long flags;
+	bool empty;
 
 	if (page->mapping != mapping)
 		return 0;
@@ -648,9 +661,12 @@  invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page, NULL);
+	empty = __delete_from_page_cache(page, NULL);
 	xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+	if (empty)
+		inode_pages_clear(mapping->host);
+
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
 
@@ -692,7 +708,7 @@  int invalidate_inode_pages2_range(struct address_space *mapping,
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
-	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+	if (mapping_empty(mapping))
 		goto out;
 
 	pagevec_init(&pvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05eb9efec07..c82e9831003f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,6 +901,7 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
+		int empty;
 
 		freepage = mapping->a_ops->freepage;
 		/*
@@ -922,9 +923,12 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 		if (reclaimed && page_is_file_cache(page) &&
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(page, target_memcg);
-		__delete_from_page_cache(page, shadow);
+		empty = __delete_from_page_cache(page, shadow);
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 
+		if (empty)
+			inode_pages_clear(mapping->host);
+
 		if (freepage != NULL)
 			freepage(page);
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..ae802253f71c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1203,9 +1203,11 @@  const char * const vmstat_text[] = {
 #ifdef CONFIG_NUMA
 	"zone_reclaim_failed",
 #endif
-	"pginodesteal",
 	"slabs_scanned",
+	"pginodesteal",
 	"kswapd_inodesteal",
+	"pginoderescue",
+	"pginodedelayed",
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"pageoutrun",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b76ced..7ce9c74ebfdb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -491,6 +491,7 @@  static enum lru_status shadow_lru_isolate(struct list_head *item,
 	struct xa_node *node = container_of(item, struct xa_node, private_list);
 	XA_STATE(xas, node->array, 0);
 	struct address_space *mapping;
+	bool empty = false;
 	int ret;
 
 	/*
@@ -529,6 +530,7 @@  static enum lru_status shadow_lru_isolate(struct list_head *item,
 	if (WARN_ON_ONCE(node->count != node->nr_values))
 		goto out_invalid;
 	mapping->nrexceptional -= node->nr_values;
+	empty = mapping_empty(mapping);
 	xas.xa_node = xa_parent_locked(&mapping->i_pages, node);
 	xas.xa_offset = node->offset;
 	xas.xa_shift = node->shift + XA_CHUNK_SHIFT;
@@ -542,6 +544,8 @@  static enum lru_status shadow_lru_isolate(struct list_head *item,
 
 out_invalid:
 	xa_unlock_irq(&mapping->i_pages);
+	if (empty)
+		inode_pages_clear(mapping->host);
 	ret = LRU_REMOVED_RETRY;
 out:
 	cond_resched();