Message ID | 20200211175507.178100-1-hannes@cmpxchg.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vfs: keep inodes with page cache off the inode shrinker LRU | expand |
On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote: > However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm: > don't reclaim inodes with many attached pages"") because it caused > severe reclaim performance problems: Inodes that sit on the shrinker > LRU are attracting reclaim pressure away from the page cache and > toward the VFS. If we then permanently exempt sizable portions of this > pool from actually getting reclaimed when looked at, this pressure > accumulates as deferred shrinker work (a mechanism for *temporarily* > unreclaimable objects) until it causes mayhem in the VFS cache pools. > > In the bug quoted in 69056ee6a8a3 in particular, the excessive > pressure drove the XFS shrinker into dirty objects, where it caused > synchronous, IO-bound stalls, even as there was plenty of clean page > cache that should have been reclaimed instead. A note on testing: the patch behaves much better on my machine and the inode shrinker doesn't drop hot page cache anymore, without noticable downsides so far. However, I tried to reproduce the xfs case that caused the 69056ee6a8a3 revert and haven't managed to do so yet on 5.5 plus the reverted patch. I cannot provoke higher inode sync stalls in the xfs shrinker regardless of shrinker strategy. Maybe something else changed since 4.19 and it's less of a concern now. Nonetheless, I'm interested in opinions on the premise of this patch. And Yafang is working on his memcg-specific fix for this issue, so I wanted to put this proposal on the table sooner than later. Thanks
On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote: > The VFS inode shrinker is currently allowed to reclaim inodes with > populated page cache. As a result it can drop gigabytes of hot and > active page cache on the floor without consulting the VM (recorded as > "inodesteal" events in /proc/vmstat). > > This causes real problems in practice. Consider for example how the > VM > would cache a source tree, such as the Linux git tree. As large parts > of the checked out files and the object database are accessed > repeatedly, the page cache holding this data gets moved to the active > list, where it's fully (and indefinitely) insulated from one-off > cache > moving through the inactive list. > This behavior of invalidating page cache from the inode shrinker goes > back to even before the git import of the kernel tree. It may have > been less noticeable when the VM itself didn't have real workingset > protection, and floods of one-off cache would push out any active > cache over time anyway. But the VM has come a long way since then and > the inode shrinker is now actively subverting its caching strategy. Two things come to mind when looking at this: - highmem - NUMA IIRC one of the reasons reclaim is done in this way is because a page cache page in one area of memory (highmem, or a NUMA node) can end up pinning inode slab memory in another memory area (normal zone, other NUMA node). I do not know how much of a concern that still is nowadays, but it seemed something worth bringing up.
On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote: > On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote: > > The VFS inode shrinker is currently allowed to reclaim inodes with > > populated page cache. As a result it can drop gigabytes of hot and > > active page cache on the floor without consulting the VM (recorded as > > "inodesteal" events in /proc/vmstat). > > > > This causes real problems in practice. Consider for example how the > > VM > > would cache a source tree, such as the Linux git tree. As large parts > > of the checked out files and the object database are accessed > > repeatedly, the page cache holding this data gets moved to the active > > list, where it's fully (and indefinitely) insulated from one-off > > cache > > moving through the inactive list. > > > This behavior of invalidating page cache from the inode shrinker goes > > back to even before the git import of the kernel tree. It may have > > been less noticeable when the VM itself didn't have real workingset > > protection, and floods of one-off cache would push out any active > > cache over time anyway. But the VM has come a long way since then and > > the inode shrinker is now actively subverting its caching strategy. > > Two things come to mind when looking at this: > - highmem > - NUMA > > IIRC one of the reasons reclaim is done in this way is > because a page cache page in one area of memory (highmem, > or a NUMA node) can end up pinning inode slab memory in > another memory area (normal zone, other NUMA node). That's a good point, highmem does ring a bell now that you mention it. If we still care, I think this could be solved by doing something similar to what we do with buffer_heads_over_limit: allow a lowmem allocation to reclaim page cache inside the highmem zone if the bhs (or inodes in this case) have accumulated excessively. AFAICS, we haven't done anything similar for NUMA, so it might not be much of a problem there. I could imagine this is in part because NUMA nodes tend to be more balanced in size, and the ratio between cache memory and inode/bh memory means that these objects won't turn into a significant externality. Whereas with extreme highmem:lowmem ratios, they can.
On Tue, 11 Feb 2020 14:31:01 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote: > > On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote: > > > The VFS inode shrinker is currently allowed to reclaim inodes with > > > populated page cache. As a result it can drop gigabytes of hot and > > > active page cache on the floor without consulting the VM (recorded as > > > "inodesteal" events in /proc/vmstat). > > > > > > This causes real problems in practice. Consider for example how the > > > VM > > > would cache a source tree, such as the Linux git tree. As large parts > > > of the checked out files and the object database are accessed > > > repeatedly, the page cache holding this data gets moved to the active > > > list, where it's fully (and indefinitely) insulated from one-off > > > cache > > > moving through the inactive list. > > > > > This behavior of invalidating page cache from the inode shrinker goes > > > back to even before the git import of the kernel tree. It may have > > > been less noticeable when the VM itself didn't have real workingset > > > protection, and floods of one-off cache would push out any active > > > cache over time anyway. But the VM has come a long way since then and > > > the inode shrinker is now actively subverting its caching strategy. > > > > Two things come to mind when looking at this: > > - highmem > > - NUMA > > > > IIRC one of the reasons reclaim is done in this way is > > because a page cache page in one area of memory (highmem, > > or a NUMA node) can end up pinning inode slab memory in > > another memory area (normal zone, other NUMA node). > > That's a good point, highmem does ring a bell now that you mention it. Yup, that's why this mechanism exists. Here: https://marc.info/?l=git-commits-head&m=103646757213266&w=2 > If we still care, I think this could be solved by doing something > similar to what we do with buffer_heads_over_limit: allow a lowmem > allocation to reclaim page cache inside the highmem zone if the bhs > (or inodes in this case) have accumulated excessively. Well, reclaiming highmem pagecache at random would be a painful way to reclaim lowmem inodes. Better to pick an inode then shoot down all its pagecache. Perhaps we could take its pagecache's aging into account. Testing this will be a challenge, but the issue was real - a 7GB highmem machine isn't crazy and I expect the inode has become larger since those days. > AFAICS, we haven't done anything similar for NUMA, so it might not be > much of a problem there. I could imagine this is in part because NUMA > nodes tend to be more balanced in size, and the ratio between cache > memory and inode/bh memory means that these objects won't turn into a > significant externality. Whereas with extreme highmem:lowmem ratios, > they can.
On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > Testing this will be a challenge, but the issue was real - a 7GB > highmem machine isn't crazy and I expect the inode has become larger > since those days. Hmm. I would say that in the intening years a 7GB highmem machine has indeed become crazy. It used to be something we kind of supported. But we really should consider HIGHMEM to be something that is on the deprecation list. In this day and age, there is no excuse for running a 32-bit kernel with lots of physical memory. And if you really want to do that, and have some legacy hardware with a legacy use case, maybe you should be using a legacy kernel. I'd personally be perfectly happy to start removing HIGHMEM support again. Linus
On Tue, 11 Feb 2020 16:28:39 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Testing this will be a challenge, but the issue was real - a 7GB > > highmem machine isn't crazy and I expect the inode has become larger > > since those days. > > Hmm. I would say that in the intening years a 7GB highmem machine has > indeed become crazy. > > It used to be something we kind of supported. > > But we really should consider HIGHMEM to be something that is on the > deprecation list. In this day and age, there is no excuse for running > a 32-bit kernel with lots of physical memory. > > And if you really want to do that, and have some legacy hardware with > a legacy use case, maybe you should be using a legacy kernel. > > I'd personally be perfectly happy to start removing HIGHMEM support again. > That would be nice. What's the situation with highmem on ARM?
On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > What's the situation with highmem on ARM? Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever needed it, and I was ranting at some people for repeating all the mistakes Intel did. But arm64 doesn't need it, and while 32-bit arm is obviosuly still selling, I think that in many ways the switch-over to 64-bit has been quicker on ARM than it was on x86. Partly because it happened later (so all the 64-bit teething pains were dealt with), but largely because everybody ended up actively discouraging 32-bit on the Android side. There were a couple of unfortunate early 32-bit arm server attempts, but they were - predictably - complete garbage and nobody bought them. They don't exist any more. So at least my gut feel is that the arm people don't have any big reason to push for maintaining HIGHMEM support either. But I'm adding a couple of arm people and the arm list just in case they have some input. [ Obvious background for newly added people: we're talking about making CONFIG_HIGHMEM a deprecated feature and saying that if you want to run with lots of memory on a 32-bit kernel, you're doing legacy stuff and can use a legacy kernel ] Linus
On Tue, Feb 11, 2020 at 04:28:39PM -0800, Linus Torvalds wrote: > On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > Testing this will be a challenge, but the issue was real - a 7GB > > highmem machine isn't crazy and I expect the inode has become larger > > since those days. > > Hmm. I would say that in the intening years a 7GB highmem machine has > indeed become crazy. > > It used to be something we kind of supported. > > But we really should consider HIGHMEM to be something that is on the > deprecation list. In this day and age, there is no excuse for running > a 32-bit kernel with lots of physical memory. > > And if you really want to do that, and have some legacy hardware with > a legacy use case, maybe you should be using a legacy kernel. > > I'd personally be perfectly happy to start removing HIGHMEM support again. Do we have a use case where people want to run modern 32-bit guest kernels with more than 896MB of memory to support some horrendous legacy app? Or is our 32-bit compat story good enough that nobody actually does this? (Contrariwise, maybe those people are also fine with running a ten year old kernel because a `uname -r` starting with '3' breaks said app)
On Tue 11-02-20 16:28:39, Linus Torvalds wrote: > On Tue, Feb 11, 2020 at 3:44 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Testing this will be a challenge, but the issue was real - a 7GB > > highmem machine isn't crazy and I expect the inode has become larger > > since those days. > > Hmm. I would say that in the intening years a 7GB highmem machine has > indeed become crazy. Absolutely agreed. > It used to be something we kind of supported. And it's been few years since we have been actively discouraging people from using 32b kernels with a lot of memory. There are bug reports popping out from time to time but I do not remember any case where using 64b kernel would be a no-go. So my strong suspicion is that people simply keep their kernels on 32b without a good reason because it tends to work most of the time until they hit one of the lowmem problems and they move over to 64b. > But we really should consider HIGHMEM to be something that is on the > deprecation list. In this day and age, there is no excuse for running > a 32-bit kernel with lots of physical memory. > > And if you really want to do that, and have some legacy hardware with > a legacy use case, maybe you should be using a legacy kernel. > > I'd personally be perfectly happy to start removing HIGHMEM support again. I wouldn't be opposed at all.
On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote: > On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > What's the situation with highmem on ARM? > > Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever > needed it, and I was ranting at some people for repeating all the > mistakes Intel did. > > But arm64 doesn't need it, and while 32-bit arm is obviosuly still > selling, I think that in many ways the switch-over to 64-bit has been > quicker on ARM than it was on x86. Partly because it happened later > (so all the 64-bit teething pains were dealt with), but largely > because everybody ended up actively discouraging 32-bit on the Android > side. > > There were a couple of unfortunate early 32-bit arm server attempts, > but they were - predictably - complete garbage and nobody bought them. > They don't exist any more. > > So at least my gut feel is that the arm people don't have any big > reason to push for maintaining HIGHMEM support either. > > But I'm adding a couple of arm people and the arm list just in case > they have some input. > > [ Obvious background for newly added people: we're talking about > making CONFIG_HIGHMEM a deprecated feature and saying that if you want > to run with lots of memory on a 32-bit kernel, you're doing legacy > stuff and can use a legacy kernel ] Well, the recent 32-bit ARM systems generally have more than 1G of memory, so make use of highmem as a rule. You're probably talking about crippling support for any 32-bit ARM system produced in the last 8 to 10 years.
On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > The VFS inode shrinker is currently allowed to reclaim inodes with > populated page cache. As a result it can drop gigabytes of hot and > active page cache on the floor without consulting the VM (recorded as > "inodesteal" events in /proc/vmstat). > > This causes real problems in practice. Consider for example how the VM > would cache a source tree, such as the Linux git tree. As large parts > of the checked out files and the object database are accessed > repeatedly, the page cache holding this data gets moved to the active > list, where it's fully (and indefinitely) insulated from one-off cache > moving through the inactive list. > > However, due to the way users interact with the tree, no ongoing open > file descriptors into the source tree are maintained, and the inodes > end up on the "unused inode" shrinker LRU. A larger burst of one-off > cache (find, updatedb, etc.) can now drive the VFS shrinkers to drop > first the dentries and then the inodes - inodes that contain the most > valuable data currently held by the page cache - while there is plenty > of one-off cache that could be reclaimed instead. > > This doesn't make sense. The inodes aren't really "unused" as long as > the VM deems it worthwhile to hold on to their page cache. And the > shrinker can't possibly guess what is and isn't valuable to the VM > based on recent inode reference information alone (we could delete > several thousand lines of reclaim code if it could). > > History > > This behavior of invalidating page cache from the inode shrinker goes > back to even before the git import of the kernel tree. It may have > been less noticeable when the VM itself didn't have real workingset > protection, and floods of one-off cache would push out any active > cache over time anyway. But the VM has come a long way since then and > the inode shrinker is now actively subverting its caching strategy. > > As this keeps causing problems for people, there have been several > attempts to address this. > > One recent attempt was to make the inode shrinker simply skip over > inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim > inodes with many attached pages"). > > However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm: > don't reclaim inodes with many attached pages"") because it caused > severe reclaim performance problems: Inodes that sit on the shrinker > LRU are attracting reclaim pressure away from the page cache and > toward the VFS. If we then permanently exempt sizable portions of this > pool from actually getting reclaimed when looked at, this pressure > accumulates as deferred shrinker work (a mechanism for *temporarily* > unreclaimable objects) until it causes mayhem in the VFS cache pools. > > In the bug quoted in 69056ee6a8a3 in particular, the excessive > pressure drove the XFS shrinker into dirty objects, where it caused > synchronous, IO-bound stalls, even as there was plenty of clean page > cache that should have been reclaimed instead. > > Another variant of this problem was recently observed, where the > kernel violates cgroups' memory.low protection settings and reclaims > page cache way beyond the configured thresholds. It was followed by a > proposal of a modified form of the reverted commit above, that > implements memory.low-sensitive shrinker skipping over populated > inodes on the LRU [1]. However, this proposal continues to run the > risk of attracting disproportionate reclaim pressure to a pool of > still-used inodes, Hi Johannes, If you really think that is a risk, what about bellow additional patch to fix this risk ? diff --git a/fs/inode.c b/fs/inode.c index 80dddbc..61862d9 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode, goto out; cgroup_size = mem_cgroup_size(memcg); - if (inode->i_data.nrpages + protection >= cgroup_size) + if (inode->i_data.nrpages) reclaimable = false; out: With this additional patch, we skip all inodes in this memcg until all its page cache pages are reclaimed. > while not addressing the more generic reclaim > inversion problem outside of a very specific cgroup application. > But I have a different understanding. This method works like a knob. If you really care about your workingset (data), you should turn it on (i.e. by using memcg protection to protect them), while if you don't care about your workingset (data) then you'd better turn it off. That would be more flexible. Regaring your case in the commit log, why not protect your linux git tree with memcg protection ? > [1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/ > > Solution > > To fix the reclaim inversion described in the beginning, without > reintroducing the problems associated with shrinker LRU rotations, > this patch keeps populated inodes off the LRUs entirely. > > Currently, inodes are kept off the shrinker LRU as long as they have > an elevated i_count, indicating an active user. Unfortunately, the > page cache cannot simply hold an i_count reference, because unlink() > *should* result in the inode being dropped and its cache invalidated. > > Instead, this patch makes iput_final() consult the state of the page > cache and punt the LRU linking to the VM if the inode is still > populated; the VM in turn checks the inode state when it depopulates > the page cache, and adds the inode to the LRU if necessary. > > This is not unlike what we do for dirty inodes, which are moved off > the LRU permanently until writeback completion puts them back on (iff > still unused). We can reuse the same code -- inode_add_lru() - here. > > This is also not unlike page reclaim, where the lower VM layer has to > negotiate state with the higher VFS layer. Follow existing precedence > and handle the inversion as much as possible on the VM side: > > - introduce an I_PAGES flag that the VM maintains under the i_lock, so > that any inode code holding that lock can check the page cache state > without having to lock and inspect the struct address_space > > - introduce inode_pages_set() and inode_pages_clear() to maintain the > inode LRU state from the VM side, then update all cache mutators to > use them when populating the first cache entry or clearing the last > > With this, the concept of "inodesteal" - where the inode shrinker > drops page cache - is a thing of the past. The VM is in charge of the > page cache, the inode shrinker is in charge of freeing struct inode. > > Footnotes > > - For debuggability, add vmstat counters that track the number of > times a new cache entry pulls a previously unused inode off the LRU > (pginoderescue), as well as how many times existing cache deferred > an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters > for backwards compatibility, but they'll just show 0 now. > > - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page > cache. Not doing so has always been a bit strange, but since most > people drop cache and metadata cache together, the inode shrinker > would have taken care of them before - no more, so do it VM-side. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > fs/block_dev.c | 2 +- > fs/dax.c | 14 +++++ > fs/drop_caches.c | 2 +- > fs/inode.c | 106 ++++++++++++++++++++++++++-------- > fs/internal.h | 2 +- > include/linux/fs.h | 12 ++++ > include/linux/pagemap.h | 2 +- > include/linux/vm_event_item.h | 3 +- > mm/filemap.c | 39 ++++++++++--- > mm/huge_memory.c | 3 +- > mm/truncate.c | 34 ++++++++--- > mm/vmscan.c | 6 +- > mm/vmstat.c | 4 +- > mm/workingset.c | 4 ++ > 14 files changed, 183 insertions(+), 50 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 69bf2fb6f7cd..46f67147ad20 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev) > { > struct address_space *mapping = bdev->bd_inode->i_mapping; > > - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) > + if (mapping_empty(mapping)) > return; > > invalidate_bh_lrus(); > diff --git a/fs/dax.c b/fs/dax.c > index 35da144375a0..7f3ce4612272 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas, > { > unsigned long index = xas->xa_index; > bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */ > + int populated; > void *entry; > > retry: > + populated = 0; > xas_lock_irq(xas); > entry = get_unlocked_entry(xas, order); > > @@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas, > xas_store(xas, NULL); /* undo the PMD join */ > dax_wake_entry(xas, entry, true); > mapping->nrexceptional--; > + if (mapping_empty(mapping)) > + populated = -1; > entry = NULL; > xas_set(xas, index); > } > @@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas, > dax_lock_entry(xas, entry); > if (xas_error(xas)) > goto out_unlock; > + if (mapping_empty(mapping)) > + populated++; > mapping->nrexceptional++; > } > > out_unlock: > xas_unlock_irq(xas); > + if (populated == -1) > + inode_pages_clear(mapping->inode); > + else if (populated == 1) > + inode_pages_set(mapping->inode); > if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) > goto retry; > if (xas->xa_node == XA_ERROR(-ENOMEM)) > @@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, > pgoff_t index, bool trunc) > { > XA_STATE(xas, &mapping->i_pages, index); > + bool empty = false; > int ret = 0; > void *entry; > > @@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping, > dax_disassociate_entry(entry, mapping, trunc); > xas_store(&xas, NULL); > mapping->nrexceptional--; > + empty = mapping_empty(mapping); > ret = 1; > out: > put_unlocked_entry(&xas, entry); > xas_unlock_irq(&xas); > + if (empty) > + inode_pages_clear(mapping->host); > return ret; > } > > diff --git a/fs/drop_caches.c b/fs/drop_caches.c > index dc1a1d5d825b..a5e9e9053474 100644 > --- a/fs/drop_caches.c > +++ b/fs/drop_caches.c > @@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) > * we need to reschedule to avoid softlockups. > */ > if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || > - (inode->i_mapping->nrpages == 0 && !need_resched())) { > + (mapping_empty(inode->i_mapping) && !need_resched())) { > spin_unlock(&inode->i_lock); > continue; > } > diff --git a/fs/inode.c b/fs/inode.c > index 7d57068b6b7a..575b780fa9bb 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -430,25 +430,87 @@ static void inode_lru_list_add(struct inode *inode) > inode->i_state |= I_REFERENCED; > } > > +static void inode_lru_list_del(struct inode *inode) > +{ > + if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) > + this_cpu_dec(nr_unused); > +} > + > /* > * Add inode to LRU if needed (inode is unused and clean). > * > * Needs inode->i_lock held. > */ > -void inode_add_lru(struct inode *inode) > +bool inode_add_lru(struct inode *inode) > { > - if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC | > - I_FREEING | I_WILL_FREE)) && > - !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE) > - inode_lru_list_add(inode); > + if (inode->i_state & > + (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES)) > + return false; > + if (atomic_read(&inode->i_count)) > + return false; > + if (!(inode->i_sb->s_flags & SB_ACTIVE)) > + return false; > + inode_lru_list_add(inode); > + return true; > } > > +/** > + * inode_pages_set - mark the inode as holding page cache > + * @inode: the inode whose first cache page was just added > + * > + * Tell the VFS that this inode has populated page cache and must not > + * be reclaimed by the inode shrinker. > + * > + * The caller must hold the page lock of the just-added page: by > + * pinning the page, the page cache cannot become depopulated, and we > + * can safely set I_PAGES without a race check under the i_pages lock. > + * > + * This function acquires the i_lock. > + */ > +void inode_pages_set(struct inode *inode) > +{ > + spin_lock(&inode->i_lock); > + if (!(inode->i_state & I_PAGES)) { > + inode->i_state |= I_PAGES; > + if (!list_empty(&inode->i_lru)) { > + count_vm_event(PGINODERESCUE); > + inode_lru_list_del(inode); > + } > + } > + spin_unlock(&inode->i_lock); > +} > > -static void inode_lru_list_del(struct inode *inode) > +/** > + * inode_pages_clear - mark the inode as not holding page cache > + * @inode: the inode whose last cache page was just removed > + * > + * Tell the VFS that the inode no longer holds page cache and that its > + * lifetime is to be handed over to the inode shrinker LRU. > + * > + * This function acquires the i_lock and the i_pages lock. > + */ > +void inode_pages_clear(struct inode *inode) > { > + struct address_space *mapping = &inode->i_data; > + bool add_to_lru = false; > + unsigned long flags; > > - if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) > - this_cpu_dec(nr_unused); > + spin_lock(&inode->i_lock); > + > + xa_lock_irqsave(&mapping->i_pages, flags); > + if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) { > + inode->i_state &= ~I_PAGES; > + add_to_lru = true; > + } > + xa_unlock_irqrestore(&mapping->i_pages, flags); > + > + if (add_to_lru) { > + WARN_ON_ONCE(!list_empty(&inode->i_lru)); > + if (inode_add_lru(inode)) > + __count_vm_event(PGINODEDELAYED); > + } > + > + spin_unlock(&inode->i_lock); > } > > /** > @@ -742,6 +804,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item, > if (!spin_trylock(&inode->i_lock)) > return LRU_SKIP; > > + WARN_ON_ONCE(inode->i_state & I_PAGES); > + > /* > * Referenced or dirty inodes are still in use. Give them another pass > * through the LRU as we canot reclaim them now. > @@ -761,23 +825,17 @@ static enum lru_status inode_lru_isolate(struct list_head *item, > return LRU_ROTATE; > } > > - if (inode_has_buffers(inode) || inode->i_data.nrpages) { > - __iget(inode); > + /* > + * Populated inodes shouldn't be on the shrinker LRU, but they > + * can be briefly visible when a new page is added to an inode > + * that was already linked but inode_pages_set() hasn't run > + * yet to move them off. > + */ > + if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { > + list_lru_isolate(lru, &inode->i_lru); > spin_unlock(&inode->i_lock); > - spin_unlock(lru_lock); > - if (remove_inode_buffers(inode)) { > - unsigned long reap; > - reap = invalidate_mapping_pages(&inode->i_data, 0, -1); > - if (current_is_kswapd()) > - __count_vm_events(KSWAPD_INODESTEAL, reap); > - else > - __count_vm_events(PGINODESTEAL, reap); > - if (current->reclaim_state) > - current->reclaim_state->reclaimed_slab += reap; > - } > - iput(inode); > - spin_lock(lru_lock); > - return LRU_RETRY; > + this_cpu_dec(nr_unused); > + return LRU_REMOVED; > } > > WARN_ON(inode->i_state & I_NEW); > diff --git a/fs/internal.h b/fs/internal.h > index f3f280b952a3..4a9dc77e817b 100644 > --- a/fs/internal.h > +++ b/fs/internal.h > @@ -139,7 +139,7 @@ extern int vfs_open(const struct path *, struct file *); > * inode.c > */ > extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); > -extern void inode_add_lru(struct inode *inode); > +extern bool inode_add_lru(struct inode *inode); > extern int dentry_needs_remove_privs(struct dentry *dentry); > > /* > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3cd4fe6b845e..a98d9dee39f4 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -585,6 +585,11 @@ static inline void mapping_allow_writable(struct address_space *mapping) > atomic_inc(&mapping->i_mmap_writable); > } > > +static inline bool mapping_empty(struct address_space *mapping) > +{ > + return mapping->nrpages + mapping->nrexceptional == 0; > +} > + > /* > * Use sequence counter to get consistent i_size on 32-bit processors. > */ > @@ -2150,6 +2155,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, > * > * I_CREATING New object's inode in the middle of setting up. > * > + * I_PAGES Inode is holding page cache that needs to get reclaimed > + * first before the inode can go onto the shrinker LRU. > + * > * Q: What is the difference between I_WILL_FREE and I_FREEING? > */ > #define I_DIRTY_SYNC (1 << 0) > @@ -2172,6 +2180,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, > #define I_WB_SWITCH (1 << 13) > #define I_OVL_INUSE (1 << 14) > #define I_CREATING (1 << 15) > +#define I_PAGES (1 << 16) > > #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) > #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES) > @@ -3097,6 +3106,9 @@ static inline void remove_inode_hash(struct inode *inode) > __remove_inode_hash(inode); > } > > +extern void inode_pages_set(struct inode *inode); > +extern void inode_pages_clear(struct inode *inode); > + > extern void inode_sb_list_add(struct inode *inode); > > #ifdef CONFIG_BLOCK > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index ccb14b6a16b5..ae4d90bd0873 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -609,7 +609,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > int add_to_page_cache_lru(struct page *page, struct address_space *mapping, > pgoff_t index, gfp_t gfp_mask); > extern void delete_from_page_cache(struct page *page); > -extern void __delete_from_page_cache(struct page *page, void *shadow); > +extern bool __delete_from_page_cache(struct page *page, void *shadow); > int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); > void delete_from_page_cache_batch(struct address_space *mapping, > struct pagevec *pvec); > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 47a3441cf4c4..f31026ccf590 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_NUMA > PGSCAN_ZONE_RECLAIM_FAILED, > #endif > - PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL, > + SLABS_SCANNED, > + PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED, > KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY, > PAGEOUTRUN, PGROTATED, > DROP_PAGECACHE, DROP_SLAB, > diff --git a/mm/filemap.c b/mm/filemap.c > index 1784478270e1..fcc24b3b3540 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -116,8 +116,8 @@ > * ->tasklist_lock (memory_failure, collect_procs_ao) > */ > > -static void page_cache_delete(struct address_space *mapping, > - struct page *page, void *shadow) > +static bool __must_check page_cache_delete(struct address_space *mapping, > + struct page *page, void *shadow) > { > XA_STATE(xas, &mapping->i_pages, page->index); > unsigned int nr = 1; > @@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping, > smp_wmb(); > } > mapping->nrpages -= nr; > + > + return mapping_empty(mapping); > } > > static void unaccount_page_cache_page(struct address_space *mapping, > @@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping, > * Delete a page from the page cache and free it. Caller has to make > * sure the page is locked and that nobody else uses it - or that usage > * is safe. The caller must hold the i_pages lock. > + * > + * If this returns true, the caller must call inode_pages_clear() > + * after dropping the i_pages lock. > */ > -void __delete_from_page_cache(struct page *page, void *shadow) > +bool __must_check __delete_from_page_cache(struct page *page, void *shadow) > { > struct address_space *mapping = page->mapping; > > trace_mm_filemap_delete_from_page_cache(page); > > unaccount_page_cache_page(mapping, page); > - page_cache_delete(mapping, page, shadow); > + return page_cache_delete(mapping, page, shadow); > } > > static void page_cache_free_page(struct address_space *mapping, > @@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page) > { > struct address_space *mapping = page_mapping(page); > unsigned long flags; > + bool empty; > > BUG_ON(!PageLocked(page)); > xa_lock_irqsave(&mapping->i_pages, flags); > - __delete_from_page_cache(page, NULL); > + empty = __delete_from_page_cache(page, NULL); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > page_cache_free_page(mapping, page); > } > EXPORT_SYMBOL(delete_from_page_cache); > @@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache); > * > * The function expects the i_pages lock to be held. > */ > -static void page_cache_delete_batch(struct address_space *mapping, > - struct pagevec *pvec) > +static bool __must_check page_cache_delete_batch(struct address_space *mapping, > + struct pagevec *pvec) > { > XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index); > int total_pages = 0; > @@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping, > total_pages++; > } > mapping->nrpages -= total_pages; > + > + return mapping_empty(mapping); > } > > void delete_from_page_cache_batch(struct address_space *mapping, > struct pagevec *pvec) > { > int i; > + bool empty; > unsigned long flags; > > if (!pagevec_count(pvec)) > @@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping, > > unaccount_page_cache_page(mapping, pvec->pages[i]); > } > - page_cache_delete_batch(mapping, pvec); > + empty = page_cache_delete_batch(mapping, pvec); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > for (i = 0; i < pagevec_count(pvec); i++) > page_cache_free_page(mapping, pvec->pages[i]); > } > @@ -831,9 +846,10 @@ static int __add_to_page_cache_locked(struct page *page, > void **shadowp) > { > XA_STATE(xas, &mapping->i_pages, offset); > + int error; > int huge = PageHuge(page); > struct mem_cgroup *memcg; > - int error; > + bool populated = false; > void *old; > > VM_BUG_ON_PAGE(!PageLocked(page), page); > @@ -860,6 +876,7 @@ static int __add_to_page_cache_locked(struct page *page, > if (xas_error(&xas)) > goto unlock; > > + populated = mapping_empty(mapping); > if (xa_is_value(old)) { > mapping->nrexceptional--; > if (shadowp) > @@ -880,6 +897,10 @@ static int __add_to_page_cache_locked(struct page *page, > if (!huge) > mem_cgroup_commit_charge(page, memcg, false, false); > trace_mm_filemap_add_to_page_cache(page); > + > + if (populated) > + inode_pages_set(mapping->host); > + > return 0; > error: > page->mapping = NULL; > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index b08b199f9a11..8b3e33a52d93 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2535,7 +2535,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, > /* Some pages can be beyond i_size: drop them from page cache */ > if (head[i].index >= end) { > ClearPageDirty(head + i); > - __delete_from_page_cache(head + i, NULL); > + /* We know we're not removing the last page */ > + (void)__delete_from_page_cache(head + i, NULL); > if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head)) > shmem_uncharge(head->mapping->host, 1); > put_page(head + i); > diff --git a/mm/truncate.c b/mm/truncate.c > index dd9ebc1da356..8fb6c2f762bc 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -31,24 +31,31 @@ > * itself locked. These unlocked entries need verification under the tree > * lock. > */ > -static inline void __clear_shadow_entry(struct address_space *mapping, > - pgoff_t index, void *entry) > +static bool __must_check __clear_shadow_entry(struct address_space *mapping, > + pgoff_t index, void *entry) > { > XA_STATE(xas, &mapping->i_pages, index); > > xas_set_update(&xas, workingset_update_node); > if (xas_load(&xas) != entry) > - return; > + return 0; > xas_store(&xas, NULL); > mapping->nrexceptional--; > + > + return mapping_empty(mapping); > } > > static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, > void *entry) > { > + bool empty; > + > xa_lock_irq(&mapping->i_pages); > - __clear_shadow_entry(mapping, index, entry); > + empty = __clear_shadow_entry(mapping, index, entry); > xa_unlock_irq(&mapping->i_pages); > + > + if (empty) > + inode_pages_clear(mapping->host); > } > > /* > @@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, > pgoff_t end) > { > int i, j; > - bool dax, lock; > + bool dax, lock, empty = false; > > /* Handled by shmem itself */ > if (shmem_mapping(mapping)) > @@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, > continue; > } > > - __clear_shadow_entry(mapping, index, page); > + if (__clear_shadow_entry(mapping, index, page)) > + empty = true; > } > > if (lock) > xa_unlock_irq(&mapping->i_pages); > + > + if (empty) > + inode_pages_clear(mapping->host); > + > pvec->nr = j; > } > > @@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping, > pgoff_t index; > int i; > > - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) > + if (mapping_empty(mapping)) > goto out; > > /* Offsets within partial pages */ > @@ -636,6 +648,7 @@ static int > invalidate_complete_page2(struct address_space *mapping, struct page *page) > { > unsigned long flags; > + bool empty; > > if (page->mapping != mapping) > return 0; > @@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) > goto failed; > > BUG_ON(page_has_private(page)); > - __delete_from_page_cache(page, NULL); > + empty = __delete_from_page_cache(page, NULL); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > if (mapping->a_ops->freepage) > mapping->a_ops->freepage(page); > > @@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, > int ret2 = 0; > int did_range_unmap = 0; > > - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) > + if (mapping_empty(mapping)) > goto out; > > pagevec_init(&pvec); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c05eb9efec07..c82e9831003f 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > } else { > void (*freepage)(struct page *); > void *shadow = NULL; > + int empty; > > freepage = mapping->a_ops->freepage; > /* > @@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > if (reclaimed && page_is_file_cache(page) && > !mapping_exiting(mapping) && !dax_mapping(mapping)) > shadow = workingset_eviction(page, target_memcg); > - __delete_from_page_cache(page, shadow); > + empty = __delete_from_page_cache(page, shadow); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > if (freepage != NULL) > freepage(page); > } > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 78d53378db99..ae802253f71c 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1203,9 +1203,11 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_NUMA > "zone_reclaim_failed", > #endif > - "pginodesteal", > "slabs_scanned", > + "pginodesteal", > "kswapd_inodesteal", > + "pginoderescue", > + "pginodedelayed", > "kswapd_low_wmark_hit_quickly", > "kswapd_high_wmark_hit_quickly", > "pageoutrun", > diff --git a/mm/workingset.c b/mm/workingset.c > index 474186b76ced..7ce9c74ebfdb 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, > struct xa_node *node = container_of(item, struct xa_node, private_list); > XA_STATE(xas, node->array, 0); > struct address_space *mapping; > + bool empty = false; > int ret; > > /* > @@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, > if (WARN_ON_ONCE(node->count != node->nr_values)) > goto out_invalid; > mapping->nrexceptional -= node->nr_values; > + empty = mapping_empty(mapping); > xas.xa_node = xa_parent_locked(&mapping->i_pages, node); > xas.xa_offset = node->offset; > xas.xa_shift = node->shift + XA_CHUNK_SHIFT; > @@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, > > out_invalid: > xa_unlock_irq(&mapping->i_pages); > + if (empty) > + inode_pages_clear(mapping->host); > ret = LRU_REMOVED_RETRY; > out: > cond_resched(); > -- > 2.24.1 > >
On Tue, Feb 11, 2020 at 03:44:38PM -0800, Andrew Morton wrote: > On Tue, 11 Feb 2020 14:31:01 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > On Tue, Feb 11, 2020 at 02:05:38PM -0500, Rik van Riel wrote: > > > On Tue, 2020-02-11 at 12:55 -0500, Johannes Weiner wrote: > > > > The VFS inode shrinker is currently allowed to reclaim inodes with > > > > populated page cache. As a result it can drop gigabytes of hot and > > > > active page cache on the floor without consulting the VM (recorded as > > > > "inodesteal" events in /proc/vmstat). > > > > > > > > This causes real problems in practice. Consider for example how the > > > > VM > > > > would cache a source tree, such as the Linux git tree. As large parts > > > > of the checked out files and the object database are accessed > > > > repeatedly, the page cache holding this data gets moved to the active > > > > list, where it's fully (and indefinitely) insulated from one-off > > > > cache > > > > moving through the inactive list. > > > > > > > This behavior of invalidating page cache from the inode shrinker goes > > > > back to even before the git import of the kernel tree. It may have > > > > been less noticeable when the VM itself didn't have real workingset > > > > protection, and floods of one-off cache would push out any active > > > > cache over time anyway. But the VM has come a long way since then and > > > > the inode shrinker is now actively subverting its caching strategy. > > > > > > Two things come to mind when looking at this: > > > - highmem > > > - NUMA > > > > > > IIRC one of the reasons reclaim is done in this way is > > > because a page cache page in one area of memory (highmem, > > > or a NUMA node) can end up pinning inode slab memory in > > > another memory area (normal zone, other NUMA node). > > > > That's a good point, highmem does ring a bell now that you mention it. > > Yup, that's why this mechanism exists. Here: > > https://marc.info/?l=git-commits-head&m=103646757213266&w=2 Ah, thanks for digging that up, I did not know that. > > If we still care, I think this could be solved by doing something > > similar to what we do with buffer_heads_over_limit: allow a lowmem > > allocation to reclaim page cache inside the highmem zone if the bhs > > (or inodes in this case) have accumulated excessively. > > Well, reclaiming highmem pagecache at random would be a painful way to > reclaim lowmem inodes. Better to pick an inode then shoot down all its > pagecache. Perhaps we could take its pagecache's aging into account. That reminds me of trying to correlate inode pages in reclaim to batch the cache tree lock, slab page objects in the shrinker to free whole pages etc. We never managed to actually do that. :-) > Testing this will be a challenge, but the issue was real - a 7GB > highmem machine isn't crazy and I expect the inode has become larger > since those days. Since the cache purging code was written for highmem scenarios, how about making it specific to CONFIG_HIGHMEM at least? That way we improve the situation for the more common setups, without regressing highmem configurations. And if somebody wanted to improve the CONFIG_HIGHMEM behavior as well, they could still do so. Somethig like the below delta on top of my patch? --- fs/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++---- include/linux/fs.h | 5 +++++ 2 files changed, 45 insertions(+), 4 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 575b780fa9bb..45b2abd4fef6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -454,6 +454,18 @@ bool inode_add_lru(struct inode *inode) return true; } +/* + * Usually, inodes become reclaimable when they are no longer + * referenced and their page cache has been reclaimed. The following + * API allows the VM to communicate cache population state to the VFS. + * + * However, on CONFIG_HIGHMEM we can't wait for the page cache to go + * away: cache pages allocated in a large highmem zone could pin + * struct inode memory allocated in relatively small lowmem zones. So + * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime. + */ + +#ifndef CONFIG_HIGHMEM /** * inode_pages_set - mark the inode as holding page cache * @inode: the inode whose first cache page was just added @@ -512,6 +524,7 @@ void inode_pages_clear(struct inode *inode) spin_unlock(&inode->i_lock); } +#endif /* !CONFIG_HIGHMEM */ /** * inode_sb_list_add - add inode to the superblock list of inodes @@ -826,16 +839,39 @@ static enum lru_status inode_lru_isolate(struct list_head *item, } /* - * Populated inodes shouldn't be on the shrinker LRU, but they - * can be briefly visible when a new page is added to an inode - * that was already linked but inode_pages_set() hasn't run - * yet to move them off. + * Usually, populated inodes shouldn't be on the shrinker LRU, + * but they can be briefly visible when a new page is added to + * an inode that was already linked but inode_pages_set() + * hasn't run yet to move them off. + * + * The other exception is on HIGHMEM systems: highmem cache + * can pin lowmem struct inodes, and we might be in dire + * straits in the lower zones. Purge cache to free the inode. */ if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { +#ifdef CONFIG_HIGHMEM + __iget(inode); + spin_unlock(&inode->i_lock); + spin_unlock(lru_lock); + if (remove_inode_buffers(inode)) { + unsigned long reap; + reap = invalidate_mapping_pages(&inode->i_data, 0, -1); + if (current_is_kswapd()) + __count_vm_events(KSWAPD_INODESTEAL, reap); + else + __count_vm_events(PGINODESTEAL, reap); + if (current->reclaim_state) + current->reclaim_state->reclaimed_slab += reap; + } + iput(inode); + spin_lock(lru_lock); + return LRU_RETRY; +#else list_lru_isolate(lru, &inode->i_lru); spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); return LRU_REMOVED; +#endif } WARN_ON(inode->i_state & I_NEW); diff --git a/include/linux/fs.h b/include/linux/fs.h index a98d9dee39f4..abdb3fd3432f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3106,8 +3106,13 @@ static inline void remove_inode_hash(struct inode *inode) __remove_inode_hash(inode); } +#ifndef CONFIG_HIGHMEM extern void inode_pages_set(struct inode *inode); extern void inode_pages_clear(struct inode *inode); +#else +static inline void inode_pages_set(struct inode *inode) {} +static inline void inode_pages_clear(struct inode *inode) {} +#endif extern void inode_sb_list_add(struct inode *inode);
On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote: > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Another variant of this problem was recently observed, where the > > kernel violates cgroups' memory.low protection settings and reclaims > > page cache way beyond the configured thresholds. It was followed by a > > proposal of a modified form of the reverted commit above, that > > implements memory.low-sensitive shrinker skipping over populated > > inodes on the LRU [1]. However, this proposal continues to run the > > risk of attracting disproportionate reclaim pressure to a pool of > > still-used inodes, > > Hi Johannes, > > If you really think that is a risk, what about bellow additional patch > to fix this risk ? > > diff --git a/fs/inode.c b/fs/inode.c > index 80dddbc..61862d9 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode, > goto out; > > cgroup_size = mem_cgroup_size(memcg); > - if (inode->i_data.nrpages + protection >= cgroup_size) > + if (inode->i_data.nrpages) > reclaimable = false; > > out: > > With this additional patch, we skip all inodes in this memcg until all > its page cache pages are reclaimed. Well that's something we've tried and had to revert because it caused issues in slab reclaim. See the History part of my changelog. > > while not addressing the more generic reclaim > > inversion problem outside of a very specific cgroup application. > > > > But I have a different understanding. This method works like a > knob. If you really care about your workingset (data), you should > turn it on (i.e. by using memcg protection to protect them), while > if you don't care about your workingset (data) then you'd better > turn it off. That would be more flexible. Regaring your case in the > commit log, why not protect your linux git tree with memcg > protection ? I can't imagine a scenario where I *wouldn't* care about my workingset, though. Why should it be opt-in, not the default?
On Wed, 12 Feb 2020 11:35:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote: > Since the cache purging code was written for highmem scenarios, how > about making it specific to CONFIG_HIGHMEM at least? Why do I have memories of suggesting this a couple of weeks ago ;) > That way we improve the situation for the more common setups, without > regressing highmem configurations. And if somebody wanted to improve > the CONFIG_HIGHMEM behavior as well, they could still do so. > > Somethig like the below delta on top of my patch? Does it need to be that complicated? What's wrong with --- a/fs/inode.c~a +++ a/fs/inode.c @@ -761,6 +761,10 @@ static enum lru_status inode_lru_isolate return LRU_ROTATE; } +#ifdef CONFIG_HIGHMEM + /* + * lengthy blah + */ if (inode_has_buffers(inode) || inode->i_data.nrpages) { __iget(inode); spin_unlock(&inode->i_lock); @@ -779,6 +783,7 @@ static enum lru_status inode_lru_isolate spin_lock(lru_lock); return LRU_RETRY; } +#endif WARN_ON(inode->i_state & I_NEW); inode->i_state |= I_FREEING;
On Wed, Feb 12, 2020 at 10:26:45AM -0800, Andrew Morton wrote: > On Wed, 12 Feb 2020 11:35:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > Since the cache purging code was written for highmem scenarios, how > > about making it specific to CONFIG_HIGHMEM at least? > > Why do I have memories of suggesting this a couple of weeks ago ;) Sorry, you did. I went back and found your email now. It completely slipped my mind after that thread went off into another direction. > > That way we improve the situation for the more common setups, without > > regressing highmem configurations. And if somebody wanted to improve > > the CONFIG_HIGHMEM behavior as well, they could still do so. > > > > Somethig like the below delta on top of my patch? > > Does it need to be that complicated? What's wrong with > > --- a/fs/inode.c~a > +++ a/fs/inode.c > @@ -761,6 +761,10 @@ static enum lru_status inode_lru_isolate > return LRU_ROTATE; > } > > +#ifdef CONFIG_HIGHMEM > + /* > + * lengthy blah > + */ > if (inode_has_buffers(inode) || inode->i_data.nrpages) { > __iget(inode); > spin_unlock(&inode->i_lock); > @@ -779,6 +783,7 @@ static enum lru_status inode_lru_isolate > spin_lock(lru_lock); > return LRU_RETRY; > } > +#endif Pages can show up here even under !CONFIG_HIGHMEM. Because of the lock order to maintain LRU state (i_lock -> xa_lock), when the page cache inserts new pages it doesn't unlink the inode from the LRU atomically, and the shrinker might get here before inode_pages_set(). In that case we need the shrinker to punt the inode off the LRU (the #else branch). > WARN_ON(inode->i_state & I_NEW); > inode->i_state |= I_FREEING; > _ > > Whatever we do will need plenty of testing. It wouldn't surprise me > if there are people who unknowingly benefit from this code on > 64-bit machines. If we agree this is the way to go, I can put the patch into our tree and gather data from the Facebook fleet before we merge it.
On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote: > > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > Another variant of this problem was recently observed, where the > > > kernel violates cgroups' memory.low protection settings and reclaims > > > page cache way beyond the configured thresholds. It was followed by a > > > proposal of a modified form of the reverted commit above, that > > > implements memory.low-sensitive shrinker skipping over populated > > > inodes on the LRU [1]. However, this proposal continues to run the > > > risk of attracting disproportionate reclaim pressure to a pool of > > > still-used inodes, > > > > Hi Johannes, > > > > If you really think that is a risk, what about bellow additional patch > > to fix this risk ? > > > > diff --git a/fs/inode.c b/fs/inode.c > > index 80dddbc..61862d9 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode, > > goto out; > > > > cgroup_size = mem_cgroup_size(memcg); > > - if (inode->i_data.nrpages + protection >= cgroup_size) > > + if (inode->i_data.nrpages) > > reclaimable = false; > > > > out: > > > > With this additional patch, we skip all inodes in this memcg until all > > its page cache pages are reclaimed. > > Well that's something we've tried and had to revert because it caused > issues in slab reclaim. See the History part of my changelog. > You misuderstood it. The reverted patch skips all inodes in the system, while this patch only works when you turn on memcg.{min, low} protection. IOW, that is not a default behavior, while it only works when you want it and only effect your targeted memcg rather than the whole system. > > > while not addressing the more generic reclaim > > > inversion problem outside of a very specific cgroup application. > > > > > > > But I have a different understanding. This method works like a > > knob. If you really care about your workingset (data), you should > > turn it on (i.e. by using memcg protection to protect them), while > > if you don't care about your workingset (data) then you'd better > > turn it off. That would be more flexible. Regaring your case in the > > commit log, why not protect your linux git tree with memcg > > protection ? > > I can't imagine a scenario where I *wouldn't* care about my > workingset, though. Why should it be opt-in, not the default? Because the default behavior has caused the XFS performace hit. (I haven't checked your patch carefully, so I don't know whehter your patch fix it yet.) Thanks Yafang
On Mi, 2020-02-12 at 08:50 +0000, Russell King - ARM Linux admin wrote: > On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote: > > On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > What's the situation with highmem on ARM? > > > > Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever > > needed it, and I was ranting at some people for repeating all the > > mistakes Intel did. > > > > But arm64 doesn't need it, and while 32-bit arm is obviosuly still > > selling, I think that in many ways the switch-over to 64-bit has been > > quicker on ARM than it was on x86. Partly because it happened later > > (so all the 64-bit teething pains were dealt with), but largely > > because everybody ended up actively discouraging 32-bit on the Android > > side. > > > > There were a couple of unfortunate early 32-bit arm server attempts, > > but they were - predictably - complete garbage and nobody bought them. > > They don't exist any more. > > > > So at least my gut feel is that the arm people don't have any big > > reason to push for maintaining HIGHMEM support either. > > > > But I'm adding a couple of arm people and the arm list just in case > > they have some input. > > > > [ Obvious background for newly added people: we're talking about > > making CONFIG_HIGHMEM a deprecated feature and saying that if you want > > to run with lots of memory on a 32-bit kernel, you're doing legacy > > stuff and can use a legacy kernel ] > > Well, the recent 32-bit ARM systems generally have more than 1G > of memory, so make use of highmem as a rule. You're probably > talking about crippling support for any 32-bit ARM system produced > in the last 8 to 10 years. I know of quite a few embedded ARMv7 systems that will need to be supported for at least 10 years from now, with RAM sizes between 1GB and even the full 4GB (well 3.75GB due to MMIO needing some address space). Deprecating highmem would severely cripple our ability to support those platforms with new kernels. Regards, Lucas
On Thu, Feb 13, 2020 at 09:47:29AM +0800, Yafang Shao wrote: > On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote: > > > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Another variant of this problem was recently observed, where the > > > > kernel violates cgroups' memory.low protection settings and reclaims > > > > page cache way beyond the configured thresholds. It was followed by a > > > > proposal of a modified form of the reverted commit above, that > > > > implements memory.low-sensitive shrinker skipping over populated > > > > inodes on the LRU [1]. However, this proposal continues to run the > > > > risk of attracting disproportionate reclaim pressure to a pool of > > > > still-used inodes, > > > > > > Hi Johannes, > > > > > > If you really think that is a risk, what about bellow additional patch > > > to fix this risk ? > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > index 80dddbc..61862d9 100644 > > > --- a/fs/inode.c > > > +++ b/fs/inode.c > > > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode, > > > goto out; > > > > > > cgroup_size = mem_cgroup_size(memcg); > > > - if (inode->i_data.nrpages + protection >= cgroup_size) > > > + if (inode->i_data.nrpages) > > > reclaimable = false; > > > > > > out: > > > > > > With this additional patch, we skip all inodes in this memcg until all > > > its page cache pages are reclaimed. > > > > Well that's something we've tried and had to revert because it caused > > issues in slab reclaim. See the History part of my changelog. > > You misuderstood it. > The reverted patch skips all inodes in the system, while this patch > only works when you turn on memcg.{min, low} protection. > IOW, that is not a default behavior, while it only works when you want > it and only effect your targeted memcg rather than the whole system. I understand perfectly well. Keeping unreclaimable inodes on the shrinker LRU causes the shrinker to build up excessive pressure on all VFS objects. This is a bug. Making it cgroup-specific doesn't make it less of a bug, it just means you only hit the bug when you use cgroup memory protection. > > > > while not addressing the more generic reclaim > > > > inversion problem outside of a very specific cgroup application. > > > > > > > > > > But I have a different understanding. This method works like a > > > knob. If you really care about your workingset (data), you should > > > turn it on (i.e. by using memcg protection to protect them), while > > > if you don't care about your workingset (data) then you'd better > > > turn it off. That would be more flexible. Regaring your case in the > > > commit log, why not protect your linux git tree with memcg > > > protection ? > > > > I can't imagine a scenario where I *wouldn't* care about my > > workingset, though. Why should it be opt-in, not the default? > > Because the default behavior has caused the XFS performace hit. That means that with your proposal you cannot use cgroup memory protection for workloads that run on xfs. (And if I remember the bug report correctly, this wasn't just xfs. It also caused metadata caches on other filesystems to get trashed. xfs was just more pronounced because it does sync inode flushing from the shrinker, adding write stalls to the mix of metadata cache misses.) What I'm proposing is an implementation that protects hot page cache without causing excessive shrinker pressure and rotations.
On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote: > > On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > What's the situation with highmem on ARM? > > > > Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever > > needed it, and I was ranting at some people for repeating all the > > mistakes Intel did. > > > > But arm64 doesn't need it, and while 32-bit arm is obviosuly still > > selling, I think that in many ways the switch-over to 64-bit has been > > quicker on ARM than it was on x86. Partly because it happened later > > (so all the 64-bit teething pains were dealt with), but largely > > because everybody ended up actively discouraging 32-bit on the Android > > side. > > > > There were a couple of unfortunate early 32-bit arm server attempts, > > but they were - predictably - complete garbage and nobody bought them. > > They don't exist any more. I'd generally agree with that, the systems with more than 4GB tended to be high-end systems predating the Cortex-A53/A57 that quickly got replaced once there were actual 64-bit parts, this would include axm5516 (replaced with x86-64 cores after sale to Intel), hip04 (replaced with arm64), or ecx-2000 (Calxeda bankruptcy). The one 32-bit SoC that I can think of that can actually drive lots of RAM and is still actively marketed is TI Keystone-2/AM5K2. The embedded AM5K2 is listed supporting up to 8GB of RAM, but the verison in the HPE ProLiant m800 server could take up to 32GB (!). I added Santosh and Kishon to Cc, they can probably comment on how long they think users will upgrade kernels on these. I suspect these devices can live for a very long time in things like wireless base stations, but it's possible that they all run on old kernels anyway by now (and are not worried about y2038). > > So at least my gut feel is that the arm people don't have any big > > reason to push for maintaining HIGHMEM support either. > > > > But I'm adding a couple of arm people and the arm list just in case > > they have some input. > > > > [ Obvious background for newly added people: we're talking about > > making CONFIG_HIGHMEM a deprecated feature and saying that if you want > > to run with lots of memory on a 32-bit kernel, you're doing legacy > > stuff and can use a legacy kernel ] > > Well, the recent 32-bit ARM systems generally have more than 1G > of memory, so make use of highmem as a rule. You're probably > talking about crippling support for any 32-bit ARM system produced > in the last 8 to 10 years. What I'm observing in the newly added board support is that memory configurations are actually going down, driven by component cost. 512MB is really cheap (~$4) these days with a single 256Mx16 DDR3 chip or two 128Mx16. Going beyond 1GB is where things get expensive with either 4+ chips or LPDDR3/LPDDR4 memory. For designs with 1GB, we're probably better off just using CONFIG_VMSPLIT_3G_OPT (without LPAE) anyway, completely avoiding highmem. That is particularly true on systems with a custom kernel configuration. 2GB machines are less common, but are definitely important, e.g. MT6580 based Android phones and some industrial embedded machines that will live a long time. I've recently seen reports of odd behavior with CONFIG_VMSPLIT_2G and plus CONFIG_HIGHMEM and a 7:1 ratio of lowmem to highmem that apparently causes OOM despite lots of lowmem being free. I suspect a lot of those workloads would still be better off with a CONFIG_VMSPLIT_2G_OPT (1.75 GB user, 2GB linear map). That config unfortunately has a few problems, too: - nobody has implemented it - it won't work with LPAE and therefore cannot support hardware that relies on high physical addresses for RAM or MMIO (those could run CONFIG_VMSPLIT_2G at the cost of wasting 12.5% of RAM). - any workload that requires the full 3GB of virtual address space won't work at all. This might be e.g. MAP_FIXED users, or build servers linking large binaries. It will take a while to find out what kinds of workloads suffer the most from a different vmsplit and what can be done to address that, but we could start by changing the kernel defconfig and distro builds to see who complains ;-) I think 32-bit ARM machines with 3GB or more are getting very rare, but some still exist: - The Armada XP development board had a DIMM slot that could take large memory (possibly up to 8GB with LPAE). This never shipped as a commercial product, but distro build servers sometimes still run on this, or on the old Calxeda or Keystone server systems. - a few early i.MX6 boards (e.g. HummingBoard) came had 4GB of RAM, though none of these seem to be available any more. - High-end phones from 2013/2014 had 3GB LPDDR3 before getting obsoleted by 64-bit phones. Presumably none of these ever ran Linux-4.x or newer. - My main laptop is a RK3288 based Chromebook with 4GB that just got updated to linux-4.19 by Google. Official updates apparently stop this summer, but it could easily run Debian later on. - Some people run 32-bit kernels on a 64-bit Raspberry Pi 4 or on arm64 KVM with lots of RAM. These should probably all migrate to 64-bit kernels with compat user space anyway. In theory these could also run on a VMSPLIT_4G_4G-like setup, but I don't think anyone wants to go there. Deprecating highmem definitely impacts any such users significantly, though staying on an LTS kernel may be an option if there are only few of them. Arnd
On Thu, Feb 13, 2020 at 9:46 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Feb 13, 2020 at 09:47:29AM +0800, Yafang Shao wrote: > > On Thu, Feb 13, 2020 at 12:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Wed, Feb 12, 2020 at 08:25:45PM +0800, Yafang Shao wrote: > > > > On Wed, Feb 12, 2020 at 1:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > Another variant of this problem was recently observed, where the > > > > > kernel violates cgroups' memory.low protection settings and reclaims > > > > > page cache way beyond the configured thresholds. It was followed by a > > > > > proposal of a modified form of the reverted commit above, that > > > > > implements memory.low-sensitive shrinker skipping over populated > > > > > inodes on the LRU [1]. However, this proposal continues to run the > > > > > risk of attracting disproportionate reclaim pressure to a pool of > > > > > still-used inodes, > > > > > > > > Hi Johannes, > > > > > > > > If you really think that is a risk, what about bellow additional patch > > > > to fix this risk ? > > > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > > index 80dddbc..61862d9 100644 > > > > --- a/fs/inode.c > > > > +++ b/fs/inode.c > > > > @@ -760,7 +760,7 @@ static bool memcg_can_reclaim_inode(struct inode *inode, > > > > goto out; > > > > > > > > cgroup_size = mem_cgroup_size(memcg); > > > > - if (inode->i_data.nrpages + protection >= cgroup_size) > > > > + if (inode->i_data.nrpages) > > > > reclaimable = false; > > > > > > > > out: > > > > > > > > With this additional patch, we skip all inodes in this memcg until all > > > > its page cache pages are reclaimed. > > > > > > Well that's something we've tried and had to revert because it caused > > > issues in slab reclaim. See the History part of my changelog. > > > > You misuderstood it. > > The reverted patch skips all inodes in the system, while this patch > > only works when you turn on memcg.{min, low} protection. > > IOW, that is not a default behavior, while it only works when you want > > it and only effect your targeted memcg rather than the whole system. > > I understand perfectly well. > > Keeping unreclaimable inodes on the shrinker LRU causes the shrinker > to build up excessive pressure on all VFS objects. This is a > bug. Making it cgroup-specific doesn't make it less of a bug, it just > means you only hit the bug when you use cgroup memory protection. > What I mean to fix is really a cgroup-specific issue, but this issue may be different with what you're meaning to fix. (I will explain it bellow) Considering the excessive pressure the protected inodes may give to the shrinker, the protected page cache pages will give much more pressure on the reclaimer. If you mean to remove the protecrted inodes from the shrinker LRU, why not removing the protected page cache pages from the page cache LRU as well ? Well, what I really to mean is, that is how the memcg proctection works. > > > > > while not addressing the more generic reclaim > > > > > inversion problem outside of a very specific cgroup application. > > > > > > > > > > > > > But I have a different understanding. This method works like a > > > > knob. If you really care about your workingset (data), you should > > > > turn it on (i.e. by using memcg protection to protect them), while > > > > if you don't care about your workingset (data) then you'd better > > > > turn it off. That would be more flexible. Regaring your case in the > > > > commit log, why not protect your linux git tree with memcg > > > > protection ? > > > > > > I can't imagine a scenario where I *wouldn't* care about my > > > workingset, though. Why should it be opt-in, not the default? > > > > Because the default behavior has caused the XFS performace hit. > > That means that with your proposal you cannot use cgroup memory > protection for workloads that run on xfs. > Well, if you set memory.min to protect your workload inside a specific memcg, it means that you already know these memroy can't be used by your workload outside the memcg. That means, the performace of the workload outside the memcg may not as good as before. Then you should adjust your SLA or migrating this protected memcgs to other host or just killing this protected memcg. IOW, the result is *expected*. > (And if I remember the bug report correctly, this wasn't just xfs. It > also caused metadata caches on other filesystems to get trashed. xfs > was just more pronounced because it does sync inode flushing from the > shrinker, adding write stalls to the mix of metadata cache misses.) > > What I'm proposing is an implementation that protects hot page cache > without causing excessive shrinker pressure and rotations. That's the different between your issue and my issue. You're trying to fix the issue around the hot page cache, but what I want to fix may be cold page cache and it really is a memcg protection specific issue. Becuase the memcg protection can protect all page cache pages, even if the page cache pages are cold and the inodes are cold (in the tail of the list lru) as well. That is one of the reasons why memcg protect exist. (I know you are the author of memcg protection, but I have to clarify what memcg protect is.) Regarding your issue around the hot page cache pages, I have another question. If the page cache pages are hot, why are the inode of these page cahe pages cold (in the tail of the list lru) ? Per my understanding, if the page cache pages are hot, the inodes of them should be hot (not in the tail of the list lur) as well. That should be how the LRU works. Well, that doesn't mean I object to your patch. What I really want to clarify is that our issues are really different. Thanks Yafang
Hi Johannes, I love your patch! Yet something to improve: [auto build test ERROR on vfs/for-next] [also build test ERROR on linux/master linus/master v5.6-rc1 next-20200213] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system. BTW, we also suggest to use '--base' option to specify the base tree in git format-patch, please see https://stackoverflow.com/a/37406982] url: https://github.com/0day-ci/linux/commits/Johannes-Weiner/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-LRU/20200214-083756 base: https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-next config: m68k-allmodconfig (attached as .config) compiler: m68k-linux-gcc (GCC) 7.5.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.5.0 make.cross ARCH=m68k If you fix the issue, kindly add following tag Reported-by: kbuild test robot <lkp@intel.com> All errors (new ones prefixed by >>): fs/dax.c: In function 'grab_mapping_entry': >> fs/dax.c:556:28: error: 'struct address_space' has no member named 'inode' inode_pages_clear(mapping->inode); ^~ fs/dax.c:558:26: error: 'struct address_space' has no member named 'inode' inode_pages_set(mapping->inode); ^~ vim +556 fs/dax.c 446 447 /* 448 * Find page cache entry at given index. If it is a DAX entry, return it 449 * with the entry locked. If the page cache doesn't contain an entry at 450 * that index, add a locked empty entry. 451 * 452 * When requesting an entry with size DAX_PMD, grab_mapping_entry() will 453 * either return that locked entry or will return VM_FAULT_FALLBACK. 454 * This will happen if there are any PTE entries within the PMD range 455 * that we are requesting. 456 * 457 * We always favor PTE entries over PMD entries. There isn't a flow where we 458 * evict PTE entries in order to 'upgrade' them to a PMD entry. A PMD 459 * insertion will fail if it finds any PTE entries already in the tree, and a 460 * PTE insertion will cause an existing PMD entry to be unmapped and 461 * downgraded to PTE entries. This happens for both PMD zero pages as 462 * well as PMD empty entries. 463 * 464 * The exception to this downgrade path is for PMD entries that have 465 * real storage backing them. We will leave these real PMD entries in 466 * the tree, and PTE writes will simply dirty the entire PMD entry. 467 * 468 * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For 469 * persistent memory the benefit is doubtful. We can add that later if we can 470 * show it helps. 471 * 472 * On error, this function does not return an ERR_PTR. Instead it returns 473 * a VM_FAULT code, encoded as an xarray internal entry. The ERR_PTR values 474 * overlap with xarray value entries. 475 */ 476 static void *grab_mapping_entry(struct xa_state *xas, 477 struct address_space *mapping, unsigned int order) 478 { 479 unsigned long index = xas->xa_index; 480 bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */ 481 int populated; 482 void *entry; 483 484 retry: 485 populated = 0; 486 xas_lock_irq(xas); 487 entry = get_unlocked_entry(xas, order); 488 489 if (entry) { 490 if (dax_is_conflict(entry)) 491 goto fallback; 492 if (!xa_is_value(entry)) { 493 xas_set_err(xas, EIO); 494 goto out_unlock; 495 } 496 497 if (order == 0) { 498 if (dax_is_pmd_entry(entry) && 499 (dax_is_zero_entry(entry) || 500 dax_is_empty_entry(entry))) { 501 pmd_downgrade = true; 502 } 503 } 504 } 505 506 if (pmd_downgrade) { 507 /* 508 * Make sure 'entry' remains valid while we drop 509 * the i_pages lock. 510 */ 511 dax_lock_entry(xas, entry); 512 513 /* 514 * Besides huge zero pages the only other thing that gets 515 * downgraded are empty entries which don't need to be 516 * unmapped. 517 */ 518 if (dax_is_zero_entry(entry)) { 519 xas_unlock_irq(xas); 520 unmap_mapping_pages(mapping, 521 xas->xa_index & ~PG_PMD_COLOUR, 522 PG_PMD_NR, false); 523 xas_reset(xas); 524 xas_lock_irq(xas); 525 } 526 527 dax_disassociate_entry(entry, mapping, false); 528 xas_store(xas, NULL); /* undo the PMD join */ 529 dax_wake_entry(xas, entry, true); 530 mapping->nrexceptional--; 531 if (mapping_empty(mapping)) 532 populated = -1; 533 entry = NULL; 534 xas_set(xas, index); 535 } 536 537 if (entry) { 538 dax_lock_entry(xas, entry); 539 } else { 540 unsigned long flags = DAX_EMPTY; 541 542 if (order > 0) 543 flags |= DAX_PMD; 544 entry = dax_make_entry(pfn_to_pfn_t(0), flags); 545 dax_lock_entry(xas, entry); 546 if (xas_error(xas)) 547 goto out_unlock; 548 if (mapping_empty(mapping)) 549 populated++; 550 mapping->nrexceptional++; 551 } 552 553 out_unlock: 554 xas_unlock_irq(xas); 555 if (populated == -1) > 556 inode_pages_clear(mapping->inode); 557 else if (populated == 1) 558 inode_pages_set(mapping->inode); 559 if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) 560 goto retry; 561 if (xas->xa_node == XA_ERROR(-ENOMEM)) 562 return xa_mk_internal(VM_FAULT_OOM); 563 if (xas_error(xas)) 564 return xa_mk_internal(VM_FAULT_SIGBUS); 565 return entry; 566 fallback: 567 xas_unlock_irq(xas); 568 return xa_mk_internal(VM_FAULT_FALLBACK); 569 } 570 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi Johannes,
I love your patch! Perhaps something to improve:
[auto build test WARNING on vfs/for-next]
[also build test WARNING on linux/master linus/master v5.6-rc1 next-20200214]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Johannes-Weiner/vfs-keep-inodes-with-page-cache-off-the-inode-shrinker-LRU/20200214-083756
base: https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-next
If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>
coccinelle warnings: (new ones prefixed by >>)
>> mm/truncate.c:41:9-10: WARNING: return of 0/1 in function '__clear_shadow_entry' with return type bool
Please review and possibly fold the followup patch.
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi Arnd, On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote: > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote: > > > So at least my gut feel is that the arm people don't have any big > > > reason to push for maintaining HIGHMEM support either. > > > > > > But I'm adding a couple of arm people and the arm list just in case > > > they have some input. > > > > > > [ Obvious background for newly added people: we're talking about > > > making CONFIG_HIGHMEM a deprecated feature and saying that if you want > > > to run with lots of memory on a 32-bit kernel, you're doing legacy > > > stuff and can use a legacy kernel ] > > > > Well, the recent 32-bit ARM systems generally have more than 1G > > of memory, so make use of highmem as a rule. You're probably > > talking about crippling support for any 32-bit ARM system produced > > in the last 8 to 10 years. > > What I'm observing in the newly added board support is that memory > configurations are actually going down, driven by component cost. > 512MB is really cheap (~$4) these days with a single 256Mx16 DDR3 > chip or two 128Mx16. Going beyond 1GB is where things get expensive > with either 4+ chips or LPDDR3/LPDDR4 memory. > > For designs with 1GB, we're probably better off just using > CONFIG_VMSPLIT_3G_OPT (without LPAE) anyway, completely > avoiding highmem. That is particularly true on systems with a custom > kernel configuration. > > 2GB machines are less common, but are definitely important, e.g. > MT6580 based Android phones and some industrial embedded machines > that will live a long time. I've recently seen reports of odd behavior > with CONFIG_VMSPLIT_2G and plus CONFIG_HIGHMEM and a 7:1 > ratio of lowmem to highmem that apparently causes OOM despite lots > of lowmem being free. I suspect a lot of those workloads would still be > better off with a CONFIG_VMSPLIT_2G_OPT (1.75 GB user, 2GB > linear map). That config unfortunately has a few problems, too: > - nobody has implemented it > - it won't work with LPAE and therefore cannot support hardware > that relies on high physical addresses for RAM or MMIO > (those could run CONFIG_VMSPLIT_2G at the cost of wasting > 12.5% of RAM). > - any workload that requires the full 3GB of virtual address space won't > work at all. This might be e.g. MAP_FIXED users, or build servers > linking large binaries. > It will take a while to find out what kinds of workloads suffer the most > from a different vmsplit and what can be done to address that, but we > could start by changing the kernel defconfig and distro builds to see > who complains ;-) > > I think 32-bit ARM machines with 3GB or more are getting very rare, > but some still exist: > - The Armada XP development board had a DIMM slot that could take > large memory (possibly up to 8GB with LPAE). This never shipped as > a commercial product, but distro build servers sometimes still run on > this, or on the old Calxeda or Keystone server systems. > - a few early i.MX6 boards (e.g. HummingBoard) came had 4GB of > RAM, though none of these seem to be available any more. > - High-end phones from 2013/2014 had 3GB LPDDR3 before getting > obsoleted by 64-bit phones. Presumably none of these ever ran > Linux-4.x or newer. > - My main laptop is a RK3288 based Chromebook with 4GB that just > got updated to linux-4.19 by Google. Official updates apparently > stop this summer, but it could easily run Debian later on. > - Some people run 32-bit kernels on a 64-bit Raspberry Pi 4 or on > arm64 KVM with lots of RAM. These should probably all > migrate to 64-bit kernels with compat user space anyway. > In theory these could also run on a VMSPLIT_4G_4G-like setup, > but I don't think anyone wants to go there. Deprecating highmem > definitely impacts any such users significantly, though staying on > an LTS kernel may be an option if there are only few of them. The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even for 1 GiB or 2 GiB configurations) in two parts, one below and one above the 32-bit physical limit. Gr{oetje,eeting}s, Geert
On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote: > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote: > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even > for 1 GiB or 2 GiB configurations) in two parts, one below and one above > the 32-bit physical limit. Good to know. I think there are several other chips that have dual-channel DDR3 and thus /can/ support this configuration, but this rarely happens. Are you aware of commercial products that use a 4GB configuration, aside from the reference board? For TI AM54x, there is apparently a variant of the Dragonbox Pyro with 4G, which is said to be shipping in the near future, see https://en.wikipedia.org/wiki/DragonBox_Pyra Arnd
Hi Arnd, On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote: > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven > <geert@linux-m68k.org> wrote: > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> wrote: > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > > > <linux@armlinux.org.uk> wrote: > > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even > > for 1 GiB or 2 GiB configurations) in two parts, one below and one above > > the 32-bit physical limit. > > Good to know. I think there are several other chips that have dual-channel > DDR3 and thus /can/ support this configuration, but this rarely happens. > Are you aware of commercial products that use a 4GB configuration, aside from > the reference board? Unfortunately I don't know. Chris Paterson might know. Gr{oetje,eeting}s, Geert
Hello Arnd, Geert, > From: Geert Uytterhoeven <geert@linux-m68k.org> > Sent: 16 February 2020 09:45 > To: Arnd Bergmann <arnd@arndb.de> > > Hi Arnd, > > On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote: > > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven > > <geert@linux-m68k.org> wrote: > > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> > wrote: > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > > > > <linux@armlinux.org.uk> wrote: > > > > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even > > > for 1 GiB or 2 GiB configurations) in two parts, one below and one above > > > the 32-bit physical limit. Yep. One example is r8a7743-iwg20m.dtsi. > > > > Good to know. I think there are several other chips that have dual-channel > > DDR3 and thus /can/ support this configuration, but this rarely happens. > > Are you aware of commercial products that use a 4GB configuration, aside > from > > the reference board? iWave Systems make a range of SOM modules using the RZ/G1 SoCs. I believe there are options for some of these to use 4 GB, although 1 or 2 GB is used in the boards we've upstreamed support for. There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1, but I'm not sure of the details. Kind regards, Chris > > Unfortunately I don't know. > Chris Paterson might know. > > Gr{oetje,eeting}s, > > Geert > > -- > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux- > m68k.org > > In personal conversations with technical people, I call myself a hacker. But > when I'm talking to journalists I just say "programmer" or something like that. > -- Linus Torvalds
On Sun, Feb 16, 2020 at 8:54 PM Chris Paterson <Chris.Paterson2@renesas.com> wrote: > > Hello Arnd, Geert, > > > From: Geert Uytterhoeven <geert@linux-m68k.org> > > Sent: 16 February 2020 09:45 > > To: Arnd Bergmann <arnd@arndb.de> > > > > Hi Arnd, > > > > On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote: > > > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven > > > <geert@linux-m68k.org> wrote: > > > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> > > wrote: > > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > > > > > <linux@armlinux.org.uk> wrote: > > > > > > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split (even > > > > for 1 GiB or 2 GiB configurations) in two parts, one below and one above > > > > the 32-bit physical limit. > > Yep. One example is r8a7743-iwg20m.dtsi. This one has 2x512MB, with half above the 4GiB limit. This means it needs LPAE to address high physical addresses (which is fine), but it does not need highmem if one uses an appropriate CONFIG_VMSPLIT_* option. > > > Good to know. I think there are several other chips that have dual-channel > > > DDR3 and thus /can/ support this configuration, but this rarely happens. > > > Are you aware of commercial products that use a 4GB configuration, aside > > from > > > the reference board? > > iWave Systems make a range of SOM modules using the RZ/G1 SoCs. > I believe there are options for some of these to use 4 GB, although 1 or 2 GB is > used in the boards we've upstreamed support for. > > There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1, > but I'm not sure of the details. Both iWave and Emtrion only seem to list boards with 2GB or less on their websites today (with up to 15 year availability). My guess is that they had the same problem as everyone else in finding the right memory chips in the required quantities and/or long-term availability. iWave lists "By default 1GB DDR3 and 4GB eMMC only supported. Contact iWave for memory expansion support." on some boards, but that doesn't mean they ever shipped a 4GB configuration. Arnd
Hi! > > Testing this will be a challenge, but the issue was real - a 7GB > > highmem machine isn't crazy and I expect the inode has become larger > > since those days. > > Hmm. I would say that in the intening years a 7GB highmem machine has > indeed become crazy. > > It used to be something we kind of supported. > > But we really should consider HIGHMEM to be something that is on the > deprecation list. In this day and age, there is no excuse for running > a 32-bit kernel with lots of physical memory. > > And if you really want to do that, and have some legacy hardware with > a legacy use case, maybe you should be using a legacy kernel. > > I'd personally be perfectly happy to start removing HIGHMEM support again. 7GB HIGHMEM machine may be unusual, but AFAICT HIGHMEM is need for configurations like 1GB x86 machine, and definitely for 3GB x86 machine. 32-bit machines with 1.5 to 4GB of RAM are still pretty common (I have two of those), and dropping HIGHMEM support will limit them to 0.8GB RAM and probably make them unusable even for simple web browsing. I have two such machines here, please don't break them :-). Best regards, Pavel
Hello, > From: Arnd Bergmann <arnd@arndb.de> > Sent: 16 February 2020 20:38 > > On Sun, Feb 16, 2020 at 8:54 PM Chris Paterson > <Chris.Paterson2@renesas.com> wrote: > > > > Hello Arnd, Geert, > > > > > From: Geert Uytterhoeven <geert@linux-m68k.org> > > > Sent: 16 February 2020 09:45 > > > To: Arnd Bergmann <arnd@arndb.de> > > > > > > Hi Arnd, > > > > > > On Sat, Feb 15, 2020 at 5:59 PM Arnd Bergmann <arnd@arndb.de> wrote: > > > > On Sat, Feb 15, 2020 at 12:25 PM Geert Uytterhoeven > > > > <geert@linux-m68k.org> wrote: > > > > > On Thu, Feb 13, 2020 at 5:54 PM Arnd Bergmann <arnd@arndb.de> > > > wrote: > > > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > > > > > > <linux@armlinux.org.uk> wrote: > > > > > > > > > > The CIP-supported RZ/G1 SoCs can have up to 4 GiB, typically split > (even > > > > > for 1 GiB or 2 GiB configurations) in two parts, one below and one > above > > > > > the 32-bit physical limit. > > > > Yep. One example is r8a7743-iwg20m.dtsi. > > This one has 2x512MB, with half above the 4GiB limit. This means it needs > LPAE to address high physical addresses (which is fine), but it does not need > highmem if one uses an appropriate CONFIG_VMSPLIT_* option. > > > > > Good to know. I think there are several other chips that have dual- > channel > > > > DDR3 and thus /can/ support this configuration, but this rarely happens. > > > > Are you aware of commercial products that use a 4GB configuration, > aside > > > from > > > > the reference board? > > > > iWave Systems make a range of SOM modules using the RZ/G1 SoCs. > > I believe there are options for some of these to use 4 GB, although 1 or 2 > GB is > > used in the boards we've upstreamed support for. > > > > There are also other SOM vendors (e.g. Emtrion) and end users of RZ/G1, > > but I'm not sure of the details. > > Both iWave and Emtrion only seem to list boards with 2GB or less on their > websites today (with up to 15 year availability). My guess is that they had > the same problem as everyone else in finding the right memory chips in > the required quantities and/or long-term availability. iWave lists "By default > 1GB DDR3 and 4GB eMMC only supported. Contact iWave for memory > expansion support." on some boards, but that doesn't mean they ever > shipped a 4GB configuration. I probably should have been clearer before - I actually have a couple of iWave RZ/G1M SOM modules with 4GB DDR on them. However I've never booted them nor do I know what the memory mapping is. Kind regards, Chris > > Arnd
On 2/13/20 8:52 AM, Arnd Bergmann wrote: > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: >> >> On Tue, Feb 11, 2020 at 05:03:02PM -0800, Linus Torvalds wrote: >>> On Tue, Feb 11, 2020 at 4:47 PM Andrew Morton <akpm@linux-foundation.org> wrote: >>>> >>>> What's the situation with highmem on ARM? >>> >>> Afaik it's exactly the same as highmem on x86 - only 32-bit ARM ever >>> needed it, and I was ranting at some people for repeating all the >>> mistakes Intel did. >>> >>> But arm64 doesn't need it, and while 32-bit arm is obviosuly still >>> selling, I think that in many ways the switch-over to 64-bit has been >>> quicker on ARM than it was on x86. Partly because it happened later >>> (so all the 64-bit teething pains were dealt with), but largely >>> because everybody ended up actively discouraging 32-bit on the Android >>> side. >>> >>> There were a couple of unfortunate early 32-bit arm server attempts, >>> but they were - predictably - complete garbage and nobody bought them. >>> They don't exist any more. > > I'd generally agree with that, the systems with more than 4GB tended to > be high-end systems predating the Cortex-A53/A57 that quickly got > replaced once there were actual 64-bit parts, this would include axm5516 > (replaced with x86-64 cores after sale to Intel), hip04 (replaced > with arm64), or ecx-2000 (Calxeda bankruptcy). > > The one 32-bit SoC that I can think of that can actually drive lots of > RAM and is still actively marketed is TI Keystone-2/AM5K2. > The embedded AM5K2 is listed supporting up to 8GB of RAM, but > the verison in the HPE ProLiant m800 server could take up to 32GB (!). > > I added Santosh and Kishon to Cc, they can probably comment on how > long they think users will upgrade kernels on these. I suspect these > devices can live for a very long time in things like wireless base stations, > but it's possible that they all run on old kernels anyway by now (and are > not worried about y2038). > >>> So at least my gut feel is that the arm people don't have any big >>> reason to push for maintaining HIGHMEM support either. >>> >>> But I'm adding a couple of arm people and the arm list just in case >>> they have some input. The Keystone generations of SOCs have been used in different areas and they will be used for long unless says otherwise. Apart from just split of lowmem and highmem, one of the peculiar thing with Keystome family of SOCs is the DDR is addressable from two addressing ranges. The lowmem address range is actually non-cached range and the higher range is the cacheable. So apart from LPAE, another change I needed to do back then is to boot the linux from lowmem with bootstrap MMU tables and then re-create MMU tables early (mmu init) and use higher range for entire memory so that L3 cache can be used. AFAIK, all the derived SOCs from Keystone architecture inherently use this feature. Regards, Santosh
On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote: > > On 2/13/20 8:52 AM, Arnd Bergmann wrote: > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > The Keystone generations of SOCs have been used in different areas and > they will be used for long unless says otherwise. > > Apart from just split of lowmem and highmem, one of the peculiar thing > with Keystome family of SOCs is the DDR is addressable from two > addressing ranges. The lowmem address range is actually non-cached > range and the higher range is the cacheable. I'm aware of Keystone's special physical memory layout, but for the discussion here, this is actually irrelevant for the discussion about highmem here, which is only about the way we map all or part of the available physical memory into the 4GB of virtual address space. The far more important question is how much memory any users (in particular the subset that are going to update their kernels several years from now) actually have installed. Keystone-II is one of the rare 32-bit chips with fairly wide memory interfaces, having two 72-bit (with ECC) channels rather than the usual one or two channels of 32-bit DDR3. This means a relatively cheap 4GB configuration using eight 256Mx16 chips is possible, or even a 8GB using sixteen or eighteen 512Mx8. Do you have an estimate on how common these 4GB and 8GB configurations are in practice outside of the TI evaluation board? Arnd
+Nishant, Tero On 2/26/20 1:01 PM, Arnd Bergmann wrote: > On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote: >> >> On 2/13/20 8:52 AM, Arnd Bergmann wrote: >>> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin >>> <linux@armlinux.org.uk> wrote: >> >> The Keystone generations of SOCs have been used in different areas and >> they will be used for long unless says otherwise. >> >> Apart from just split of lowmem and highmem, one of the peculiar thing >> with Keystome family of SOCs is the DDR is addressable from two >> addressing ranges. The lowmem address range is actually non-cached >> range and the higher range is the cacheable. > > I'm aware of Keystone's special physical memory layout, but for the > discussion here, this is actually irrelevant for the discussion about > highmem here, which is only about the way we map all or part of the > available physical memory into the 4GB of virtual address space. > > The far more important question is how much memory any users > (in particular the subset that are going to update their kernels > several years from now) actually have installed. Keystone-II is > one of the rare 32-bit chips with fairly wide memory interfaces, > having two 72-bit (with ECC) channels rather than the usual one > or two channels of 32-bit DDR3. This means a relatively cheap > 4GB configuration using eight 256Mx16 chips is possible, or > even a 8GB using sixteen or eighteen 512Mx8. > > Do you have an estimate on how common these 4GB and 8GB > configurations are in practice outside of the TI evaluation > board? > From my TI memories, many K2 customers were going to install more than 2G memory. Don't remember 8G, but 4G was the dominant one afair. Will let Nishant/Tero elaborate latest on this. regards, Santosh
On 13:11-20200226, santosh.shilimkar@oracle.com wrote: > +Nishant, Tero > > On 2/26/20 1:01 PM, Arnd Bergmann wrote: > > On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote: > > > > > > On 2/13/20 8:52 AM, Arnd Bergmann wrote: > > > > On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin > > > > <linux@armlinux.org.uk> wrote: > > > > > > The Keystone generations of SOCs have been used in different areas and > > > they will be used for long unless says otherwise. > > > > > > Apart from just split of lowmem and highmem, one of the peculiar thing > > > with Keystome family of SOCs is the DDR is addressable from two > > > addressing ranges. The lowmem address range is actually non-cached > > > range and the higher range is the cacheable. > > > > I'm aware of Keystone's special physical memory layout, but for the > > discussion here, this is actually irrelevant for the discussion about > > highmem here, which is only about the way we map all or part of the > > available physical memory into the 4GB of virtual address space. > > > > The far more important question is how much memory any users > > (in particular the subset that are going to update their kernels > > several years from now) actually have installed. Keystone-II is > > one of the rare 32-bit chips with fairly wide memory interfaces, > > having two 72-bit (with ECC) channels rather than the usual one > > or two channels of 32-bit DDR3. This means a relatively cheap > > 4GB configuration using eight 256Mx16 chips is possible, or > > even a 8GB using sixteen or eighteen 512Mx8. > > > > Do you have an estimate on how common these 4GB and 8GB > > configurations are in practice outside of the TI evaluation > > board? > > > From my TI memories, many K2 customers were going to install > more than 2G memory. Don't remember 8G, but 4G was the dominant > one afair. Will let Nishant/Tero elaborate latest on this. > Thanks for the headsup, it took a little to dig up the current situation: ~few 1000s still relevant spread between 4G and 8G (confirmed that both are present, relevant and in use). I wish we could sunset, but unfortunately, I am told(and agree) that we should'nt just leave products (and these are long term products stuck in critical parts in our world) hanging in the air, and migrations to newer kernel do still take place periodically (the best I can talk in public forum at least).
On 3/6/20 12:34 PM, Nishanth Menon wrote: > On 13:11-20200226, santosh.shilimkar@oracle.com wrote: >> +Nishant, Tero >> >> On 2/26/20 1:01 PM, Arnd Bergmann wrote: >>> On Wed, Feb 26, 2020 at 7:04 PM <santosh.shilimkar@oracle.com> wrote: >>>> >>>> On 2/13/20 8:52 AM, Arnd Bergmann wrote: >>>>> On Wed, Feb 12, 2020 at 9:50 AM Russell King - ARM Linux admin >>>>> <linux@armlinux.org.uk> wrote: >>>> >>>> The Keystone generations of SOCs have been used in different areas and >>>> they will be used for long unless says otherwise. >>>> >>>> Apart from just split of lowmem and highmem, one of the peculiar thing >>>> with Keystome family of SOCs is the DDR is addressable from two >>>> addressing ranges. The lowmem address range is actually non-cached >>>> range and the higher range is the cacheable. >>> >>> I'm aware of Keystone's special physical memory layout, but for the >>> discussion here, this is actually irrelevant for the discussion about >>> highmem here, which is only about the way we map all or part of the >>> available physical memory into the 4GB of virtual address space. >>> >>> The far more important question is how much memory any users >>> (in particular the subset that are going to update their kernels >>> several years from now) actually have installed. Keystone-II is >>> one of the rare 32-bit chips with fairly wide memory interfaces, >>> having two 72-bit (with ECC) channels rather than the usual one >>> or two channels of 32-bit DDR3. This means a relatively cheap >>> 4GB configuration using eight 256Mx16 chips is possible, or >>> even a 8GB using sixteen or eighteen 512Mx8. >>> >>> Do you have an estimate on how common these 4GB and 8GB >>> configurations are in practice outside of the TI evaluation >>> board? >>> >> From my TI memories, many K2 customers were going to install >> more than 2G memory. Don't remember 8G, but 4G was the dominant >> one afair. Will let Nishant/Tero elaborate latest on this. >> > > Thanks for the headsup, it took a little to dig up the current > situation: > > ~few 1000s still relevant spread between 4G and 8G (confirmed that both > are present, relevant and in use). > > I wish we could sunset, but unfortunately, I am told(and agree) > that we should'nt just leave products (and these are long term > products stuck in critical parts in our world) hanging in the air, and > migrations to newer kernel do still take place periodically (the best > I can talk in public forum at least). > Thanks Nishant !!
On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote: > On 13:11-20200226, santosh.shilimkar@oracle.com wrote: > > ~few 1000s still relevant spread between 4G and 8G (confirmed that both > are present, relevant and in use). > > I wish we could sunset, but unfortunately, I am told(and agree) > that we should'nt just leave products (and these are long term > products stuck in critical parts in our world) hanging in the air, and > migrations to newer kernel do still take place periodically (the best > I can talk in public forum at least). Thank you for the clear answer! I agree we should certainly not break any such use cases, and for the 8GB case there is not really a good replacement (using zram/zswap instead of highmem could work for some new workloads, but would be a rather risky change for an upgrade on already deployed systems). I hope it's ok to ask the same question every few years until you are reasonably sure that the users are ready to stop upgrading kernels beyond the following LTS kernel version. We can also do the same thing for the other 32-bit platforms that exceed the maximum amount of lowmem, and document which ones are known. In the meantime, it would seem useful to increase the amount of lowmem that can be used by default, using a combination of some of the changes mentioned earlier - add a VMSPLIT_2G_OPT config option for non-LPAE ARM kernels to handle the common i.MX6 case with 2GB of RAM without highmem - make VMSPLIT_2G_OPT (without LPAE) or VMSPLIT_2G (with LPAE) the default in most ARM defconfig files as well as distros, and disable highmem where possible, to see what breaks. - extend zswap to use all the available high memory for swap space when highmem is disabled. - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32) to see if it can be done, and what the overhead is. This is probably more work than the others combined, but also the most promising as it allows the most user address space and physical ram to be used. Arnd
On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote: > > On 13:11-20200226, santosh.shilimkar@oracle.com wrote: > > > > > ~few 1000s still relevant spread between 4G and 8G (confirmed that both > > are present, relevant and in use). > > > > I wish we could sunset, but unfortunately, I am told(and agree) > > that we should'nt just leave products (and these are long term > > products stuck in critical parts in our world) hanging in the air, and > > migrations to newer kernel do still take place periodically (the best > > I can talk in public forum at least). > > Thank you for the clear answer! > > I agree we should certainly not break any such use cases, and for the > 8GB case there is not really a good replacement (using zram/zswap > instead of highmem could work for some new workloads, but would be a > rather risky change for an upgrade on already deployed systems). > > I hope it's ok to ask the same question every few years until you are > reasonably sure that the users are ready to stop upgrading kernels > beyond the following LTS kernel version. We can also do the same > thing for the other 32-bit platforms that exceed the maximum amount > of lowmem, and document which ones are known. > > In the meantime, it would seem useful to increase the amount of > lowmem that can be used by default, using a combination of some > of the changes mentioned earlier > > - add a VMSPLIT_2G_OPT config option for non-LPAE ARM kernels > to handle the common i.MX6 case with 2GB of RAM without highmem > > - make VMSPLIT_2G_OPT (without LPAE) or VMSPLIT_2G (with > LPAE) the default in most ARM defconfig files as well as distros, > and disable highmem where possible, to see what breaks. > > - extend zswap to use all the available high memory for swap space > when highmem is disabled. I don't think that's a good idea. Running debian stable kernels on my 8GB laptop, I have problems when leaving firefox running long before even half the 16GB of swap gets consumed - the entire machine slows down very quickly when it starts swapping more than about 2 or so GB. It seems either the kernel has become quite bad at selecting pages to evict. It gets to the point where any git operation has a battle to fight for RAM, despite not touching anything else other than git. The behaviour is much like firefox is locking memory into core, but that doesn't seem to be what's actually going on. I've never really got to the bottom of it though. This is with 64-bit kernel and userspace. So, I'd suggest that trading off RAM available through highmem for VM space available through zswap is likely a bad idea if you have a workload that requires 4GB of RAM on a 32-bit machine.
On Sun, Mar 8, 2020 at 3:20 PM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > > On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote: > > > On 13:11-20200226, santosh.shilimkar@oracle.com wrote: > > > - extend zswap to use all the available high memory for swap space > > when highmem is disabled. > > I don't think that's a good idea. Running debian stable kernels on my > 8GB laptop, I have problems when leaving firefox running long before > even half the 16GB of swap gets consumed - the entire machine slows > down very quickly when it starts swapping more than about 2 or so GB. > It seems either the kernel has become quite bad at selecting pages to > evict. > > It gets to the point where any git operation has a battle to fight > for RAM, despite not touching anything else other than git. > > The behaviour is much like firefox is locking memory into core, but > that doesn't seem to be what's actually going on. I've never really > got to the bottom of it though. > > This is with 64-bit kernel and userspace. I agree there is something going wrong on your machine, but I don't really see how that relates to my suggestion. > So, I'd suggest that trading off RAM available through highmem for VM > space available through zswap is likely a bad idea if you have a > workload that requires 4GB of RAM on a 32-bit machine. Aside from every workload being different, I was thinking of these general observations: - If we are looking at a future without highmem, then it's better to use the extra memory for something than not using it. zswap seems like a reasonable use. - A lot of embedded systems are configured to have no swap at all, which can be for good or not-so-good reasons. Having some swap space available often improves things, even if it comes out of RAM. - A particularly important case to optimize for is 2GB of RAM with LPAE enabled. With CONFIG_VMSPLIT_2G and highmem, this leads to the paradox -ENOMEM when 256MB of highmem are full while plenty of lowmem is available. With highmem disabled, you avoid that at the cost of losing 12% of RAM. - With 4GB+ of RAM and CONFIG_VMSPLIT_2G or CONFIG_VMSPLIT_3G, using gigabytes of RAM for swap space would usually be worse than highmem, but once we have VMSPLIT_4G_4G, it's the same situation as above with 6% of RAM used for zswap instead of highmem. Arnd
On Mon, Mar 09, 2020 at 02:33:26PM +0100, Arnd Bergmann wrote: > On Sun, Mar 8, 2020 at 3:20 PM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > > > On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote: > > > > On 13:11-20200226, santosh.shilimkar@oracle.com wrote: > > > > > - extend zswap to use all the available high memory for swap space > > > when highmem is disabled. > > > > I don't think that's a good idea. Running debian stable kernels on my > > 8GB laptop, I have problems when leaving firefox running long before > > even half the 16GB of swap gets consumed - the entire machine slows > > down very quickly when it starts swapping more than about 2 or so GB. > > It seems either the kernel has become quite bad at selecting pages to > > evict. > > > > It gets to the point where any git operation has a battle to fight > > for RAM, despite not touching anything else other than git. > > > > The behaviour is much like firefox is locking memory into core, but > > that doesn't seem to be what's actually going on. I've never really > > got to the bottom of it though. > > > > This is with 64-bit kernel and userspace. > > I agree there is something going wrong on your machine, but I > don't really see how that relates to my suggestion. You are suggesting for a 4GB machine to use 2GB of RAM for normal usage via an optimised virtual space layout, and 2GB of RAM for swap using ZSWAP, rather than having 4GB of RAM available via the present highmem / lowmem system. I'm saying that is quite risky given the behaviours I'm seeing, where modern Linux seems to be struggling to work out what pages it should be evicting. I think Linux performs way better when it doesn't have to use swap. > > So, I'd suggest that trading off RAM available through highmem for VM > > space available through zswap is likely a bad idea if you have a > > workload that requires 4GB of RAM on a 32-bit machine. > > Aside from every workload being different, I was thinking of > these general observations: > > - If we are looking at a future without highmem, then it's better to use > the extra memory for something than not using it. zswap seems like > a reasonable use. I think that's still a very open question, one which hasn't been answered yet. > - A lot of embedded systems are configured to have no swap at all, > which can be for good or not-so-good reasons. Having some > swap space available often improves things, even if it comes > out of RAM. How do you come up with that assertion? What is the evidence behind it? This is kind of the crux of my point in the previous email: Linux with swap performs way worse for me - if I had 16GB of RAM in my laptop, I bet it would perform better than my current 8GB with a 16GB swap file - where, when the swap file gets around 8GB full, the system as a whole starts to struggle. That's about a 50/50 split of VM space between RAM and swap. Proposing 2GB+ swap 2GB RAM would potentially place these machines into the same situation as my laptop, and if it also results in a loss of performance, we could end up with regression reports. > - A particularly important case to optimize for is 2GB of RAM with > LPAE enabled. With CONFIG_VMSPLIT_2G and highmem, this > leads to the paradox -ENOMEM when 256MB of highmem are > full while plenty of lowmem is available. With highmem disabled, > you avoid that at the cost of losing 12% of RAM. What happened to requests for memory from highmem being able to be sourced from lowmem if highmem wasn't available? That used to be standard kernel behaviour. > - With 4GB+ of RAM and CONFIG_VMSPLIT_2G or > CONFIG_VMSPLIT_3G, using gigabytes of RAM for swap > space would usually be worse than highmem, but once > we have VMSPLIT_4G_4G, it's the same situation as above > with 6% of RAM used for zswap instead of highmem. I think the chances of that happening are very slim - I doubt there is the will to invest the energy amongst what is left of the 32-bit ARM community.
On Mon, Mar 9, 2020 at 3:05 PM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Mon, Mar 09, 2020 at 02:33:26PM +0100, Arnd Bergmann wrote: > > On Sun, Mar 8, 2020 at 3:20 PM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > > > > On Fri, Mar 6, 2020 at 9:36 PM Nishanth Menon <nm@ti.com> wrote: > > > > > On 13:11-20200226, santosh.shilimkar@oracle.com wrote: > > > > > > > - extend zswap to use all the available high memory for swap space > > > > when highmem is disabled. > > > > > > I don't think that's a good idea. Running debian stable kernels on my > > > 8GB laptop, I have problems when leaving firefox running long before > > > even half the 16GB of swap gets consumed - the entire machine slows > > > down very quickly when it starts swapping more than about 2 or so GB. > > > It seems either the kernel has become quite bad at selecting pages to > > > evict. > > > > > > It gets to the point where any git operation has a battle to fight > > > for RAM, despite not touching anything else other than git. > > > > > > The behaviour is much like firefox is locking memory into core, but > > > that doesn't seem to be what's actually going on. I've never really > > > got to the bottom of it though. > > > > > > This is with 64-bit kernel and userspace. > > > > I agree there is something going wrong on your machine, but I > > don't really see how that relates to my suggestion. > > You are suggesting for a 4GB machine to use 2GB of RAM for normal > usage via an optimised virtual space layout, and 2GB of RAM for > swap using ZSWAP, rather than having 4GB of RAM available via the > present highmem / lowmem system. No, that would not be good. The cases where I would hope to get improvements out of zswap are: - 1GB of RAM with VMSPLIT_3G, when VMSPLIT_3G_OPT and VMSPLIT_2G don't work because of user address space requirements - 2GB of RAM with VMSPLIT_2G - 4GB of RAM if we add VMSPLIT_4G_4G > > - A lot of embedded systems are configured to have no swap at all, > > which can be for good or not-so-good reasons. Having some > > swap space available often improves things, even if it comes > > out of RAM. > > How do you come up with that assertion? What is the evidence behind > it? The idea of zswap is that it's faster to compress/uncompress data than to actually access a slow disk. So if you already have a swap space, it gives you another performance tier inbetween direct-mapped pages and the slow swap. If you don't have a physical swap space, then reserving a little bit of RAM for compressed swap means that rarely used pages take up less space and you end up with more RAM available for the workload you want to run. > This is kind of the crux of my point in the previous email: Linux > with swap performs way worse for me - if I had 16GB of RAM in my > laptop, I bet it would perform better than my current 8GB with a > 16GB swap file - where, when the swap file gets around 8GB full, > the system as a whole starts to struggle. > > That's about a 50/50 split of VM space between RAM and swap. As I said above I agree that very few workloads would behave better from using using 1.75GB RAM plus 2.25GB zswap (storing maybe 6GB of data) compared to highmem. To deal with 4GB systems, we probably need either highmem or VMSPLIT_4G_4G. > > - A particularly important case to optimize for is 2GB of RAM with > > LPAE enabled. With CONFIG_VMSPLIT_2G and highmem, this > > leads to the paradox -ENOMEM when 256MB of highmem are > > full while plenty of lowmem is available. With highmem disabled, > > you avoid that at the cost of losing 12% of RAM. > > What happened to requests for memory from highmem being able to be > sourced from lowmem if highmem wasn't available? That used to be > standard kernel behaviour. AFAICT this is how it's supposed to work, but for some reason it doesn't always. I don't know the details, but have heard of recent complaints about it. I don't think it's the actual get_free_pages failing, but rather some heuristic looking at the number of free pages. > > - With 4GB+ of RAM and CONFIG_VMSPLIT_2G or > > CONFIG_VMSPLIT_3G, using gigabytes of RAM for swap > > space would usually be worse than highmem, but once > > we have VMSPLIT_4G_4G, it's the same situation as above > > with 6% of RAM used for zswap instead of highmem. > > I think the chances of that happening are very slim - I doubt there > is the will to invest the energy amongst what is left of the 32-bit > ARM community. Right. But I think it makes sense to discuss what it would take to do it anyway, and to see who would be interested in funding or implementing VMSPLIT_4G_4G. Whether it happens or not comes down to another tradeoff: Without it, we have to keep highmem around for a much long timer to support systems with 4GB of RAM along with systems that need both 2GB of physical RAM and 3GB of user address space, while adding VMSPLIT_4G_4G soon means we can probably kill off highmem after everybody with more 8GB of RAM or more has stopped upgrading kernels. Unlike the 2GB case, this is something we can realistically plan for. What is going to be much harder I fear is to find someone to implement it on MIPS32, which seems to be a decade ahead of 32-bit ARM in its decline, and also has a small number of users with 4GB or more, and architecturally it seems harder to implement or impossible depending on the type of MIPS MMU. Arnd
On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32) > to see if it can be done, and what the overhead is. This is probably > more work than the others combined, but also the most promising > as it allows the most user address space and physical ram to be used. A rough outline of such support (and likely to miss some corner cases): 1. Kernel runs with its own ASID and non-global page tables. 2. Trampoline code on exception entry/exit to handle the TTBR0 switching between user and kernel. 3. uaccess routines need to be reworked to pin the user pages in memory (get_user_pages()) and access them via the kernel address space. Point 3 is probably the ugliest and it would introduce a noticeable slowdown in certain syscalls.
On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote: > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32) > > to see if it can be done, and what the overhead is. This is probably > > more work than the others combined, but also the most promising > > as it allows the most user address space and physical ram to be used. > > A rough outline of such support (and likely to miss some corner cases): > > 1. Kernel runs with its own ASID and non-global page tables. > > 2. Trampoline code on exception entry/exit to handle the TTBR0 switching > between user and kernel. > > 3. uaccess routines need to be reworked to pin the user pages in memory > (get_user_pages()) and access them via the kernel address space. > > Point 3 is probably the ugliest and it would introduce a noticeable > slowdown in certain syscalls. We also need to consider that it has implications for the single-kernel support; a kernel doing this kind of switching would likely be horrid for a kernel supporting v6+ with VIPT aliasing caches. Would we be adding a new red line between kernels supporting VIPT-aliasing caches (present in earlier v6 implementations) and kernels using this system?
On Mon, Mar 09, 2020 at 04:09:19PM +0000, Russell King wrote: > On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote: > > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > > > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32) > > > to see if it can be done, and what the overhead is. This is probably > > > more work than the others combined, but also the most promising > > > as it allows the most user address space and physical ram to be used. > > > > A rough outline of such support (and likely to miss some corner cases): > > > > 1. Kernel runs with its own ASID and non-global page tables. > > > > 2. Trampoline code on exception entry/exit to handle the TTBR0 switching > > between user and kernel. > > > > 3. uaccess routines need to be reworked to pin the user pages in memory > > (get_user_pages()) and access them via the kernel address space. > > > > Point 3 is probably the ugliest and it would introduce a noticeable > > slowdown in certain syscalls. > > We also need to consider that it has implications for the single-kernel > support; a kernel doing this kind of switching would likely be horrid > for a kernel supporting v6+ with VIPT aliasing caches. Good point. I think with VIPT aliasing cache uaccess would have to flush the cache before/after access, depending on direction. > Would we be adding a new red line between kernels supporting > VIPT-aliasing caches (present in earlier v6 implementations) and > kernels using this system? get_user_pages() should handle the flush_dcache_page() call and the latter would dial with the aliases. But this adds heavily to the cost of the uaccess. Maybe some trick with temporarily locking the user page table and copying the user pmd into a dedicated kernel pmd, then accessing the user via this location. The fault handler would need to figure out the real user address and I'm not sure how we deal with the page table lock (or mmap_sem). An alternative to the above would be to have all uaccess routines in a trampoline which restores the user pgd but with only a couple of pmds for mapping the kernel address temporarily. This would avoid the issue of concurrent modification of the user page tables. Anyway, I don't think any of the above looks better than highmem.
On Mon, Mar 9, 2020 at 5:09 PM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote: > > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > > > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32) > > > to see if it can be done, and what the overhead is. This is probably > > > more work than the others combined, but also the most promising > > > as it allows the most user address space and physical ram to be used. > > > > A rough outline of such support (and likely to miss some corner cases): > > > > 1. Kernel runs with its own ASID and non-global page tables. > > > > 2. Trampoline code on exception entry/exit to handle the TTBR0 switching > > between user and kernel. > > > > 3. uaccess routines need to be reworked to pin the user pages in memory > > (get_user_pages()) and access them via the kernel address space. > > > > Point 3 is probably the ugliest and it would introduce a noticeable > > slowdown in certain syscalls. There are probably a number of ways to do the basic design. The idea I had (again, probably missing more corner cases than either of you two that actually understand the details of the mmu): - Assuming we have LPAE, run the kernel vmlinux and modules inside the vmalloc space, in the top 256MB or 512MB on TTBR1 - Map all the physical RAM (up to 3.75GB) into a reserved ASID with TTBR0 - Flip TTBR0 on kernel entry/exit, and again during user access. This is probably more work to implement than your idea, but I would hope this has a lower overhead on most microarchitectures as it doesn't require pinning the pages. Depending on the microarchitecture, I'd hope the overhead would be comparable to that of ARM64_SW_TTBR0_PAN. > We also need to consider that it has implications for the single-kernel > support; a kernel doing this kind of switching would likely be horrid > for a kernel supporting v6+ with VIPT aliasing caches. Would we be > adding a new red line between kernels supporting VIPT-aliasing caches > (present in earlier v6 implementations) and kernels using this system? I would initially do it for LPAE only, given that this is already an incompatible config option. I don't think there are any v6 machines with more than 1GB of RAM (the maximum for AST2500), and the only distro that ships a v6+ multiplatform kernel is Raspbian, which in turn needs a separate LPAE kernel for the large-memory machines anyway. Only doing it for LPAE would still cover the vast majority of systems that actually shipped with more than 2GB. There are a couple of exceptions, i.e. early Cubox i4x4, the Calxeda Highbank developer system and the Novena Laptop, which I would guess have a limited life expectancy (before users stop updating kernels) no longer than the 8GB Keystone-2. Based on that, I would hope that the ARMv7 distros can keep shipping the two kernel images they already ship: - The non-LPAE kernel modified to VMSPLIT_2G_OPT, not using highmem on anything up to 2GB, but still supporting the handful of remaining Cortex-A9s with 4GB using highmem until they are completely obsolete. - The LPAE kernel modified to use a newly added VMSPLIT_4G_4G, with details to be worked out. Most new systems tend to be based on Cortex-A7 with no more than 2GB, so those could run either configuration well. If we find the 2GB of user address space too limiting for the non-LPAE config, or I missed some important pre-LPAE systems with 4GB that need to be supported for longer than other highmem systems, that can probably be added later. Arnd
I am worried this went quite tangent to the original patch under discussion here, but let me clarify at least one point. On Mon 09-03-20 16:04:54, Arnd Bergmann wrote: > On Mon, Mar 9, 2020 at 3:05 PM Russell King - ARM Linux admin [...] > > What happened to requests for memory from highmem being able to be > > sourced from lowmem if highmem wasn't available? That used to be > > standard kernel behaviour. > > AFAICT this is how it's supposed to work, but for some reason it > doesn't always. I don't know the details, but have heard of recent > complaints about it. I don't think it's the actual get_free_pages > failing, but rather some heuristic looking at the number of free pages. This is indeed the case. There are low memory reserves which are not allowed for requests which can be satisfied from higher zones. This is the case for many many years. Just have a look at lowmem_reserve and their usage in __zone_watermark_ok. The layout of the reserves can be configured by /proc/sys/vm/lowmem_reserve_ratio. HTH
On Mon, Mar 09, 2020 at 08:46:18PM +0100, Arnd Bergmann wrote: > On Mon, Mar 9, 2020 at 5:09 PM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > On Mon, Mar 09, 2020 at 03:59:45PM +0000, Catalin Marinas wrote: > > > On Sun, Mar 08, 2020 at 11:58:52AM +0100, Arnd Bergmann wrote: > > > > - revisit CONFIG_VMSPLIT_4G_4G for arm32 (and maybe mips32) > > > > to see if it can be done, and what the overhead is. This is probably > > > > more work than the others combined, but also the most promising > > > > as it allows the most user address space and physical ram to be used. > > > > > > A rough outline of such support (and likely to miss some corner cases): > > > > > > 1. Kernel runs with its own ASID and non-global page tables. > > > > > > 2. Trampoline code on exception entry/exit to handle the TTBR0 switching > > > between user and kernel. > > > > > > 3. uaccess routines need to be reworked to pin the user pages in memory > > > (get_user_pages()) and access them via the kernel address space. > > > > > > Point 3 is probably the ugliest and it would introduce a noticeable > > > slowdown in certain syscalls. > > There are probably a number of ways to do the basic design. The idea > I had (again, probably missing more corner cases than either of you > two that actually understand the details of the mmu): > > - Assuming we have LPAE, run the kernel vmlinux and modules inside > the vmalloc space, in the top 256MB or 512MB on TTBR1 > > - Map all the physical RAM (up to 3.75GB) into a reserved ASID > with TTBR0 > > - Flip TTBR0 on kernel entry/exit, and again during user access. > > This is probably more work to implement than your idea, but > I would hope this has a lower overhead on most microarchitectures > as it doesn't require pinning the pages. Depending on the > microarchitecture, I'd hope the overhead would be comparable > to that of ARM64_SW_TTBR0_PAN. This still doesn't solve the copy_{from,to}_user() case where both address spaces need to be available during copy. So you either pin the user pages in memory and access them via the kernel mapping or you temporarily map (kmap?) the destination/source kernel address. The overhead I'd expect to be significantly greater than ARM64_SW_TTBR0_PAN for the uaccess routines. For user entry/exit, your suggestion is probably comparable with SW PAN.
On Wed, Mar 11, 2020 at 3:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > - Flip TTBR0 on kernel entry/exit, and again during user access. > > > > This is probably more work to implement than your idea, but > > I would hope this has a lower overhead on most microarchitectures > > as it doesn't require pinning the pages. Depending on the > > microarchitecture, I'd hope the overhead would be comparable > > to that of ARM64_SW_TTBR0_PAN. > > This still doesn't solve the copy_{from,to}_user() case where both > address spaces need to be available during copy. So you either pin the > user pages in memory and access them via the kernel mapping or you > temporarily map (kmap?) the destination/source kernel address. The > overhead I'd expect to be significantly greater than ARM64_SW_TTBR0_PAN > for the uaccess routines. For user entry/exit, your suggestion is > probably comparable with SW PAN. Good point, that is indeed a larger overhead. The simplest implementation I had in mind would use the code from arch/arm/lib/copy_from_user.S and flip ttbr0 between each ldm and stm (up to 32 bytes), but I have no idea of the cost of storing to ttbr0, so this might be even more expensive. Do you have an estimate of how long writing to TTBR0_64 takes on Cortex-A7 and A15, respectively? Another way might be to use a use a temporary buffer that is already mapped, and add a memcpy() through L1-cache to reduce the number of ttbr0 changes. The buffer would probably have to be on the stack, which limits the size, but for large copies get_user_pages()+memcpy() may end up being faster anyway. Arnd
On Wed, Mar 11, 2020 at 05:59:53PM +0100, Arnd Bergmann wrote: > On Wed, Mar 11, 2020 at 3:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > - Flip TTBR0 on kernel entry/exit, and again during user access. > > > > > > This is probably more work to implement than your idea, but > > > I would hope this has a lower overhead on most microarchitectures > > > as it doesn't require pinning the pages. Depending on the > > > microarchitecture, I'd hope the overhead would be comparable > > > to that of ARM64_SW_TTBR0_PAN. > > > > This still doesn't solve the copy_{from,to}_user() case where both > > address spaces need to be available during copy. So you either pin the > > user pages in memory and access them via the kernel mapping or you > > temporarily map (kmap?) the destination/source kernel address. The > > overhead I'd expect to be significantly greater than ARM64_SW_TTBR0_PAN > > for the uaccess routines. For user entry/exit, your suggestion is > > probably comparable with SW PAN. > > Good point, that is indeed a larger overhead. The simplest implementation > I had in mind would use the code from arch/arm/lib/copy_from_user.S and > flip ttbr0 between each ldm and stm (up to 32 bytes), but I have no idea > of the cost of storing to ttbr0, so this might be even more expensive. Do you > have an estimate of how long writing to TTBR0_64 takes on Cortex-A7 > and A15, respectively? I don't have numbers but it's usually not cheap since you need an ISB to synchronise the context after TTBR0 update (basically flushing the pipeline). > Another way might be to use a use a temporary buffer that is already > mapped, and add a memcpy() through L1-cache to reduce the number > of ttbr0 changes. The buffer would probably have to be on the stack, > which limits the size, but for large copies get_user_pages()+memcpy() > may end up being faster anyway. IIRC, the x86 attempt from Ingo some years ago was using get_user_pages() for uaccess. Depending on the size of the buffer, this may be faster than copying twice.
On Wed, Mar 11, 2020 at 6:26 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > On Wed, Mar 11, 2020 at 05:59:53PM +0100, Arnd Bergmann wrote: > > On Wed, Mar 11, 2020 at 3:29 PM Catalin Marinas <catalin.marinas@arm.com> wrote: > > Do you have an estimate of how long writing to TTBR0_64 takes on Cortex-A7 > > and A15, respectively? > > I don't have numbers but it's usually not cheap since you need an ISB to > synchronise the context after TTBR0 update (basically flushing the > pipeline). Ok. > > Another way might be to use a use a temporary buffer that is already > > mapped, and add a memcpy() through L1-cache to reduce the number > > of ttbr0 changes. The buffer would probably have to be on the stack, > > which limits the size, but for large copies get_user_pages()+memcpy() > > may end up being faster anyway. > > IIRC, the x86 attempt from Ingo some years ago was using > get_user_pages() for uaccess. Depending on the size of the buffer, this > may be faster than copying twice. I guess the tradeoffs for that were rather different, as x86 back then had no ASIDs, so changing the page tables required a full TLB flush. Arnd
On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote: > The VFS inode shrinker is currently allowed to reclaim inodes with > populated page cache. As a result it can drop gigabytes of hot and > active page cache on the floor without consulting the VM (recorded as > "inodesteal" events in /proc/vmstat). I'm sending a rebased version of this patch. We've been running with this change in the Facebook fleet since February with no ill side effects observed. However, I just spent several hours chasing a mysterious reclaim problem that turned out to be this bug again on an unpatched system. In the scenario I was debugging, the problem wasn't that we were losing cache, but that we were losing the non-resident information for previously evicted cache. I understood the file set enough to know it was thrashing like crazy, but it didn't register as refaults to the kernel. Without detecting the refaults, reclaim wouldn't start swapping to relieve the struggling cache (plenty of cold anon memory around). It also meant the IO delays of those refaults didn't contribute to memory pressure in psi, which made userspace blind to the situation as well. The first aspect means we can get stuck in pathological thrashing, the second means userspace OOM detection breaks and we can leave servers (or Android devices, for that matter) hopelessly livelocked. New patch attached below. I hope we can get this fixed in 5.8, it's really quite a big hole in our cache management strategy. --- From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Tue, 11 Feb 2020 12:55:07 -0500 Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU The VFS inode shrinker is currently allowed to reclaim cold inodes with populated page cache. This behavior goes back to CONFIG_HIGHMEM setups, which required the ability to drop page cache in large highem zones to free up struct inodes in comparatively tiny lowmem zones. However, it has significant side effects that are hard to justify on systems without highmem: - It can drop gigabytes of hot and active page cache on the floor without consulting the VM (recorded as "inodesteal" events in /proc/vmstat). Such an "aging inversion" between unreferenced inodes holding hot cache easily happens in practice: for example, a git tree whose objects are accessed frequently but no open file descriptors are maintained throughout. - It can als prematurely drop non-resident info stored inside the inodes, which can let massive cache thrashing go unnoticed. This in turn means we're making the wrong decisions in page reclaim and can get stuck in pathological thrashing. The thrashing also doesn't show up as memory pressure in psi, causing failure in the userspace OOM detection chain, which can leave a system (or Android device e.g.) stranded in a livelock. History As this keeps causing problems for people, there have been several attempts to address this. One recent attempt was to make the inode shrinker simply skip over inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim inodes with many attached pages"). However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many attached pages"") because it caused severe reclaim performance problems: Inodes that sit on the shrinker LRU are attracting reclaim pressure away from the page cache and toward the VFS. If we then permanently exempt sizable portions of this pool from actually getting reclaimed when looked at, this pressure accumulates as deferred shrinker work (a mechanism for *temporarily* unreclaimable objects) until it causes mayhem in the VFS cache pools. In the bug quoted in 69056ee6a8a3 in particular, the excessive pressure drove the XFS shrinker into dirty objects, where it caused synchronous, IO-bound stalls, even as there was plenty of clean page cache that should have been reclaimed instead. Another variant of this problem was recently observed, where the kernel violates cgroups' memory.low protection settings and reclaims page cache way beyond the configured thresholds. It was followed by a proposal of a modified form of the reverted commit above, that implements memory.low-sensitive shrinker skipping over populated inodes on the LRU [1]. However, this proposal continues to run the risk of attracting disproportionate reclaim pressure to a pool of still-used inodes, while not addressing the more generic reclaim inversion problem outside of a very specific cgroup application. [1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/ Solution This patch fixes the aging inversion described above on !CONFIG_HIGHMEM systems, without reintroducing the problems associated with excessive shrinker LRU rotations, by keeping populated inodes off the shrinker LRUs entirely. Currently, inodes are kept off the shrinker LRU as long as they have an elevated i_count, indicating an active user. Unfortunately, the page cache cannot simply hold an i_count reference, because unlink() *should* result in the inode being dropped and its cache invalidated. Instead, this patch makes iput_final() consult the state of the page cache and punt the LRU linking to the VM if the inode is still populated; the VM in turn checks the inode state when it depopulates the page cache, and adds the inode to the LRU if necessary. This is not unlike what we do for dirty inodes, which are moved off the LRU permanently until writeback completion puts them back on (iff still unused). We can reuse the same code -- inode_add_lru() - here. This is also not unlike page reclaim, where the lower VM layer has to negotiate state with the higher VFS layer. Follow existing precedence and handle the inversion as much as possible on the VM side: - introduce an I_PAGES flag that the VM maintains under the i_lock, so that any inode code holding that lock can check the page cache state without having to lock and inspect the struct address_space - introduce inode_pages_set() and inode_pages_clear() to maintain the inode LRU state from the VM side, then update all cache mutators to use them when populating the first cache entry or clearing the last With this, the concept of "inodesteal" - where the inode shrinker drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM is in charge of the cache, the shrinker in charge of struct inode. Footnotes - For debuggability, add vmstat counters that track the number of times a new cache entry pulls a previously unused inode off the LRU (pginoderescue), as well as how many times existing cache deferred an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters for backwards compatibility, but they'll just show 0 now. - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page cache. Not doing so has always been a bit strange, but since most people drop cache and metadata cache together, the inode shrinker would have taken care of them before - no more, so do it VM-side. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- fs/block_dev.c | 2 +- fs/dax.c | 14 +++++ fs/drop_caches.c | 2 +- fs/inode.c | 112 +++++++++++++++++++++++++++++++--- fs/internal.h | 2 +- include/linux/fs.h | 17 ++++++ include/linux/pagemap.h | 2 +- include/linux/vm_event_item.h | 3 +- mm/filemap.c | 37 ++++++++--- mm/huge_memory.c | 3 +- mm/truncate.c | 34 ++++++++--- mm/vmscan.c | 6 +- mm/vmstat.c | 4 +- mm/workingset.c | 4 ++ 14 files changed, 208 insertions(+), 34 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 4c26dcd9dba3..e0f73647c848 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev) { struct address_space *mapping = bdev->bd_inode->i_mapping; - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) + if (mapping_empty(mapping)) return; invalidate_bh_lrus(); diff --git a/fs/dax.c b/fs/dax.c index 11b16729b86f..89d1245a96ce 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas, { unsigned long index = xas->xa_index; bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */ + int populated; void *entry; retry: + populated = 0; xas_lock_irq(xas); entry = get_unlocked_entry(xas, order); @@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas, xas_store(xas, NULL); /* undo the PMD join */ dax_wake_entry(xas, entry, true); mapping->nrexceptional--; + if (mapping_empty(mapping)) + populated = -1; entry = NULL; xas_set(xas, index); } @@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas, dax_lock_entry(xas, entry); if (xas_error(xas)) goto out_unlock; + if (mapping_empty(mapping)) + populated++; mapping->nrexceptional++; } out_unlock: xas_unlock_irq(xas); + if (populated == -1) + inode_pages_clear(mapping->inode); + else if (populated == 1) + inode_pages_set(mapping->inode); if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) goto retry; if (xas->xa_node == XA_ERROR(-ENOMEM)) @@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index, bool trunc) { XA_STATE(xas, &mapping->i_pages, index); + bool empty = false; int ret = 0; void *entry; @@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping, dax_disassociate_entry(entry, mapping, trunc); xas_store(&xas, NULL); mapping->nrexceptional--; + empty = mapping_empty(mapping); ret = 1; out: put_unlocked_entry(&xas, entry); xas_unlock_irq(&xas); + if (empty) + inode_pages_clear(mapping->host); return ret; } diff --git a/fs/drop_caches.c b/fs/drop_caches.c index f00fcc4a4f72..20e845ff43f2 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) * we need to reschedule to avoid softlockups. */ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || - (inode->i_mapping->nrpages == 0 && !need_resched())) { + (mapping_empty(inode->i_mapping) && !need_resched())) { spin_unlock(&inode->i_lock); continue; } diff --git a/fs/inode.c b/fs/inode.c index 9fcec07a9d7c..9da373244db7 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -431,26 +431,101 @@ static void inode_lru_list_add(struct inode *inode) inode->i_state |= I_REFERENCED; } +static void inode_lru_list_del(struct inode *inode) +{ + if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) + this_cpu_dec(nr_unused); +} + /* * Add inode to LRU if needed (inode is unused and clean). * * Needs inode->i_lock held. */ -void inode_add_lru(struct inode *inode) +bool inode_add_lru(struct inode *inode) { - if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC | - I_FREEING | I_WILL_FREE)) && - !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE) - inode_lru_list_add(inode); + if (inode->i_state & + (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES)) + return false; + if (atomic_read(&inode->i_count)) + return false; + if (!(inode->i_sb->s_flags & SB_ACTIVE)) + return false; + inode_lru_list_add(inode); + return true; } +/* + * Usually, inodes become reclaimable when they are no longer + * referenced and their page cache has been reclaimed. The following + * API allows the VM to communicate cache population state to the VFS. + * + * However, on CONFIG_HIGHMEM we can't wait for the page cache to go + * away: cache pages allocated in a large highmem zone could pin + * struct inode memory allocated in relatively small lowmem zones. So + * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime. + */ -static void inode_lru_list_del(struct inode *inode) +#ifndef CONFIG_HIGHMEM +/** + * inode_pages_set - mark the inode as holding page cache + * @inode: the inode whose first cache page was just added + * + * Tell the VFS that this inode has populated page cache and must not + * be reclaimed by the inode shrinker. + * + * The caller must hold the page lock of the just-added page: by + * pinning the page, the page cache cannot become depopulated, and we + * can safely set I_PAGES without a race check under the i_pages lock. + * + * This function acquires the i_lock. + */ +void inode_pages_set(struct inode *inode) { + spin_lock(&inode->i_lock); + if (!(inode->i_state & I_PAGES)) { + inode->i_state |= I_PAGES; + if (!list_empty(&inode->i_lru)) { + count_vm_event(PGINODERESCUE); + inode_lru_list_del(inode); + } + } + spin_unlock(&inode->i_lock); +} - if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) - this_cpu_dec(nr_unused); +/** + * inode_pages_clear - mark the inode as not holding page cache + * @inode: the inode whose last cache page was just removed + * + * Tell the VFS that the inode no longer holds page cache and that its + * lifetime is to be handed over to the inode shrinker LRU. + * + * This function acquires the i_lock and the i_pages lock. + */ +void inode_pages_clear(struct inode *inode) +{ + struct address_space *mapping = &inode->i_data; + bool add_to_lru = false; + unsigned long flags; + + spin_lock(&inode->i_lock); + + xa_lock_irqsave(&mapping->i_pages, flags); + if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) { + inode->i_state &= ~I_PAGES; + add_to_lru = true; + } + xa_unlock_irqrestore(&mapping->i_pages, flags); + + if (add_to_lru) { + WARN_ON_ONCE(!list_empty(&inode->i_lru)); + if (inode_add_lru(inode)) + __count_vm_event(PGINODEDELAYED); + } + + spin_unlock(&inode->i_lock); } +#endif /* !CONFIG_HIGHMEM */ /** * inode_sb_list_add - add inode to the superblock list of inodes @@ -743,6 +818,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item, if (!spin_trylock(&inode->i_lock)) return LRU_SKIP; + WARN_ON_ONCE(inode->i_state & I_PAGES); + /* * Referenced or dirty inodes are still in use. Give them another pass * through the LRU as we canot reclaim them now. @@ -762,7 +839,18 @@ static enum lru_status inode_lru_isolate(struct list_head *item, return LRU_ROTATE; } - if (inode_has_buffers(inode) || inode->i_data.nrpages) { + /* + * Usually, populated inodes shouldn't be on the shrinker LRU, + * but they can be briefly visible when a new page is added to + * an inode that was already linked but inode_pages_set() + * hasn't run yet to move them off. + * + * The other exception is on HIGHMEM systems: highmem cache + * can pin lowmem struct inodes, and we might be in dire + * straits in the lower zones. Purge cache to free the inode. + */ + if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { +#ifdef CONFIG_HIGHMEM __iget(inode); spin_unlock(&inode->i_lock); spin_unlock(lru_lock); @@ -779,6 +867,12 @@ static enum lru_status inode_lru_isolate(struct list_head *item, iput(inode); spin_lock(lru_lock); return LRU_RETRY; +#else + list_lru_isolate(lru, &inode->i_lru); + spin_unlock(&inode->i_lock); + this_cpu_dec(nr_unused); + return LRU_REMOVED; +#endif } WARN_ON(inode->i_state & I_NEW); diff --git a/fs/internal.h b/fs/internal.h index aa5d45524e87..2ca77c0afb0b 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -137,7 +137,7 @@ extern int vfs_open(const struct path *, struct file *); * inode.c */ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); -extern void inode_add_lru(struct inode *inode); +extern bool inode_add_lru(struct inode *inode); extern int dentry_needs_remove_privs(struct dentry *dentry); /* diff --git a/include/linux/fs.h b/include/linux/fs.h index e4a1f28435fe..c3137ee87cb2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -592,6 +592,11 @@ static inline void mapping_allow_writable(struct address_space *mapping) atomic_inc(&mapping->i_mmap_writable); } +static inline bool mapping_empty(struct address_space *mapping) +{ + return mapping->nrpages + mapping->nrexceptional == 0; +} + /* * Use sequence counter to get consistent i_size on 32-bit processors. */ @@ -2165,6 +2170,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, * * I_CREATING New object's inode in the middle of setting up. * + * I_PAGES Inode is holding page cache that needs to get reclaimed + * first before the inode can go onto the shrinker LRU. + * * Q: What is the difference between I_WILL_FREE and I_FREEING? */ #define I_DIRTY_SYNC (1 << 0) @@ -2187,6 +2195,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, #define I_WB_SWITCH (1 << 13) #define I_OVL_INUSE (1 << 14) #define I_CREATING (1 << 15) +#define I_PAGES (1 << 16) #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES) @@ -3124,6 +3133,14 @@ static inline void remove_inode_hash(struct inode *inode) __remove_inode_hash(inode); } +#ifndef CONFIG_HIGHMEM +extern void inode_pages_set(struct inode *inode); +extern void inode_pages_clear(struct inode *inode); +#else +static inline void inode_pages_set(struct inode *inode) {} +static inline void inode_pages_clear(struct inode *inode) {} +#endif + extern void inode_sb_list_add(struct inode *inode); #ifdef CONFIG_BLOCK diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c6348c50136f..c3646b79489e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -613,7 +613,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, int add_to_page_cache_lru(struct page *page, struct address_space *mapping, pgoff_t index, gfp_t gfp_mask); extern void delete_from_page_cache(struct page *page); -extern void __delete_from_page_cache(struct page *page, void *shadow); +extern bool __delete_from_page_cache(struct page *page, void *shadow); int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); void delete_from_page_cache_batch(struct address_space *mapping, struct pagevec *pvec); diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index ffef0f279747..de362c9cda29 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_NUMA PGSCAN_ZONE_RECLAIM_FAILED, #endif - PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL, + SLABS_SCANNED, + PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED, KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY, PAGEOUTRUN, PGROTATED, DROP_PAGECACHE, DROP_SLAB, diff --git a/mm/filemap.c b/mm/filemap.c index 792e22e1e3c0..9cc423dcb16c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -116,8 +116,8 @@ * ->tasklist_lock (memory_failure, collect_procs_ao) */ -static void page_cache_delete(struct address_space *mapping, - struct page *page, void *shadow) +static bool __must_check page_cache_delete(struct address_space *mapping, + struct page *page, void *shadow) { XA_STATE(xas, &mapping->i_pages, page->index); unsigned int nr = 1; @@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping, smp_wmb(); } mapping->nrpages -= nr; + + return mapping_empty(mapping); } static void unaccount_page_cache_page(struct address_space *mapping, @@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping, * Delete a page from the page cache and free it. Caller has to make * sure the page is locked and that nobody else uses it - or that usage * is safe. The caller must hold the i_pages lock. + * + * If this returns true, the caller must call inode_pages_clear() + * after dropping the i_pages lock. */ -void __delete_from_page_cache(struct page *page, void *shadow) +bool __must_check __delete_from_page_cache(struct page *page, void *shadow) { struct address_space *mapping = page->mapping; trace_mm_filemap_delete_from_page_cache(page); unaccount_page_cache_page(mapping, page); - page_cache_delete(mapping, page, shadow); + return page_cache_delete(mapping, page, shadow); } static void page_cache_free_page(struct address_space *mapping, @@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page) { struct address_space *mapping = page_mapping(page); unsigned long flags; + bool empty; BUG_ON(!PageLocked(page)); xa_lock_irqsave(&mapping->i_pages, flags); - __delete_from_page_cache(page, NULL); + empty = __delete_from_page_cache(page, NULL); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + page_cache_free_page(mapping, page); } EXPORT_SYMBOL(delete_from_page_cache); @@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache); * * The function expects the i_pages lock to be held. */ -static void page_cache_delete_batch(struct address_space *mapping, - struct pagevec *pvec) +static bool __must_check page_cache_delete_batch(struct address_space *mapping, + struct pagevec *pvec) { XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index); int total_pages = 0; @@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping, total_pages++; } mapping->nrpages -= total_pages; + + return mapping_empty(mapping); } void delete_from_page_cache_batch(struct address_space *mapping, struct pagevec *pvec) { int i; + bool empty; unsigned long flags; if (!pagevec_count(pvec)) @@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping, unaccount_page_cache_page(mapping, pvec->pages[i]); } - page_cache_delete_batch(mapping, pvec); + empty = page_cache_delete_batch(mapping, pvec); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + for (i = 0; i < pagevec_count(pvec); i++) page_cache_free_page(mapping, pvec->pages[i]); } @@ -833,6 +848,7 @@ static int __add_to_page_cache_locked(struct page *page, { XA_STATE(xas, &mapping->i_pages, offset); int huge = PageHuge(page); + bool populated = false; int error; void *old; @@ -859,6 +875,7 @@ static int __add_to_page_cache_locked(struct page *page, if (xas_error(&xas)) goto unlock; + populated = mapping_empty(mapping); if (xa_is_value(old)) { mapping->nrexceptional--; if (shadowp) @@ -879,6 +896,10 @@ static int __add_to_page_cache_locked(struct page *page, } trace_mm_filemap_add_to_page_cache(page); + + if (populated) + inode_pages_set(mapping->host); + return 0; error: page->mapping = NULL; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 23763452b52a..fbe9a7297759 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2406,7 +2406,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, /* Some pages can be beyond i_size: drop them from page cache */ if (head[i].index >= end) { ClearPageDirty(head + i); - __delete_from_page_cache(head + i, NULL); + /* We know we're not removing the last page */ + (void)__delete_from_page_cache(head + i, NULL); if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head)) shmem_uncharge(head->mapping->host, 1); put_page(head + i); diff --git a/mm/truncate.c b/mm/truncate.c index dd9ebc1da356..8fb6c2f762bc 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -31,24 +31,31 @@ * itself locked. These unlocked entries need verification under the tree * lock. */ -static inline void __clear_shadow_entry(struct address_space *mapping, - pgoff_t index, void *entry) +static bool __must_check __clear_shadow_entry(struct address_space *mapping, + pgoff_t index, void *entry) { XA_STATE(xas, &mapping->i_pages, index); xas_set_update(&xas, workingset_update_node); if (xas_load(&xas) != entry) - return; + return 0; xas_store(&xas, NULL); mapping->nrexceptional--; + + return mapping_empty(mapping); } static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, void *entry) { + bool empty; + xa_lock_irq(&mapping->i_pages); - __clear_shadow_entry(mapping, index, entry); + empty = __clear_shadow_entry(mapping, index, entry); xa_unlock_irq(&mapping->i_pages); + + if (empty) + inode_pages_clear(mapping->host); } /* @@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, pgoff_t end) { int i, j; - bool dax, lock; + bool dax, lock, empty = false; /* Handled by shmem itself */ if (shmem_mapping(mapping)) @@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, continue; } - __clear_shadow_entry(mapping, index, page); + if (__clear_shadow_entry(mapping, index, page)) + empty = true; } if (lock) xa_unlock_irq(&mapping->i_pages); + + if (empty) + inode_pages_clear(mapping->host); + pvec->nr = j; } @@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping, pgoff_t index; int i; - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) + if (mapping_empty(mapping)) goto out; /* Offsets within partial pages */ @@ -636,6 +648,7 @@ static int invalidate_complete_page2(struct address_space *mapping, struct page *page) { unsigned long flags; + bool empty; if (page->mapping != mapping) return 0; @@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) goto failed; BUG_ON(page_has_private(page)); - __delete_from_page_cache(page, NULL); + empty = __delete_from_page_cache(page, NULL); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + if (mapping->a_ops->freepage) mapping->a_ops->freepage(page); @@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, int ret2 = 0; int did_range_unmap = 0; - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) + if (mapping_empty(mapping)) goto out; pagevec_init(&pvec); diff --git a/mm/vmscan.c b/mm/vmscan.c index cc555903a332..2ae2df605006 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, } else { void (*freepage)(struct page *); void *shadow = NULL; + int empty; freepage = mapping->a_ops->freepage; /* @@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (reclaimed && page_is_file_lru(page) && !mapping_exiting(mapping) && !dax_mapping(mapping)) shadow = workingset_eviction(page, target_memcg); - __delete_from_page_cache(page, shadow); + empty = __delete_from_page_cache(page, shadow); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + if (freepage != NULL) freepage(page); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 35219271796f..0fdbec92265a 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1205,9 +1205,11 @@ const char * const vmstat_text[] = { #ifdef CONFIG_NUMA "zone_reclaim_failed", #endif - "pginodesteal", "slabs_scanned", + "pginodesteal", "kswapd_inodesteal", + "pginoderescue", + "pginodedelayed", "kswapd_low_wmark_hit_quickly", "kswapd_high_wmark_hit_quickly", "pageoutrun", diff --git a/mm/workingset.c b/mm/workingset.c index 474186b76ced..7ce9c74ebfdb 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, struct xa_node *node = container_of(item, struct xa_node, private_list); XA_STATE(xas, node->array, 0); struct address_space *mapping; + bool empty = false; int ret; /* @@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, if (WARN_ON_ONCE(node->count != node->nr_values)) goto out_invalid; mapping->nrexceptional -= node->nr_values; + empty = mapping_empty(mapping); xas.xa_node = xa_parent_locked(&mapping->i_pages, node); xas.xa_offset = node->offset; xas.xa_shift = node->shift + XA_CHUNK_SHIFT; @@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, out_invalid: xa_unlock_irq(&mapping->i_pages); + if (empty) + inode_pages_clear(mapping->host); ret = LRU_REMOVED_RETRY; out: cond_resched();
On Wed, May 13, 2020 at 5:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote: > > The VFS inode shrinker is currently allowed to reclaim inodes with > > populated page cache. As a result it can drop gigabytes of hot and > > active page cache on the floor without consulting the VM (recorded as > > "inodesteal" events in /proc/vmstat). > > I'm sending a rebased version of this patch. > > We've been running with this change in the Facebook fleet since > February with no ill side effects observed. > > However, I just spent several hours chasing a mysterious reclaim > problem that turned out to be this bug again on an unpatched system. > > In the scenario I was debugging, the problem wasn't that we were > losing cache, but that we were losing the non-resident information for > previously evicted cache. > > I understood the file set enough to know it was thrashing like crazy, > but it didn't register as refaults to the kernel. Without detecting > the refaults, reclaim wouldn't start swapping to relieve the > struggling cache (plenty of cold anon memory around). It also meant > the IO delays of those refaults didn't contribute to memory pressure > in psi, which made userspace blind to the situation as well. > > The first aspect means we can get stuck in pathological thrashing, the > second means userspace OOM detection breaks and we can leave servers > (or Android devices, for that matter) hopelessly livelocked. > > New patch attached below. I hope we can get this fixed in 5.8, it's > really quite a big hole in our cache management strategy. > > --- > From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001 > From: Johannes Weiner <hannes@cmpxchg.org> > Date: Tue, 11 Feb 2020 12:55:07 -0500 > Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU > > The VFS inode shrinker is currently allowed to reclaim cold inodes > with populated page cache. This behavior goes back to CONFIG_HIGHMEM > setups, which required the ability to drop page cache in large highem > zones to free up struct inodes in comparatively tiny lowmem zones. > > However, it has significant side effects that are hard to justify on > systems without highmem: > > - It can drop gigabytes of hot and active page cache on the floor > without consulting the VM (recorded as "inodesteal" events in > /proc/vmstat). Such an "aging inversion" between unreferenced inodes > holding hot cache easily happens in practice: for example, a git tree > whose objects are accessed frequently but no open file descriptors are > maintained throughout. > Hi Johannes, I think it is reasonable to keep inodes with _active_ page cache off the inode shrinker LRU, but I'm not sure whether it is proper to keep the inodes with _only_ inactive page cache off the inode list lru neither. Per my understanding, if the inode has only inactive page cache, then invalidate all these inactive page cache could save the reclaimer's time, IOW, it may improve the performance in this case. > - It can als prematurely drop non-resident info stored inside the > inodes, which can let massive cache thrashing go unnoticed. This in > turn means we're making the wrong decisions in page reclaim and can > get stuck in pathological thrashing. The thrashing also doesn't show > up as memory pressure in psi, causing failure in the userspace OOM > detection chain, which can leave a system (or Android device e.g.) > stranded in a livelock. > > History > > As this keeps causing problems for people, there have been several > attempts to address this. > > One recent attempt was to make the inode shrinker simply skip over > inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim > inodes with many attached pages"). > > However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm: > don't reclaim inodes with many attached pages"") because it caused > severe reclaim performance problems: Inodes that sit on the shrinker > LRU are attracting reclaim pressure away from the page cache and > toward the VFS. If we then permanently exempt sizable portions of this > pool from actually getting reclaimed when looked at, this pressure > accumulates as deferred shrinker work (a mechanism for *temporarily* > unreclaimable objects) until it causes mayhem in the VFS cache pools. > > In the bug quoted in 69056ee6a8a3 in particular, the excessive > pressure drove the XFS shrinker into dirty objects, where it caused > synchronous, IO-bound stalls, even as there was plenty of clean page > cache that should have been reclaimed instead. > > Another variant of this problem was recently observed, where the > kernel violates cgroups' memory.low protection settings and reclaims > page cache way beyond the configured thresholds. It was followed by a > proposal of a modified form of the reverted commit above, that > implements memory.low-sensitive shrinker skipping over populated > inodes on the LRU [1]. However, this proposal continues to run the > risk of attracting disproportionate reclaim pressure to a pool of > still-used inodes, while not addressing the more generic reclaim > inversion problem outside of a very specific cgroup application. > > [1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/ > > Solution > > This patch fixes the aging inversion described above on > !CONFIG_HIGHMEM systems, without reintroducing the problems associated > with excessive shrinker LRU rotations, by keeping populated inodes off > the shrinker LRUs entirely. > > Currently, inodes are kept off the shrinker LRU as long as they have > an elevated i_count, indicating an active user. Unfortunately, the > page cache cannot simply hold an i_count reference, because unlink() > *should* result in the inode being dropped and its cache invalidated. > > Instead, this patch makes iput_final() consult the state of the page > cache and punt the LRU linking to the VM if the inode is still > populated; the VM in turn checks the inode state when it depopulates > the page cache, and adds the inode to the LRU if necessary. > > This is not unlike what we do for dirty inodes, which are moved off > the LRU permanently until writeback completion puts them back on (iff > still unused). We can reuse the same code -- inode_add_lru() - here. > > This is also not unlike page reclaim, where the lower VM layer has to > negotiate state with the higher VFS layer. Follow existing precedence > and handle the inversion as much as possible on the VM side: > > - introduce an I_PAGES flag that the VM maintains under the i_lock, so > that any inode code holding that lock can check the page cache state > without having to lock and inspect the struct address_space > > - introduce inode_pages_set() and inode_pages_clear() to maintain the > inode LRU state from the VM side, then update all cache mutators to > use them when populating the first cache entry or clearing the last > > With this, the concept of "inodesteal" - where the inode shrinker > drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM > is in charge of the cache, the shrinker in charge of struct inode. > > Footnotes > > - For debuggability, add vmstat counters that track the number of > times a new cache entry pulls a previously unused inode off the LRU > (pginoderescue), as well as how many times existing cache deferred > an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters > for backwards compatibility, but they'll just show 0 now. > > - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page > cache. Not doing so has always been a bit strange, but since most > people drop cache and metadata cache together, the inode shrinker > would have taken care of them before - no more, so do it VM-side. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > fs/block_dev.c | 2 +- > fs/dax.c | 14 +++++ > fs/drop_caches.c | 2 +- > fs/inode.c | 112 +++++++++++++++++++++++++++++++--- > fs/internal.h | 2 +- > include/linux/fs.h | 17 ++++++ > include/linux/pagemap.h | 2 +- > include/linux/vm_event_item.h | 3 +- > mm/filemap.c | 37 ++++++++--- > mm/huge_memory.c | 3 +- > mm/truncate.c | 34 ++++++++--- > mm/vmscan.c | 6 +- > mm/vmstat.c | 4 +- > mm/workingset.c | 4 ++ > 14 files changed, 208 insertions(+), 34 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 4c26dcd9dba3..e0f73647c848 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev) > { > struct address_space *mapping = bdev->bd_inode->i_mapping; > > - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) > + if (mapping_empty(mapping)) > return; > > invalidate_bh_lrus(); > diff --git a/fs/dax.c b/fs/dax.c > index 11b16729b86f..89d1245a96ce 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas, > { > unsigned long index = xas->xa_index; > bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */ > + int populated; > void *entry; > > retry: > + populated = 0; > xas_lock_irq(xas); > entry = get_unlocked_entry(xas, order); > > @@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas, > xas_store(xas, NULL); /* undo the PMD join */ > dax_wake_entry(xas, entry, true); > mapping->nrexceptional--; > + if (mapping_empty(mapping)) > + populated = -1; > entry = NULL; > xas_set(xas, index); > } > @@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas, > dax_lock_entry(xas, entry); > if (xas_error(xas)) > goto out_unlock; > + if (mapping_empty(mapping)) > + populated++; > mapping->nrexceptional++; > } > > out_unlock: > xas_unlock_irq(xas); > + if (populated == -1) > + inode_pages_clear(mapping->inode); > + else if (populated == 1) > + inode_pages_set(mapping->inode); > if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) > goto retry; > if (xas->xa_node == XA_ERROR(-ENOMEM)) > @@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, > pgoff_t index, bool trunc) > { > XA_STATE(xas, &mapping->i_pages, index); > + bool empty = false; > int ret = 0; > void *entry; > > @@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping, > dax_disassociate_entry(entry, mapping, trunc); > xas_store(&xas, NULL); > mapping->nrexceptional--; > + empty = mapping_empty(mapping); > ret = 1; > out: > put_unlocked_entry(&xas, entry); > xas_unlock_irq(&xas); > + if (empty) > + inode_pages_clear(mapping->host); > return ret; > } > > diff --git a/fs/drop_caches.c b/fs/drop_caches.c > index f00fcc4a4f72..20e845ff43f2 100644 > --- a/fs/drop_caches.c > +++ b/fs/drop_caches.c > @@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) > * we need to reschedule to avoid softlockups. > */ > if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || > - (inode->i_mapping->nrpages == 0 && !need_resched())) { > + (mapping_empty(inode->i_mapping) && !need_resched())) { > spin_unlock(&inode->i_lock); > continue; > } > diff --git a/fs/inode.c b/fs/inode.c > index 9fcec07a9d7c..9da373244db7 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -431,26 +431,101 @@ static void inode_lru_list_add(struct inode *inode) > inode->i_state |= I_REFERENCED; > } > > +static void inode_lru_list_del(struct inode *inode) > +{ > + if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) > + this_cpu_dec(nr_unused); > +} > + > /* > * Add inode to LRU if needed (inode is unused and clean). > * > * Needs inode->i_lock held. > */ > -void inode_add_lru(struct inode *inode) > +bool inode_add_lru(struct inode *inode) > { > - if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC | > - I_FREEING | I_WILL_FREE)) && > - !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE) > - inode_lru_list_add(inode); > + if (inode->i_state & > + (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES)) > + return false; > + if (atomic_read(&inode->i_count)) > + return false; > + if (!(inode->i_sb->s_flags & SB_ACTIVE)) > + return false; > + inode_lru_list_add(inode); > + return true; > } > > +/* > + * Usually, inodes become reclaimable when they are no longer > + * referenced and their page cache has been reclaimed. The following > + * API allows the VM to communicate cache population state to the VFS. > + * > + * However, on CONFIG_HIGHMEM we can't wait for the page cache to go > + * away: cache pages allocated in a large highmem zone could pin > + * struct inode memory allocated in relatively small lowmem zones. So > + * when CONFIG_HIGHMEM is enabled, we tie cache to the inode lifetime. > + */ > > -static void inode_lru_list_del(struct inode *inode) > +#ifndef CONFIG_HIGHMEM > +/** > + * inode_pages_set - mark the inode as holding page cache > + * @inode: the inode whose first cache page was just added > + * > + * Tell the VFS that this inode has populated page cache and must not > + * be reclaimed by the inode shrinker. > + * > + * The caller must hold the page lock of the just-added page: by > + * pinning the page, the page cache cannot become depopulated, and we > + * can safely set I_PAGES without a race check under the i_pages lock. > + * > + * This function acquires the i_lock. > + */ > +void inode_pages_set(struct inode *inode) > { > + spin_lock(&inode->i_lock); > + if (!(inode->i_state & I_PAGES)) { > + inode->i_state |= I_PAGES; > + if (!list_empty(&inode->i_lru)) { > + count_vm_event(PGINODERESCUE); > + inode_lru_list_del(inode); > + } > + } > + spin_unlock(&inode->i_lock); > +} > > - if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) > - this_cpu_dec(nr_unused); > +/** > + * inode_pages_clear - mark the inode as not holding page cache > + * @inode: the inode whose last cache page was just removed > + * > + * Tell the VFS that the inode no longer holds page cache and that its > + * lifetime is to be handed over to the inode shrinker LRU. > + * > + * This function acquires the i_lock and the i_pages lock. > + */ > +void inode_pages_clear(struct inode *inode) > +{ > + struct address_space *mapping = &inode->i_data; > + bool add_to_lru = false; > + unsigned long flags; > + > + spin_lock(&inode->i_lock); > + > + xa_lock_irqsave(&mapping->i_pages, flags); > + if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) { > + inode->i_state &= ~I_PAGES; > + add_to_lru = true; > + } > + xa_unlock_irqrestore(&mapping->i_pages, flags); > + > + if (add_to_lru) { > + WARN_ON_ONCE(!list_empty(&inode->i_lru)); > + if (inode_add_lru(inode)) > + __count_vm_event(PGINODEDELAYED); > + } > + > + spin_unlock(&inode->i_lock); > } > +#endif /* !CONFIG_HIGHMEM */ > > /** > * inode_sb_list_add - add inode to the superblock list of inodes > @@ -743,6 +818,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item, > if (!spin_trylock(&inode->i_lock)) > return LRU_SKIP; > > + WARN_ON_ONCE(inode->i_state & I_PAGES); > + > /* > * Referenced or dirty inodes are still in use. Give them another pass > * through the LRU as we canot reclaim them now. > @@ -762,7 +839,18 @@ static enum lru_status inode_lru_isolate(struct list_head *item, > return LRU_ROTATE; > } > > - if (inode_has_buffers(inode) || inode->i_data.nrpages) { > + /* > + * Usually, populated inodes shouldn't be on the shrinker LRU, > + * but they can be briefly visible when a new page is added to > + * an inode that was already linked but inode_pages_set() > + * hasn't run yet to move them off. > + * > + * The other exception is on HIGHMEM systems: highmem cache > + * can pin lowmem struct inodes, and we might be in dire > + * straits in the lower zones. Purge cache to free the inode. > + */ > + if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { > +#ifdef CONFIG_HIGHMEM > __iget(inode); > spin_unlock(&inode->i_lock); > spin_unlock(lru_lock); > @@ -779,6 +867,12 @@ static enum lru_status inode_lru_isolate(struct list_head *item, > iput(inode); > spin_lock(lru_lock); > return LRU_RETRY; > +#else > + list_lru_isolate(lru, &inode->i_lru); > + spin_unlock(&inode->i_lock); > + this_cpu_dec(nr_unused); > + return LRU_REMOVED; > +#endif > } > > WARN_ON(inode->i_state & I_NEW); > diff --git a/fs/internal.h b/fs/internal.h > index aa5d45524e87..2ca77c0afb0b 100644 > --- a/fs/internal.h > +++ b/fs/internal.h > @@ -137,7 +137,7 @@ extern int vfs_open(const struct path *, struct file *); > * inode.c > */ > extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); > -extern void inode_add_lru(struct inode *inode); > +extern bool inode_add_lru(struct inode *inode); > extern int dentry_needs_remove_privs(struct dentry *dentry); > > /* > diff --git a/include/linux/fs.h b/include/linux/fs.h > index e4a1f28435fe..c3137ee87cb2 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -592,6 +592,11 @@ static inline void mapping_allow_writable(struct address_space *mapping) > atomic_inc(&mapping->i_mmap_writable); > } > > +static inline bool mapping_empty(struct address_space *mapping) > +{ > + return mapping->nrpages + mapping->nrexceptional == 0; > +} > + > /* > * Use sequence counter to get consistent i_size on 32-bit processors. > */ > @@ -2165,6 +2170,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, > * > * I_CREATING New object's inode in the middle of setting up. > * > + * I_PAGES Inode is holding page cache that needs to get reclaimed > + * first before the inode can go onto the shrinker LRU. > + * > * Q: What is the difference between I_WILL_FREE and I_FREEING? > */ > #define I_DIRTY_SYNC (1 << 0) > @@ -2187,6 +2195,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, > #define I_WB_SWITCH (1 << 13) > #define I_OVL_INUSE (1 << 14) > #define I_CREATING (1 << 15) > +#define I_PAGES (1 << 16) > > #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) > #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES) > @@ -3124,6 +3133,14 @@ static inline void remove_inode_hash(struct inode *inode) > __remove_inode_hash(inode); > } > > +#ifndef CONFIG_HIGHMEM > +extern void inode_pages_set(struct inode *inode); > +extern void inode_pages_clear(struct inode *inode); > +#else > +static inline void inode_pages_set(struct inode *inode) {} > +static inline void inode_pages_clear(struct inode *inode) {} > +#endif > + > extern void inode_sb_list_add(struct inode *inode); > > #ifdef CONFIG_BLOCK > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index c6348c50136f..c3646b79489e 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -613,7 +613,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, > int add_to_page_cache_lru(struct page *page, struct address_space *mapping, > pgoff_t index, gfp_t gfp_mask); > extern void delete_from_page_cache(struct page *page); > -extern void __delete_from_page_cache(struct page *page, void *shadow); > +extern bool __delete_from_page_cache(struct page *page, void *shadow); > int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); > void delete_from_page_cache_batch(struct address_space *mapping, > struct pagevec *pvec); > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index ffef0f279747..de362c9cda29 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_NUMA > PGSCAN_ZONE_RECLAIM_FAILED, > #endif > - PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL, > + SLABS_SCANNED, > + PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED, > KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY, > PAGEOUTRUN, PGROTATED, > DROP_PAGECACHE, DROP_SLAB, > diff --git a/mm/filemap.c b/mm/filemap.c > index 792e22e1e3c0..9cc423dcb16c 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -116,8 +116,8 @@ > * ->tasklist_lock (memory_failure, collect_procs_ao) > */ > > -static void page_cache_delete(struct address_space *mapping, > - struct page *page, void *shadow) > +static bool __must_check page_cache_delete(struct address_space *mapping, > + struct page *page, void *shadow) > { > XA_STATE(xas, &mapping->i_pages, page->index); > unsigned int nr = 1; > @@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping, > smp_wmb(); > } > mapping->nrpages -= nr; > + > + return mapping_empty(mapping); > } > > static void unaccount_page_cache_page(struct address_space *mapping, > @@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping, > * Delete a page from the page cache and free it. Caller has to make > * sure the page is locked and that nobody else uses it - or that usage > * is safe. The caller must hold the i_pages lock. > + * > + * If this returns true, the caller must call inode_pages_clear() > + * after dropping the i_pages lock. > */ > -void __delete_from_page_cache(struct page *page, void *shadow) > +bool __must_check __delete_from_page_cache(struct page *page, void *shadow) > { > struct address_space *mapping = page->mapping; > > trace_mm_filemap_delete_from_page_cache(page); > > unaccount_page_cache_page(mapping, page); > - page_cache_delete(mapping, page, shadow); > + return page_cache_delete(mapping, page, shadow); > } > > static void page_cache_free_page(struct address_space *mapping, > @@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page) > { > struct address_space *mapping = page_mapping(page); > unsigned long flags; > + bool empty; > > BUG_ON(!PageLocked(page)); > xa_lock_irqsave(&mapping->i_pages, flags); > - __delete_from_page_cache(page, NULL); > + empty = __delete_from_page_cache(page, NULL); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > page_cache_free_page(mapping, page); > } > EXPORT_SYMBOL(delete_from_page_cache); > @@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache); > * > * The function expects the i_pages lock to be held. > */ > -static void page_cache_delete_batch(struct address_space *mapping, > - struct pagevec *pvec) > +static bool __must_check page_cache_delete_batch(struct address_space *mapping, > + struct pagevec *pvec) > { > XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index); > int total_pages = 0; > @@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping, > total_pages++; > } > mapping->nrpages -= total_pages; > + > + return mapping_empty(mapping); > } > > void delete_from_page_cache_batch(struct address_space *mapping, > struct pagevec *pvec) > { > int i; > + bool empty; > unsigned long flags; > > if (!pagevec_count(pvec)) > @@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping, > > unaccount_page_cache_page(mapping, pvec->pages[i]); > } > - page_cache_delete_batch(mapping, pvec); > + empty = page_cache_delete_batch(mapping, pvec); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > for (i = 0; i < pagevec_count(pvec); i++) > page_cache_free_page(mapping, pvec->pages[i]); > } > @@ -833,6 +848,7 @@ static int __add_to_page_cache_locked(struct page *page, > { > XA_STATE(xas, &mapping->i_pages, offset); > int huge = PageHuge(page); > + bool populated = false; > int error; > void *old; > > @@ -859,6 +875,7 @@ static int __add_to_page_cache_locked(struct page *page, > if (xas_error(&xas)) > goto unlock; > > + populated = mapping_empty(mapping); > if (xa_is_value(old)) { > mapping->nrexceptional--; > if (shadowp) > @@ -879,6 +896,10 @@ static int __add_to_page_cache_locked(struct page *page, > } > > trace_mm_filemap_add_to_page_cache(page); > + > + if (populated) > + inode_pages_set(mapping->host); > + > return 0; > error: > page->mapping = NULL; > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 23763452b52a..fbe9a7297759 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2406,7 +2406,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, > /* Some pages can be beyond i_size: drop them from page cache */ > if (head[i].index >= end) { > ClearPageDirty(head + i); > - __delete_from_page_cache(head + i, NULL); > + /* We know we're not removing the last page */ > + (void)__delete_from_page_cache(head + i, NULL); > if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head)) > shmem_uncharge(head->mapping->host, 1); > put_page(head + i); > diff --git a/mm/truncate.c b/mm/truncate.c > index dd9ebc1da356..8fb6c2f762bc 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -31,24 +31,31 @@ > * itself locked. These unlocked entries need verification under the tree > * lock. > */ > -static inline void __clear_shadow_entry(struct address_space *mapping, > - pgoff_t index, void *entry) > +static bool __must_check __clear_shadow_entry(struct address_space *mapping, > + pgoff_t index, void *entry) > { > XA_STATE(xas, &mapping->i_pages, index); > > xas_set_update(&xas, workingset_update_node); > if (xas_load(&xas) != entry) > - return; > + return 0; > xas_store(&xas, NULL); > mapping->nrexceptional--; > + > + return mapping_empty(mapping); > } > > static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, > void *entry) > { > + bool empty; > + > xa_lock_irq(&mapping->i_pages); > - __clear_shadow_entry(mapping, index, entry); > + empty = __clear_shadow_entry(mapping, index, entry); > xa_unlock_irq(&mapping->i_pages); > + > + if (empty) > + inode_pages_clear(mapping->host); > } > > /* > @@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, > pgoff_t end) > { > int i, j; > - bool dax, lock; > + bool dax, lock, empty = false; > > /* Handled by shmem itself */ > if (shmem_mapping(mapping)) > @@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, > continue; > } > > - __clear_shadow_entry(mapping, index, page); > + if (__clear_shadow_entry(mapping, index, page)) > + empty = true; > } > > if (lock) > xa_unlock_irq(&mapping->i_pages); > + > + if (empty) > + inode_pages_clear(mapping->host); > + > pvec->nr = j; > } > > @@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping, > pgoff_t index; > int i; > > - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) > + if (mapping_empty(mapping)) > goto out; > > /* Offsets within partial pages */ > @@ -636,6 +648,7 @@ static int > invalidate_complete_page2(struct address_space *mapping, struct page *page) > { > unsigned long flags; > + bool empty; > > if (page->mapping != mapping) > return 0; > @@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) > goto failed; > > BUG_ON(page_has_private(page)); > - __delete_from_page_cache(page, NULL); > + empty = __delete_from_page_cache(page, NULL); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > if (mapping->a_ops->freepage) > mapping->a_ops->freepage(page); > > @@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, > int ret2 = 0; > int did_range_unmap = 0; > > - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) > + if (mapping_empty(mapping)) > goto out; > > pagevec_init(&pvec); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index cc555903a332..2ae2df605006 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > } else { > void (*freepage)(struct page *); > void *shadow = NULL; > + int empty; > > freepage = mapping->a_ops->freepage; > /* > @@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > if (reclaimed && page_is_file_lru(page) && > !mapping_exiting(mapping) && !dax_mapping(mapping)) > shadow = workingset_eviction(page, target_memcg); > - __delete_from_page_cache(page, shadow); > + empty = __delete_from_page_cache(page, shadow); > xa_unlock_irqrestore(&mapping->i_pages, flags); > > + if (empty) > + inode_pages_clear(mapping->host); > + > if (freepage != NULL) > freepage(page); > } > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 35219271796f..0fdbec92265a 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1205,9 +1205,11 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_NUMA > "zone_reclaim_failed", > #endif > - "pginodesteal", > "slabs_scanned", > + "pginodesteal", > "kswapd_inodesteal", > + "pginoderescue", > + "pginodedelayed", > "kswapd_low_wmark_hit_quickly", > "kswapd_high_wmark_hit_quickly", > "pageoutrun", > diff --git a/mm/workingset.c b/mm/workingset.c > index 474186b76ced..7ce9c74ebfdb 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, > struct xa_node *node = container_of(item, struct xa_node, private_list); > XA_STATE(xas, node->array, 0); > struct address_space *mapping; > + bool empty = false; > int ret; > > /* > @@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, > if (WARN_ON_ONCE(node->count != node->nr_values)) > goto out_invalid; > mapping->nrexceptional -= node->nr_values; > + empty = mapping_empty(mapping); > xas.xa_node = xa_parent_locked(&mapping->i_pages, node); > xas.xa_offset = node->offset; > xas.xa_shift = node->shift + XA_CHUNK_SHIFT; > @@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, > > out_invalid: > xa_unlock_irq(&mapping->i_pages); > + if (empty) > + inode_pages_clear(mapping->host); > ret = LRU_REMOVED_RETRY; > out: > cond_resched(); > -- > 2.26.2 >
On Wed, May 13, 2020 at 09:32:58AM +0800, Yafang Shao wrote: > On Wed, May 13, 2020 at 5:29 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote: > > > The VFS inode shrinker is currently allowed to reclaim inodes with > > > populated page cache. As a result it can drop gigabytes of hot and > > > active page cache on the floor without consulting the VM (recorded as > > > "inodesteal" events in /proc/vmstat). > > > > I'm sending a rebased version of this patch. > > > > We've been running with this change in the Facebook fleet since > > February with no ill side effects observed. > > > > However, I just spent several hours chasing a mysterious reclaim > > problem that turned out to be this bug again on an unpatched system. > > > > In the scenario I was debugging, the problem wasn't that we were > > losing cache, but that we were losing the non-resident information for > > previously evicted cache. > > > > I understood the file set enough to know it was thrashing like crazy, > > but it didn't register as refaults to the kernel. Without detecting > > the refaults, reclaim wouldn't start swapping to relieve the > > struggling cache (plenty of cold anon memory around). It also meant > > the IO delays of those refaults didn't contribute to memory pressure > > in psi, which made userspace blind to the situation as well. > > > > The first aspect means we can get stuck in pathological thrashing, the > > second means userspace OOM detection breaks and we can leave servers > > (or Android devices, for that matter) hopelessly livelocked. > > > > New patch attached below. I hope we can get this fixed in 5.8, it's > > really quite a big hole in our cache management strategy. > > > > --- > > From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001 > > From: Johannes Weiner <hannes@cmpxchg.org> > > Date: Tue, 11 Feb 2020 12:55:07 -0500 > > Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU > > > > The VFS inode shrinker is currently allowed to reclaim cold inodes > > with populated page cache. This behavior goes back to CONFIG_HIGHMEM > > setups, which required the ability to drop page cache in large highem > > zones to free up struct inodes in comparatively tiny lowmem zones. > > > > However, it has significant side effects that are hard to justify on > > systems without highmem: > > > > - It can drop gigabytes of hot and active page cache on the floor > > without consulting the VM (recorded as "inodesteal" events in > > /proc/vmstat). Such an "aging inversion" between unreferenced inodes > > holding hot cache easily happens in practice: for example, a git tree > > whose objects are accessed frequently but no open file descriptors are > > maintained throughout. > > > > Hi Johannes, > > I think it is reasonable to keep inodes with _active_ page cache off > the inode shrinker LRU, but I'm not sure whether it is proper to keep > the inodes with _only_ inactive page cache off the inode list lru > neither. Per my understanding, if the inode has only inactive page > cache, then invalidate all these inactive page cache could save the > reclaimer's time, IOW, it may improve the performance in this case. The shrinker doesn't know whether pages are active or inactive. There is a PageActive() flag, but that's a sampled state that's only uptodate when page reclaim is running. All the active pages could be stale and getting deactivated on the next scan; all the inactive pages could have page table references that would get them activated on the next reclaim run etc. You'd have to duplicate aspects of page reclaim itself to be sure you're axing the right pages. It also wouldn't be a reliable optimization. This only happens when there is a disconnect between the inode and the cache life time, which is true for some situations but not others.
On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > ... > > Solution > > This patch fixes the aging inversion described above on > !CONFIG_HIGHMEM systems, without reintroducing the problems associated > with excessive shrinker LRU rotations, by keeping populated inodes off > the shrinker LRUs entirely. > > Currently, inodes are kept off the shrinker LRU as long as they have > an elevated i_count, indicating an active user. Unfortunately, the > page cache cannot simply hold an i_count reference, because unlink() > *should* result in the inode being dropped and its cache invalidated. > > Instead, this patch makes iput_final() consult the state of the page > cache and punt the LRU linking to the VM if the inode is still > populated; the VM in turn checks the inode state when it depopulates > the page cache, and adds the inode to the LRU if necessary. > > This is not unlike what we do for dirty inodes, which are moved off > the LRU permanently until writeback completion puts them back on (iff > still unused). We can reuse the same code -- inode_add_lru() - here. > > This is also not unlike page reclaim, where the lower VM layer has to > negotiate state with the higher VFS layer. Follow existing precedence > and handle the inversion as much as possible on the VM side: > > - introduce an I_PAGES flag that the VM maintains under the i_lock, so > that any inode code holding that lock can check the page cache state > without having to lock and inspect the struct address_space Maintaining the same info in two places is a hassle. Is this optimization worthwhile? > - introduce inode_pages_set() and inode_pages_clear() to maintain the > inode LRU state from the VM side, then update all cache mutators to > use them when populating the first cache entry or clearing the last > > With this, the concept of "inodesteal" - where the inode shrinker > drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM > is in charge of the cache, the shrinker in charge of struct inode. How tested is this on highmem machines? > Footnotes > > - For debuggability, add vmstat counters that track the number of > times a new cache entry pulls a previously unused inode off the LRU > (pginoderescue), as well as how many times existing cache deferred > an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters > for backwards compatibility, but they'll just show 0 now. > > - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page > cache. Not doing so has always been a bit strange, but since most > people drop cache and metadata cache together, the inode shrinker > would have taken care of them before - no more, so do it VM-side. > > ... > > 14 files changed, 208 insertions(+), 34 deletions(-) Patch is surprisingly large.
On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > + inode_pages_clear(mapping->inode); > + else if (populated == 1) > + inode_pages_set(mapping->inode); mapping->host... I have to assume this version wasn't runtime tested, so I'll drop it.
On Wed, May 13, 2020 at 07:24:10PM -0700, Andrew Morton wrote: > On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > + inode_pages_clear(mapping->inode); > > + else if (populated == 1) > > + inode_pages_set(mapping->inode); > > mapping->host... I'll fix that. > I have to assume this version wasn't runtime tested, so I'll drop it. It was, but I don't have DAX-suitable hardware so that config option isn't enabled anywhere. I'll try to test this with a brd. Sorry about that.
On Wed, May 13, 2020 at 02:15:19PM -0700, Andrew Morton wrote: > On Tue, 12 May 2020 17:29:36 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > ... > > > > Solution > > > > This patch fixes the aging inversion described above on > > !CONFIG_HIGHMEM systems, without reintroducing the problems associated > > with excessive shrinker LRU rotations, by keeping populated inodes off > > the shrinker LRUs entirely. > > > > Currently, inodes are kept off the shrinker LRU as long as they have > > an elevated i_count, indicating an active user. Unfortunately, the > > page cache cannot simply hold an i_count reference, because unlink() > > *should* result in the inode being dropped and its cache invalidated. > > > > Instead, this patch makes iput_final() consult the state of the page > > cache and punt the LRU linking to the VM if the inode is still > > populated; the VM in turn checks the inode state when it depopulates > > the page cache, and adds the inode to the LRU if necessary. > > > > This is not unlike what we do for dirty inodes, which are moved off > > the LRU permanently until writeback completion puts them back on (iff > > still unused). We can reuse the same code -- inode_add_lru() - here. > > > > This is also not unlike page reclaim, where the lower VM layer has to > > negotiate state with the higher VFS layer. Follow existing precedence > > and handle the inversion as much as possible on the VM side: > > > > - introduce an I_PAGES flag that the VM maintains under the i_lock, so > > that any inode code holding that lock can check the page cache state > > without having to lock and inspect the struct address_space > > Maintaining the same info in two places is a hassle. Is this > optimization worthwhile? Hm, maybe not. I'll try to get rid of it and test cache / LRU state directly. > > - introduce inode_pages_set() and inode_pages_clear() to maintain the > > inode LRU state from the VM side, then update all cache mutators to > > use them when populating the first cache entry or clearing the last > > > > With this, the concept of "inodesteal" - where the inode shrinker > > drops page cache - is relegated to CONFIG_HIGHMEM systems only. The VM > > is in charge of the cache, the shrinker in charge of struct inode. > > How tested is this on highmem machines? I don't have a highmem machine, but my code is ifdeffed out on CONFIG_HIGHMEM so the behavior shouldn't have changed there.
diff --git a/fs/block_dev.c b/fs/block_dev.c index 69bf2fb6f7cd..46f67147ad20 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -79,7 +79,7 @@ void kill_bdev(struct block_device *bdev) { struct address_space *mapping = bdev->bd_inode->i_mapping; - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) + if (mapping_empty(mapping)) return; invalidate_bh_lrus(); diff --git a/fs/dax.c b/fs/dax.c index 35da144375a0..7f3ce4612272 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -478,9 +478,11 @@ static void *grab_mapping_entry(struct xa_state *xas, { unsigned long index = xas->xa_index; bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */ + int populated; void *entry; retry: + populated = 0; xas_lock_irq(xas); entry = get_unlocked_entry(xas, order); @@ -526,6 +528,8 @@ static void *grab_mapping_entry(struct xa_state *xas, xas_store(xas, NULL); /* undo the PMD join */ dax_wake_entry(xas, entry, true); mapping->nrexceptional--; + if (mapping_empty(mapping)) + populated = -1; entry = NULL; xas_set(xas, index); } @@ -541,11 +545,17 @@ static void *grab_mapping_entry(struct xa_state *xas, dax_lock_entry(xas, entry); if (xas_error(xas)) goto out_unlock; + if (mapping_empty(mapping)) + populated++; mapping->nrexceptional++; } out_unlock: xas_unlock_irq(xas); + if (populated == -1) + inode_pages_clear(mapping->inode); + else if (populated == 1) + inode_pages_set(mapping->inode); if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) goto retry; if (xas->xa_node == XA_ERROR(-ENOMEM)) @@ -631,6 +641,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index, bool trunc) { XA_STATE(xas, &mapping->i_pages, index); + bool empty = false; int ret = 0; void *entry; @@ -645,10 +656,13 @@ static int __dax_invalidate_entry(struct address_space *mapping, dax_disassociate_entry(entry, mapping, trunc); xas_store(&xas, NULL); mapping->nrexceptional--; + empty = mapping_empty(mapping); ret = 1; out: put_unlocked_entry(&xas, entry); xas_unlock_irq(&xas); + if (empty) + inode_pages_clear(mapping->host); return ret; } diff --git a/fs/drop_caches.c b/fs/drop_caches.c index dc1a1d5d825b..a5e9e9053474 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -27,7 +27,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) * we need to reschedule to avoid softlockups. */ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || - (inode->i_mapping->nrpages == 0 && !need_resched())) { + (mapping_empty(inode->i_mapping) && !need_resched())) { spin_unlock(&inode->i_lock); continue; } diff --git a/fs/inode.c b/fs/inode.c index 7d57068b6b7a..575b780fa9bb 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -430,25 +430,87 @@ static void inode_lru_list_add(struct inode *inode) inode->i_state |= I_REFERENCED; } +static void inode_lru_list_del(struct inode *inode) +{ + if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) + this_cpu_dec(nr_unused); +} + /* * Add inode to LRU if needed (inode is unused and clean). * * Needs inode->i_lock held. */ -void inode_add_lru(struct inode *inode) +bool inode_add_lru(struct inode *inode) { - if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC | - I_FREEING | I_WILL_FREE)) && - !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE) - inode_lru_list_add(inode); + if (inode->i_state & + (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE | I_PAGES)) + return false; + if (atomic_read(&inode->i_count)) + return false; + if (!(inode->i_sb->s_flags & SB_ACTIVE)) + return false; + inode_lru_list_add(inode); + return true; } +/** + * inode_pages_set - mark the inode as holding page cache + * @inode: the inode whose first cache page was just added + * + * Tell the VFS that this inode has populated page cache and must not + * be reclaimed by the inode shrinker. + * + * The caller must hold the page lock of the just-added page: by + * pinning the page, the page cache cannot become depopulated, and we + * can safely set I_PAGES without a race check under the i_pages lock. + * + * This function acquires the i_lock. + */ +void inode_pages_set(struct inode *inode) +{ + spin_lock(&inode->i_lock); + if (!(inode->i_state & I_PAGES)) { + inode->i_state |= I_PAGES; + if (!list_empty(&inode->i_lru)) { + count_vm_event(PGINODERESCUE); + inode_lru_list_del(inode); + } + } + spin_unlock(&inode->i_lock); +} -static void inode_lru_list_del(struct inode *inode) +/** + * inode_pages_clear - mark the inode as not holding page cache + * @inode: the inode whose last cache page was just removed + * + * Tell the VFS that the inode no longer holds page cache and that its + * lifetime is to be handed over to the inode shrinker LRU. + * + * This function acquires the i_lock and the i_pages lock. + */ +void inode_pages_clear(struct inode *inode) { + struct address_space *mapping = &inode->i_data; + bool add_to_lru = false; + unsigned long flags; - if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) - this_cpu_dec(nr_unused); + spin_lock(&inode->i_lock); + + xa_lock_irqsave(&mapping->i_pages, flags); + if ((inode->i_state & I_PAGES) && mapping_empty(mapping)) { + inode->i_state &= ~I_PAGES; + add_to_lru = true; + } + xa_unlock_irqrestore(&mapping->i_pages, flags); + + if (add_to_lru) { + WARN_ON_ONCE(!list_empty(&inode->i_lru)); + if (inode_add_lru(inode)) + __count_vm_event(PGINODEDELAYED); + } + + spin_unlock(&inode->i_lock); } /** @@ -742,6 +804,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item, if (!spin_trylock(&inode->i_lock)) return LRU_SKIP; + WARN_ON_ONCE(inode->i_state & I_PAGES); + /* * Referenced or dirty inodes are still in use. Give them another pass * through the LRU as we canot reclaim them now. @@ -761,23 +825,17 @@ static enum lru_status inode_lru_isolate(struct list_head *item, return LRU_ROTATE; } - if (inode_has_buffers(inode) || inode->i_data.nrpages) { - __iget(inode); + /* + * Populated inodes shouldn't be on the shrinker LRU, but they + * can be briefly visible when a new page is added to an inode + * that was already linked but inode_pages_set() hasn't run + * yet to move them off. + */ + if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { + list_lru_isolate(lru, &inode->i_lru); spin_unlock(&inode->i_lock); - spin_unlock(lru_lock); - if (remove_inode_buffers(inode)) { - unsigned long reap; - reap = invalidate_mapping_pages(&inode->i_data, 0, -1); - if (current_is_kswapd()) - __count_vm_events(KSWAPD_INODESTEAL, reap); - else - __count_vm_events(PGINODESTEAL, reap); - if (current->reclaim_state) - current->reclaim_state->reclaimed_slab += reap; - } - iput(inode); - spin_lock(lru_lock); - return LRU_RETRY; + this_cpu_dec(nr_unused); + return LRU_REMOVED; } WARN_ON(inode->i_state & I_NEW); diff --git a/fs/internal.h b/fs/internal.h index f3f280b952a3..4a9dc77e817b 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -139,7 +139,7 @@ extern int vfs_open(const struct path *, struct file *); * inode.c */ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); -extern void inode_add_lru(struct inode *inode); +extern bool inode_add_lru(struct inode *inode); extern int dentry_needs_remove_privs(struct dentry *dentry); /* diff --git a/include/linux/fs.h b/include/linux/fs.h index 3cd4fe6b845e..a98d9dee39f4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -585,6 +585,11 @@ static inline void mapping_allow_writable(struct address_space *mapping) atomic_inc(&mapping->i_mmap_writable); } +static inline bool mapping_empty(struct address_space *mapping) +{ + return mapping->nrpages + mapping->nrexceptional == 0; +} + /* * Use sequence counter to get consistent i_size on 32-bit processors. */ @@ -2150,6 +2155,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, * * I_CREATING New object's inode in the middle of setting up. * + * I_PAGES Inode is holding page cache that needs to get reclaimed + * first before the inode can go onto the shrinker LRU. + * * Q: What is the difference between I_WILL_FREE and I_FREEING? */ #define I_DIRTY_SYNC (1 << 0) @@ -2172,6 +2180,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, #define I_WB_SWITCH (1 << 13) #define I_OVL_INUSE (1 << 14) #define I_CREATING (1 << 15) +#define I_PAGES (1 << 16) #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES) @@ -3097,6 +3106,9 @@ static inline void remove_inode_hash(struct inode *inode) __remove_inode_hash(inode); } +extern void inode_pages_set(struct inode *inode); +extern void inode_pages_clear(struct inode *inode); + extern void inode_sb_list_add(struct inode *inode); #ifdef CONFIG_BLOCK diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index ccb14b6a16b5..ae4d90bd0873 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -609,7 +609,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, int add_to_page_cache_lru(struct page *page, struct address_space *mapping, pgoff_t index, gfp_t gfp_mask); extern void delete_from_page_cache(struct page *page); -extern void __delete_from_page_cache(struct page *page, void *shadow); +extern bool __delete_from_page_cache(struct page *page, void *shadow); int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); void delete_from_page_cache_batch(struct address_space *mapping, struct pagevec *pvec); diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 47a3441cf4c4..f31026ccf590 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -38,7 +38,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_NUMA PGSCAN_ZONE_RECLAIM_FAILED, #endif - PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL, + SLABS_SCANNED, + PGINODESTEAL, KSWAPD_INODESTEAL, PGINODERESCUE, PGINODEDELAYED, KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY, PAGEOUTRUN, PGROTATED, DROP_PAGECACHE, DROP_SLAB, diff --git a/mm/filemap.c b/mm/filemap.c index 1784478270e1..fcc24b3b3540 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -116,8 +116,8 @@ * ->tasklist_lock (memory_failure, collect_procs_ao) */ -static void page_cache_delete(struct address_space *mapping, - struct page *page, void *shadow) +static bool __must_check page_cache_delete(struct address_space *mapping, + struct page *page, void *shadow) { XA_STATE(xas, &mapping->i_pages, page->index); unsigned int nr = 1; @@ -151,6 +151,8 @@ static void page_cache_delete(struct address_space *mapping, smp_wmb(); } mapping->nrpages -= nr; + + return mapping_empty(mapping); } static void unaccount_page_cache_page(struct address_space *mapping, @@ -227,15 +229,18 @@ static void unaccount_page_cache_page(struct address_space *mapping, * Delete a page from the page cache and free it. Caller has to make * sure the page is locked and that nobody else uses it - or that usage * is safe. The caller must hold the i_pages lock. + * + * If this returns true, the caller must call inode_pages_clear() + * after dropping the i_pages lock. */ -void __delete_from_page_cache(struct page *page, void *shadow) +bool __must_check __delete_from_page_cache(struct page *page, void *shadow) { struct address_space *mapping = page->mapping; trace_mm_filemap_delete_from_page_cache(page); unaccount_page_cache_page(mapping, page); - page_cache_delete(mapping, page, shadow); + return page_cache_delete(mapping, page, shadow); } static void page_cache_free_page(struct address_space *mapping, @@ -267,12 +272,16 @@ void delete_from_page_cache(struct page *page) { struct address_space *mapping = page_mapping(page); unsigned long flags; + bool empty; BUG_ON(!PageLocked(page)); xa_lock_irqsave(&mapping->i_pages, flags); - __delete_from_page_cache(page, NULL); + empty = __delete_from_page_cache(page, NULL); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + page_cache_free_page(mapping, page); } EXPORT_SYMBOL(delete_from_page_cache); @@ -291,8 +300,8 @@ EXPORT_SYMBOL(delete_from_page_cache); * * The function expects the i_pages lock to be held. */ -static void page_cache_delete_batch(struct address_space *mapping, - struct pagevec *pvec) +static bool __must_check page_cache_delete_batch(struct address_space *mapping, + struct pagevec *pvec) { XA_STATE(xas, &mapping->i_pages, pvec->pages[0]->index); int total_pages = 0; @@ -337,12 +346,15 @@ static void page_cache_delete_batch(struct address_space *mapping, total_pages++; } mapping->nrpages -= total_pages; + + return mapping_empty(mapping); } void delete_from_page_cache_batch(struct address_space *mapping, struct pagevec *pvec) { int i; + bool empty; unsigned long flags; if (!pagevec_count(pvec)) @@ -354,9 +366,12 @@ void delete_from_page_cache_batch(struct address_space *mapping, unaccount_page_cache_page(mapping, pvec->pages[i]); } - page_cache_delete_batch(mapping, pvec); + empty = page_cache_delete_batch(mapping, pvec); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + for (i = 0; i < pagevec_count(pvec); i++) page_cache_free_page(mapping, pvec->pages[i]); } @@ -831,9 +846,10 @@ static int __add_to_page_cache_locked(struct page *page, void **shadowp) { XA_STATE(xas, &mapping->i_pages, offset); + int error; int huge = PageHuge(page); struct mem_cgroup *memcg; - int error; + bool populated = false; void *old; VM_BUG_ON_PAGE(!PageLocked(page), page); @@ -860,6 +876,7 @@ static int __add_to_page_cache_locked(struct page *page, if (xas_error(&xas)) goto unlock; + populated = mapping_empty(mapping); if (xa_is_value(old)) { mapping->nrexceptional--; if (shadowp) @@ -880,6 +897,10 @@ static int __add_to_page_cache_locked(struct page *page, if (!huge) mem_cgroup_commit_charge(page, memcg, false, false); trace_mm_filemap_add_to_page_cache(page); + + if (populated) + inode_pages_set(mapping->host); + return 0; error: page->mapping = NULL; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b08b199f9a11..8b3e33a52d93 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2535,7 +2535,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, /* Some pages can be beyond i_size: drop them from page cache */ if (head[i].index >= end) { ClearPageDirty(head + i); - __delete_from_page_cache(head + i, NULL); + /* We know we're not removing the last page */ + (void)__delete_from_page_cache(head + i, NULL); if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head)) shmem_uncharge(head->mapping->host, 1); put_page(head + i); diff --git a/mm/truncate.c b/mm/truncate.c index dd9ebc1da356..8fb6c2f762bc 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -31,24 +31,31 @@ * itself locked. These unlocked entries need verification under the tree * lock. */ -static inline void __clear_shadow_entry(struct address_space *mapping, - pgoff_t index, void *entry) +static bool __must_check __clear_shadow_entry(struct address_space *mapping, + pgoff_t index, void *entry) { XA_STATE(xas, &mapping->i_pages, index); xas_set_update(&xas, workingset_update_node); if (xas_load(&xas) != entry) - return; + return 0; xas_store(&xas, NULL); mapping->nrexceptional--; + + return mapping_empty(mapping); } static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, void *entry) { + bool empty; + xa_lock_irq(&mapping->i_pages); - __clear_shadow_entry(mapping, index, entry); + empty = __clear_shadow_entry(mapping, index, entry); xa_unlock_irq(&mapping->i_pages); + + if (empty) + inode_pages_clear(mapping->host); } /* @@ -61,7 +68,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, pgoff_t end) { int i, j; - bool dax, lock; + bool dax, lock, empty = false; /* Handled by shmem itself */ if (shmem_mapping(mapping)) @@ -96,11 +103,16 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, continue; } - __clear_shadow_entry(mapping, index, page); + if (__clear_shadow_entry(mapping, index, page)) + empty = true; } if (lock) xa_unlock_irq(&mapping->i_pages); + + if (empty) + inode_pages_clear(mapping->host); + pvec->nr = j; } @@ -300,7 +312,7 @@ void truncate_inode_pages_range(struct address_space *mapping, pgoff_t index; int i; - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) + if (mapping_empty(mapping)) goto out; /* Offsets within partial pages */ @@ -636,6 +648,7 @@ static int invalidate_complete_page2(struct address_space *mapping, struct page *page) { unsigned long flags; + bool empty; if (page->mapping != mapping) return 0; @@ -648,9 +661,12 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) goto failed; BUG_ON(page_has_private(page)); - __delete_from_page_cache(page, NULL); + empty = __delete_from_page_cache(page, NULL); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + if (mapping->a_ops->freepage) mapping->a_ops->freepage(page); @@ -692,7 +708,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, int ret2 = 0; int did_range_unmap = 0; - if (mapping->nrpages == 0 && mapping->nrexceptional == 0) + if (mapping_empty(mapping)) goto out; pagevec_init(&pvec); diff --git a/mm/vmscan.c b/mm/vmscan.c index c05eb9efec07..c82e9831003f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -901,6 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, } else { void (*freepage)(struct page *); void *shadow = NULL; + int empty; freepage = mapping->a_ops->freepage; /* @@ -922,9 +923,12 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (reclaimed && page_is_file_cache(page) && !mapping_exiting(mapping) && !dax_mapping(mapping)) shadow = workingset_eviction(page, target_memcg); - __delete_from_page_cache(page, shadow); + empty = __delete_from_page_cache(page, shadow); xa_unlock_irqrestore(&mapping->i_pages, flags); + if (empty) + inode_pages_clear(mapping->host); + if (freepage != NULL) freepage(page); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 78d53378db99..ae802253f71c 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1203,9 +1203,11 @@ const char * const vmstat_text[] = { #ifdef CONFIG_NUMA "zone_reclaim_failed", #endif - "pginodesteal", "slabs_scanned", + "pginodesteal", "kswapd_inodesteal", + "pginoderescue", + "pginodedelayed", "kswapd_low_wmark_hit_quickly", "kswapd_high_wmark_hit_quickly", "pageoutrun", diff --git a/mm/workingset.c b/mm/workingset.c index 474186b76ced..7ce9c74ebfdb 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -491,6 +491,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, struct xa_node *node = container_of(item, struct xa_node, private_list); XA_STATE(xas, node->array, 0); struct address_space *mapping; + bool empty = false; int ret; /* @@ -529,6 +530,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, if (WARN_ON_ONCE(node->count != node->nr_values)) goto out_invalid; mapping->nrexceptional -= node->nr_values; + empty = mapping_empty(mapping); xas.xa_node = xa_parent_locked(&mapping->i_pages, node); xas.xa_offset = node->offset; xas.xa_shift = node->shift + XA_CHUNK_SHIFT; @@ -542,6 +544,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, out_invalid: xa_unlock_irq(&mapping->i_pages); + if (empty) + inode_pages_clear(mapping->host); ret = LRU_REMOVED_RETRY; out: cond_resched();
The VFS inode shrinker is currently allowed to reclaim inodes with populated page cache. As a result it can drop gigabytes of hot and active page cache on the floor without consulting the VM (recorded as "inodesteal" events in /proc/vmstat). This causes real problems in practice. Consider for example how the VM would cache a source tree, such as the Linux git tree. As large parts of the checked out files and the object database are accessed repeatedly, the page cache holding this data gets moved to the active list, where it's fully (and indefinitely) insulated from one-off cache moving through the inactive list. However, due to the way users interact with the tree, no ongoing open file descriptors into the source tree are maintained, and the inodes end up on the "unused inode" shrinker LRU. A larger burst of one-off cache (find, updatedb, etc.) can now drive the VFS shrinkers to drop first the dentries and then the inodes - inodes that contain the most valuable data currently held by the page cache - while there is plenty of one-off cache that could be reclaimed instead. This doesn't make sense. The inodes aren't really "unused" as long as the VM deems it worthwhile to hold on to their page cache. And the shrinker can't possibly guess what is and isn't valuable to the VM based on recent inode reference information alone (we could delete several thousand lines of reclaim code if it could). History This behavior of invalidating page cache from the inode shrinker goes back to even before the git import of the kernel tree. It may have been less noticeable when the VM itself didn't have real workingset protection, and floods of one-off cache would push out any active cache over time anyway. But the VM has come a long way since then and the inode shrinker is now actively subverting its caching strategy. As this keeps causing problems for people, there have been several attempts to address this. One recent attempt was to make the inode shrinker simply skip over inodes that still contain pages: a76cf1a474d7 ("mm: don't reclaim inodes with many attached pages"). However, this change had to be reverted in 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many attached pages"") because it caused severe reclaim performance problems: Inodes that sit on the shrinker LRU are attracting reclaim pressure away from the page cache and toward the VFS. If we then permanently exempt sizable portions of this pool from actually getting reclaimed when looked at, this pressure accumulates as deferred shrinker work (a mechanism for *temporarily* unreclaimable objects) until it causes mayhem in the VFS cache pools. In the bug quoted in 69056ee6a8a3 in particular, the excessive pressure drove the XFS shrinker into dirty objects, where it caused synchronous, IO-bound stalls, even as there was plenty of clean page cache that should have been reclaimed instead. Another variant of this problem was recently observed, where the kernel violates cgroups' memory.low protection settings and reclaims page cache way beyond the configured thresholds. It was followed by a proposal of a modified form of the reverted commit above, that implements memory.low-sensitive shrinker skipping over populated inodes on the LRU [1]. However, this proposal continues to run the risk of attracting disproportionate reclaim pressure to a pool of still-used inodes, while not addressing the more generic reclaim inversion problem outside of a very specific cgroup application. [1] https://lore.kernel.org/linux-mm/1578499437-1664-1-git-send-email-laoar.shao@gmail.com/ Solution To fix the reclaim inversion described in the beginning, without reintroducing the problems associated with shrinker LRU rotations, this patch keeps populated inodes off the LRUs entirely. Currently, inodes are kept off the shrinker LRU as long as they have an elevated i_count, indicating an active user. Unfortunately, the page cache cannot simply hold an i_count reference, because unlink() *should* result in the inode being dropped and its cache invalidated. Instead, this patch makes iput_final() consult the state of the page cache and punt the LRU linking to the VM if the inode is still populated; the VM in turn checks the inode state when it depopulates the page cache, and adds the inode to the LRU if necessary. This is not unlike what we do for dirty inodes, which are moved off the LRU permanently until writeback completion puts them back on (iff still unused). We can reuse the same code -- inode_add_lru() - here. This is also not unlike page reclaim, where the lower VM layer has to negotiate state with the higher VFS layer. Follow existing precedence and handle the inversion as much as possible on the VM side: - introduce an I_PAGES flag that the VM maintains under the i_lock, so that any inode code holding that lock can check the page cache state without having to lock and inspect the struct address_space - introduce inode_pages_set() and inode_pages_clear() to maintain the inode LRU state from the VM side, then update all cache mutators to use them when populating the first cache entry or clearing the last With this, the concept of "inodesteal" - where the inode shrinker drops page cache - is a thing of the past. The VM is in charge of the page cache, the inode shrinker is in charge of freeing struct inode. Footnotes - For debuggability, add vmstat counters that track the number of times a new cache entry pulls a previously unused inode off the LRU (pginoderescue), as well as how many times existing cache deferred an LRU addition. Keep the pginodesteal/kswapd_inodesteal counters for backwards compatibility, but they'll just show 0 now. - Fix /proc/sys/vm/drop_caches to drop shadow entries from the page cache. Not doing so has always been a bit strange, but since most people drop cache and metadata cache together, the inode shrinker would have taken care of them before - no more, so do it VM-side. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- fs/block_dev.c | 2 +- fs/dax.c | 14 +++++ fs/drop_caches.c | 2 +- fs/inode.c | 106 ++++++++++++++++++++++++++-------- fs/internal.h | 2 +- include/linux/fs.h | 12 ++++ include/linux/pagemap.h | 2 +- include/linux/vm_event_item.h | 3 +- mm/filemap.c | 39 ++++++++++--- mm/huge_memory.c | 3 +- mm/truncate.c | 34 ++++++++--- mm/vmscan.c | 6 +- mm/vmstat.c | 4 +- mm/workingset.c | 4 ++ 14 files changed, 183 insertions(+), 50 deletions(-)