Message ID | 20210614211904.14420-4-hannes@cmpxchg.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/4] mm: remove irqsave/restore locking from contexts with irqs enabled | expand |
On Mon, 14 Jun 2021 17:19:04 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > Historically (pre-2.5), the inode shrinker used to reclaim only empty > inodes and skip over those that still contained page cache. This > caused problems on highmem hosts: struct inode could put fill lowmem > zones before the cache was getting reclaimed in the highmem zones. > > To address this, the inode shrinker started to strip page cache to > facilitate reclaiming lowmem. However, this comes with its own set of > problems: the shrinkers may drop actively used page cache just because > the inodes are not currently open or dirty - think working with a > large git tree. It further doesn't respect cgroup memory protection > settings and can cause priority inversions between containers. > > Nowadays, the page cache also holds non-resident info for evicted > cache pages in order to detect refaults. We've come to rely heavily on > this data inside reclaim for protecting the cache workingset and > driving swap behavior. We also use it to quantify and report workload > health through psi. The latter in turn is used for fleet health > monitoring, as well as driving automated memory sizing of workloads > and containers, proactive reclaim and memory offloading schemes. > > The consequences of dropping page cache prematurely is that we're > seeing subtle and not-so-subtle failures in all of the above-mentioned > scenarios, with the workload generally entering unexpected thrashing > states while losing the ability to reliably detect it. > > To fix this on non-highmem systems at least, going back to rotating > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 > ("mm: don't reclaim inodes with many attached pages")) and failed > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many > attached pages"")). The issue is mostly that shrinker pools attract > pressure based on their size, and when objects get skipped the > shrinkers remember this as deferred reclaim work. This accumulates > excessive pressure on the remaining inodes, and we can quickly eat > into heavily used ones, or dirty ones that require IO to reclaim, when > there potentially is plenty of cold, clean cache around still. > > Instead, this patch keeps populated inodes off the inode LRU in the > first place - just like an open file or dirty state would. An > otherwise clean and unused inode then gets queued when the last cache > entry disappears. This solves the problem without reintroducing the > reclaim issues, and generally is a bit more scalable than having to > wade through potentially hundreds of thousands of busy inodes. > > Locking is a bit tricky because the locks protecting the inode state > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are > serialized through i_lock, taken before the i_pages lock, to make sure > depopulated inodes are queued reliably. Additions may race with > deletions, but we'll check again in the shrinker. If additions race > with the shrinker itself, we're protected by the i_lock: if > find_inode() or iput() win, the shrinker will bail on the elevated > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the > inode, it will set I_FREEING and inhibit further igets(), which will > cause the other side to create a new instance of the inode instead. > And what hitherto unexpected problems will this one cause, sigh. How exhaustively has this approach been tested?
On Mon, Jun 14, 2021 at 02:59:12PM -0700, Andrew Morton wrote: > On Mon, 14 Jun 2021 17:19:04 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > Historically (pre-2.5), the inode shrinker used to reclaim only empty > > inodes and skip over those that still contained page cache. This > > caused problems on highmem hosts: struct inode could put fill lowmem > > zones before the cache was getting reclaimed in the highmem zones. > > > > To address this, the inode shrinker started to strip page cache to > > facilitate reclaiming lowmem. However, this comes with its own set of > > problems: the shrinkers may drop actively used page cache just because > > the inodes are not currently open or dirty - think working with a > > large git tree. It further doesn't respect cgroup memory protection > > settings and can cause priority inversions between containers. > > > > Nowadays, the page cache also holds non-resident info for evicted > > cache pages in order to detect refaults. We've come to rely heavily on > > this data inside reclaim for protecting the cache workingset and > > driving swap behavior. We also use it to quantify and report workload > > health through psi. The latter in turn is used for fleet health > > monitoring, as well as driving automated memory sizing of workloads > > and containers, proactive reclaim and memory offloading schemes. > > > > The consequences of dropping page cache prematurely is that we're > > seeing subtle and not-so-subtle failures in all of the above-mentioned > > scenarios, with the workload generally entering unexpected thrashing > > states while losing the ability to reliably detect it. > > > > To fix this on non-highmem systems at least, going back to rotating > > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 > > ("mm: don't reclaim inodes with many attached pages")) and failed > > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many > > attached pages"")). The issue is mostly that shrinker pools attract > > pressure based on their size, and when objects get skipped the > > shrinkers remember this as deferred reclaim work. This accumulates > > excessive pressure on the remaining inodes, and we can quickly eat > > into heavily used ones, or dirty ones that require IO to reclaim, when > > there potentially is plenty of cold, clean cache around still. > > > > Instead, this patch keeps populated inodes off the inode LRU in the > > first place - just like an open file or dirty state would. An > > otherwise clean and unused inode then gets queued when the last cache > > entry disappears. This solves the problem without reintroducing the > > reclaim issues, and generally is a bit more scalable than having to > > wade through potentially hundreds of thousands of busy inodes. > > > > Locking is a bit tricky because the locks protecting the inode state > > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the > > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are > > serialized through i_lock, taken before the i_pages lock, to make sure > > depopulated inodes are queued reliably. Additions may race with > > deletions, but we'll check again in the shrinker. If additions race > > with the shrinker itself, we're protected by the i_lock: if > > find_inode() or iput() win, the shrinker will bail on the elevated > > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the > > inode, it will set I_FREEING and inhibit further igets(), which will > > cause the other side to create a new instance of the inode instead. > > > > And what hitherto unexpected problems will this one cause, sigh. Yeah, I wish we could have stuck with simple rotations, but I can see how they can become problematic - even though we haven't been able to reproduce the issue which led to the revert in the FB fleet. > How exhaustively has this approach been tested? This specific patch I've put through various inode / cache loads on my local test rig, drgn-inspected the sb inode lists to check for leaked inodes etc. - they've held up. We've had the previous version (which was more complicated) running on the entire Facebook fleet for a while. I did switch us back to simple rotations in a recent rebase, though, because the one-liner is easier to carry out of tree and (luckily) isn't causing problems for us. It would be good to have a fix upstream that works for everybody, though. The above-mentioned containerization and pressure detection failures aren't theoretical, we can't run upstream in our fleet. We're also trying to publish containerization software that showcases how the different cgroup controllers play together, how automated container sizing works etc, but many of those scenarios are broken in the upstream kernel. A kernel patch dependency is a painful one for a project like this. Tejun can probably provide more context on this.
On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > Historically (pre-2.5), the inode shrinker used to reclaim only empty > inodes and skip over those that still contained page cache. This > caused problems on highmem hosts: struct inode could put fill lowmem > zones before the cache was getting reclaimed in the highmem zones. > > To address this, the inode shrinker started to strip page cache to > facilitate reclaiming lowmem. However, this comes with its own set of > problems: the shrinkers may drop actively used page cache just because > the inodes are not currently open or dirty - think working with a > large git tree. It further doesn't respect cgroup memory protection > settings and can cause priority inversions between containers. > > Nowadays, the page cache also holds non-resident info for evicted > cache pages in order to detect refaults. We've come to rely heavily on > this data inside reclaim for protecting the cache workingset and > driving swap behavior. We also use it to quantify and report workload > health through psi. The latter in turn is used for fleet health > monitoring, as well as driving automated memory sizing of workloads > and containers, proactive reclaim and memory offloading schemes. > > The consequences of dropping page cache prematurely is that we're > seeing subtle and not-so-subtle failures in all of the above-mentioned > scenarios, with the workload generally entering unexpected thrashing > states while losing the ability to reliably detect it. > > To fix this on non-highmem systems at least, going back to rotating > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 > ("mm: don't reclaim inodes with many attached pages")) and failed > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many > attached pages"")). The issue is mostly that shrinker pools attract > pressure based on their size, and when objects get skipped the > shrinkers remember this as deferred reclaim work. This accumulates > excessive pressure on the remaining inodes, and we can quickly eat > into heavily used ones, or dirty ones that require IO to reclaim, when > there potentially is plenty of cold, clean cache around still. > > Instead, this patch keeps populated inodes off the inode LRU in the > first place - just like an open file or dirty state would. An > otherwise clean and unused inode then gets queued when the last cache > entry disappears. This solves the problem without reintroducing the > reclaim issues, and generally is a bit more scalable than having to > wade through potentially hundreds of thousands of busy inodes. > > Locking is a bit tricky because the locks protecting the inode state > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are > serialized through i_lock, taken before the i_pages lock, to make sure > depopulated inodes are queued reliably. Additions may race with > deletions, but we'll check again in the shrinker. If additions race > with the shrinker itself, we're protected by the i_lock: if > find_inode() or iput() win, the shrinker will bail on the elevated > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the > inode, it will set I_FREEING and inhibit further igets(), which will > cause the other side to create a new instance of the inode instead. This makes an awful mess of the inode locking and, to me, is a serious layering violation.... > index e89df447fae3..c9956fac640e 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping) > return xa_empty(&mapping->i_pages); > } > > +/* > + * mapping_shrinkable - test if page cache state allows inode reclaim > + * @mapping: the page cache mapping > + * > + * This checks the mapping's cache state for the pupose of inode > + * reclaim and LRU management. > + * > + * The caller is expected to hold the i_lock, but is not required to > + * hold the i_pages lock, which usually protects cache state. That's > + * because the i_lock and the list_lru lock that protect the inode and > + * its LRU state don't nest inside the irq-safe i_pages lock. > + * > + * Cache deletions are performed under the i_lock, which ensures that > + * when an inode goes empty, it will reliably get queued on the LRU. > + * > + * Cache additions do not acquire the i_lock and may race with this > + * check, in which case we'll report the inode as shrinkable when it > + * has cache pages. This is okay: the shrinker also checks the > + * refcount and the referenced bit, which will be elevated or set in > + * the process of adding new cache pages to an inode. > + */ .... because you're expanding the inode->i_lock to now cover the page cache additions and removal and that massively increases the scope of the i_lock. The ilock is for internal inode serialisation purposes, not serialisation with external subsystems that inodes interact with like mappings. > +static inline bool mapping_shrinkable(struct address_space *mapping) > +{ > + void *head; > + > + /* > + * On highmem systems, there could be lowmem pressure from the > + * inodes before there is highmem pressure from the page > + * cache. Make inodes shrinkable regardless of cache state. > + */ > + if (IS_ENABLED(CONFIG_HIGHMEM)) > + return true; > + > + /* Cache completely empty? Shrink away. */ > + head = rcu_access_pointer(mapping->i_pages.xa_head); > + if (!head) > + return true; > + > + /* > + * The xarray stores single offset-0 entries directly in the > + * head pointer, which allows non-resident page cache entries > + * to escape the shadow shrinker's list of xarray nodes. The > + * inode shrinker needs to pick them up under memory pressure. > + */ > + if (!xa_is_node(head) && xa_is_value(head)) > + return true; > + > + return false; > +} It just occurred to me: mapping_shrinkable() == available for shrinking, not "mapping can be shrunk by having pages freed from it". If this lives, it really needs a better name and API - this isn't applying a check of whether the mapping can be shrunk, it indicates whether the inode hosting the mapping is considered freeable or not. > /* > * Bits in mapping->flags. > */ > diff --git a/mm/filemap.c b/mm/filemap.c > index 819d2589abef..0d0d72ced961 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -260,9 +260,13 @@ void delete_from_page_cache(struct page *page) > struct address_space *mapping = page_mapping(page); > > BUG_ON(!PageLocked(page)); > + spin_lock(&mapping->host->i_lock); > xa_lock_irq(&mapping->i_pages); > __delete_from_page_cache(page, NULL); > xa_unlock_irq(&mapping->i_pages); > + if (mapping_shrinkable(mapping)) > + inode_add_lru(mapping->host); > + spin_unlock(&mapping->host->i_lock); > This, to me is an example of the layering problesm here. No mapping specific function should be using locks that belong to the mapping host for internal mapping serialisation. Especially considering that inode_add_lru() is defined in fs/internal.h - nothing outside the core fs code should really be using inode_add_lru(), just lik enothing outside of fs code should be using the inode->i_lock for anything. This just doesn't seem like a robust, maintainable solution to the problem you're trying to address.... > @@ -73,8 +77,10 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, > return; > > dax = dax_mapping(mapping); > - if (!dax) > + if (!dax) { > + spin_lock(&mapping->host->i_lock); > xa_lock_irq(&mapping->i_pages); > + } > > for (i = j; i < pagevec_count(pvec); i++) { > struct page *page = pvec->pages[i]; > @@ -93,8 +99,12 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, > __clear_shadow_entry(mapping, index, page); > } > > - if (!dax) > + if (!dax) { > xa_unlock_irq(&mapping->i_pages); > + if (mapping_shrinkable(mapping)) > + inode_add_lru(mapping->host); > + spin_unlock(&mapping->host->i_lock); > + } > pvec->nr = j; > } No. Just no. > diff --git a/mm/vmscan.c b/mm/vmscan.c > index cc5d7cd75935..6dd5ef8a11bc 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1055,6 +1055,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > BUG_ON(!PageLocked(page)); > BUG_ON(mapping != page_mapping(page)); > > + if (!PageSwapCache(page)) > + spin_lock(&mapping->host->i_lock); > xa_lock_irq(&mapping->i_pages); > /* > * The non racy check for a busy page. > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > shadow = workingset_eviction(page, target_memcg); > __delete_from_page_cache(page, shadow); > xa_unlock_irq(&mapping->i_pages); > + if (mapping_shrinkable(mapping)) > + inode_add_lru(mapping->host); > + spin_unlock(&mapping->host->i_lock); > No. Inode locks have absolutely no place serialising core vmscan algorithms. Really, all this complexity because: | The issue is mostly that shrinker pools attract pressure based on | their size, and when objects get skipped the shrinkers remember this | as deferred reclaim work. And so you want inodes with reclaimable mappings to avoid being considered deferred work for the inode shrinker. That's what is occuring because we are currently returning LRU_RETRY to them, or would also happen if we returned LRU_SKIP or LRU_ROTATE just to ignore them. However, it's trivial to change inode_lru_isolate() to do something like: if (inode_has_buffers(inode) || inode->i_data.nrpages) { if (!IS_ENABLED(CONFIG_HIGHMEM)) { spin_unlock(&inode->i_lock); return LRU_ROTATE_NODEFER; } ..... } And in __list_lru_walk_one() just add: case LRU_ROTATE_NODEFER: isolated++; /* fallthrough */ case LRU_ROTATE: list_move_tail(item, &l->list); break; And now inodes with active page cache rotated to the tail of the list and are considered to have had work done on them. Hence they don't add to the work accumulation that the shrinker infrastructure defers, and so will allow the page reclaim to do it's stuff with page reclaim before such inodes will get reclaimed. That's *much* simpler than your proposed patch and should get you pretty much the same result. Cheers, Dave.
On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > > Historically (pre-2.5), the inode shrinker used to reclaim only empty > > inodes and skip over those that still contained page cache. This > > caused problems on highmem hosts: struct inode could put fill lowmem > > zones before the cache was getting reclaimed in the highmem zones. > > > > To address this, the inode shrinker started to strip page cache to > > facilitate reclaiming lowmem. However, this comes with its own set of > > problems: the shrinkers may drop actively used page cache just because > > the inodes are not currently open or dirty - think working with a > > large git tree. It further doesn't respect cgroup memory protection > > settings and can cause priority inversions between containers. > > > > Nowadays, the page cache also holds non-resident info for evicted > > cache pages in order to detect refaults. We've come to rely heavily on > > this data inside reclaim for protecting the cache workingset and > > driving swap behavior. We also use it to quantify and report workload > > health through psi. The latter in turn is used for fleet health > > monitoring, as well as driving automated memory sizing of workloads > > and containers, proactive reclaim and memory offloading schemes. > > > > The consequences of dropping page cache prematurely is that we're > > seeing subtle and not-so-subtle failures in all of the above-mentioned > > scenarios, with the workload generally entering unexpected thrashing > > states while losing the ability to reliably detect it. > > > > To fix this on non-highmem systems at least, going back to rotating > > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 > > ("mm: don't reclaim inodes with many attached pages")) and failed > > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many > > attached pages"")). The issue is mostly that shrinker pools attract > > pressure based on their size, and when objects get skipped the > > shrinkers remember this as deferred reclaim work. This accumulates > > excessive pressure on the remaining inodes, and we can quickly eat > > into heavily used ones, or dirty ones that require IO to reclaim, when > > there potentially is plenty of cold, clean cache around still. > > > > Instead, this patch keeps populated inodes off the inode LRU in the > > first place - just like an open file or dirty state would. An > > otherwise clean and unused inode then gets queued when the last cache > > entry disappears. This solves the problem without reintroducing the > > reclaim issues, and generally is a bit more scalable than having to > > wade through potentially hundreds of thousands of busy inodes. > > > > Locking is a bit tricky because the locks protecting the inode state > > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the > > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are > > serialized through i_lock, taken before the i_pages lock, to make sure > > depopulated inodes are queued reliably. Additions may race with > > deletions, but we'll check again in the shrinker. If additions race > > with the shrinker itself, we're protected by the i_lock: if > > find_inode() or iput() win, the shrinker will bail on the elevated > > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the > > inode, it will set I_FREEING and inhibit further igets(), which will > > cause the other side to create a new instance of the inode instead. > > This makes an awful mess of the inode locking and, to me, is a > serious layering violation.... The direction through the layers is no different than set_page_dirty() calling mark_inode_dirty() from under the page lock. This is simply where the state change of the inode originates. What's the benefit of having the code pretend otherwise? > > index e89df447fae3..c9956fac640e 100644 > > --- a/include/linux/pagemap.h > > +++ b/include/linux/pagemap.h > > @@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping) > > return xa_empty(&mapping->i_pages); > > } > > > > +/* > > + * mapping_shrinkable - test if page cache state allows inode reclaim > > + * @mapping: the page cache mapping > > + * > > + * This checks the mapping's cache state for the pupose of inode > > + * reclaim and LRU management. > > + * > > + * The caller is expected to hold the i_lock, but is not required to > > + * hold the i_pages lock, which usually protects cache state. That's > > + * because the i_lock and the list_lru lock that protect the inode and > > + * its LRU state don't nest inside the irq-safe i_pages lock. > > + * > > + * Cache deletions are performed under the i_lock, which ensures that > > + * when an inode goes empty, it will reliably get queued on the LRU. > > + * > > + * Cache additions do not acquire the i_lock and may race with this > > + * check, in which case we'll report the inode as shrinkable when it > > + * has cache pages. This is okay: the shrinker also checks the > > + * refcount and the referenced bit, which will be elevated or set in > > + * the process of adding new cache pages to an inode. > > + */ > > .... because you're expanding the inode->i_lock to now cover the > page cache additions and removal and that massively increases the > scope of the i_lock. The ilock is for internal inode serialisation > purposes, not serialisation with external subsystems that inodes > interact with like mappings. I'm expanding it to cover another state change in the inode that happens to originate in the mapping. Yes, it would have been slightly easier if the inode locks nested under the page cache locks, like they do for dirty state. This way we could have updated the inode from the innermost context where the mapping state changes, which would have been a bit more compact. Unfortunately, the i_pages lock is irq-safe - to deal with writeback completion changing cache state from IRQ context. We'd have to make the i_lock and the list_lru lock irq-safe as well, and I'm not sure that is more desirable than ordering the locks the other way. Doing it this way means annotating a few more deletion entry points. But there are a limited number of ways for pages to leave the page cache - reclaim and truncation - so it's within reason. The i_lock isn't expanded to cover additions, if you take a look at the code. That would have been a much larger surface area indeed. > > +static inline bool mapping_shrinkable(struct address_space *mapping) > > +{ > > + void *head; > > + > > + /* > > + * On highmem systems, there could be lowmem pressure from the > > + * inodes before there is highmem pressure from the page > > + * cache. Make inodes shrinkable regardless of cache state. > > + */ > > + if (IS_ENABLED(CONFIG_HIGHMEM)) > > + return true; > > + > > + /* Cache completely empty? Shrink away. */ > > + head = rcu_access_pointer(mapping->i_pages.xa_head); > > + if (!head) > > + return true; > > + > > + /* > > + * The xarray stores single offset-0 entries directly in the > > + * head pointer, which allows non-resident page cache entries > > + * to escape the shadow shrinker's list of xarray nodes. The > > + * inode shrinker needs to pick them up under memory pressure. > > + */ > > + if (!xa_is_node(head) && xa_is_value(head)) > > + return true; > > + > > + return false; > > +} > > It just occurred to me: mapping_shrinkable() == available for > shrinking, not "mapping can be shrunk by having pages freed from > it". > > If this lives, it really needs a better name and API - this isn't > applying a check of whether the mapping can be shrunk, it indicates > whether the inode hosting the mapping is considered freeable or not. Right, it indicates whether the state of the mapping translates to the inode being eligigble for shrinker reclaim or not. I've used mapping_populated() in a previous version, but this function here captures additional reclaim policy (highmem setups e.g.), so it's not a suitable name for it. This is the best I could come up with, so I'm open to suggestions. > > @@ -260,9 +260,13 @@ void delete_from_page_cache(struct page *page) > > struct address_space *mapping = page_mapping(page); > > > > BUG_ON(!PageLocked(page)); > > + spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > __delete_from_page_cache(page, NULL); > > xa_unlock_irq(&mapping->i_pages); > > + if (mapping_shrinkable(mapping)) > > + inode_add_lru(mapping->host); > > + spin_unlock(&mapping->host->i_lock); > > > > This, to me is an example of the layering problesm here. No mapping > specific function should be using locks that belong to the mapping > host for internal mapping serialisation. > > Especially considering that inode_add_lru() is defined in > fs/internal.h - nothing outside the core fs code should really be > using inode_add_lru(), just lik enothing outside of fs code should > be using the inode->i_lock for anything. Again, it's no different than dirty state propagation. That's simply the direction in which this information flows through the stack. We can encapsulate it all into wrapper/API functions if you prefer, but it doesn't really change the direction in which the information travels, nor the lock scoping. > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index cc5d7cd75935..6dd5ef8a11bc 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1055,6 +1055,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > BUG_ON(!PageLocked(page)); > > BUG_ON(mapping != page_mapping(page)); > > > > + if (!PageSwapCache(page)) > > + spin_lock(&mapping->host->i_lock); > > xa_lock_irq(&mapping->i_pages); > > /* > > * The non racy check for a busy page. > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > shadow = workingset_eviction(page, target_memcg); > > __delete_from_page_cache(page, shadow); > > xa_unlock_irq(&mapping->i_pages); > > + if (mapping_shrinkable(mapping)) > > + inode_add_lru(mapping->host); > > + spin_unlock(&mapping->host->i_lock); > > > > No. Inode locks have absolutely no place serialising core vmscan > algorithms. What if, and hear me out on this one, core vmscan algorithms change the state of the inode? The alternative to propagating this change from where it occurs is simply that the outer layer now has to do stupid things to infer it - essentially needing to busy poll. See below. > Really, all this complexity because: > > | The issue is mostly that shrinker pools attract pressure based on > | their size, and when objects get skipped the shrinkers remember this > | as deferred reclaim work. > > And so you want inodes with reclaimable mappings to avoid being > considered deferred work for the inode shrinker. That's what is > occuring because we are currently returning LRU_RETRY to them, or > would also happen if we returned LRU_SKIP or LRU_ROTATE just to > ignore them. > > However, it's trivial to change inode_lru_isolate() to do something > like: > > if (inode_has_buffers(inode) || inode->i_data.nrpages) { > if (!IS_ENABLED(CONFIG_HIGHMEM)) { > spin_unlock(&inode->i_lock); > return LRU_ROTATE_NODEFER; > } > ..... > } Yeah, that's busy polling. It's simple and inefficient. We can rotate the busy inodes to the end of the list when we see them, but all *new* inodes get placed ahead of them and push them out the back. If you have a large set of populated inodes paired with an ongoing stream of metadata operations creating one-off inodes, you end up continuously shoveling through hundreds of thousand of the same pinned inodes to get to the few reclaimable ones mixed in. Meanwhile, the MM is standing by, going "You know, I could tell you exactly when those actually become reclaimable..." If you think this is the more elegant implementation, simply because it follows a made-up information hierarchy that doesn't actually map to the real world, then I don't know what to tell you. We take i_count and dirty inodes off the list because it's silly to rotate them over and over, when we could just re-add them the moment the pin goes away. It's not a stretch to do the same for populated inodes, which usually make up a much bigger share of the pool. > And in __list_lru_walk_one() just add: > > case LRU_ROTATE_NODEFER: > isolated++; > /* fallthrough */ > case LRU_ROTATE: > list_move_tail(item, &l->list); > break; > > And now inodes with active page cache rotated to the tail of the > list and are considered to have had work done on them. Hence they > don't add to the work accumulation that the shrinker infrastructure > defers, and so will allow the page reclaim to do it's stuff with > page reclaim before such inodes will get reclaimed. > > That's *much* simpler than your proposed patch and should get you > pretty much the same result. It solves the deferred work buildup, but it's absurdly inefficient. So no, unfortunately I disagree with your assessment of this patch and the arguments you've made in support of that. There is precedent for propagating state change from the page cache to the inode, and doing so is vastly more elegant and efficient than having the outer layer abuse its own aging machinery to busy poll that state.
On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote: > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > > > Historically (pre-2.5), the inode shrinker used to reclaim only empty > > > inodes and skip over those that still contained page cache. This > > > caused problems on highmem hosts: struct inode could put fill lowmem > > > zones before the cache was getting reclaimed in the highmem zones. > > > > > > To address this, the inode shrinker started to strip page cache to > > > facilitate reclaiming lowmem. However, this comes with its own set of > > > problems: the shrinkers may drop actively used page cache just because > > > the inodes are not currently open or dirty - think working with a > > > large git tree. It further doesn't respect cgroup memory protection > > > settings and can cause priority inversions between containers. > > > > > > Nowadays, the page cache also holds non-resident info for evicted > > > cache pages in order to detect refaults. We've come to rely heavily on > > > this data inside reclaim for protecting the cache workingset and > > > driving swap behavior. We also use it to quantify and report workload > > > health through psi. The latter in turn is used for fleet health > > > monitoring, as well as driving automated memory sizing of workloads > > > and containers, proactive reclaim and memory offloading schemes. > > > > > > The consequences of dropping page cache prematurely is that we're > > > seeing subtle and not-so-subtle failures in all of the above-mentioned > > > scenarios, with the workload generally entering unexpected thrashing > > > states while losing the ability to reliably detect it. > > > > > > To fix this on non-highmem systems at least, going back to rotating > > > inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 > > > ("mm: don't reclaim inodes with many attached pages")) and failed > > > (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many > > > attached pages"")). The issue is mostly that shrinker pools attract > > > pressure based on their size, and when objects get skipped the > > > shrinkers remember this as deferred reclaim work. This accumulates > > > excessive pressure on the remaining inodes, and we can quickly eat > > > into heavily used ones, or dirty ones that require IO to reclaim, when > > > there potentially is plenty of cold, clean cache around still. > > > > > > Instead, this patch keeps populated inodes off the inode LRU in the > > > first place - just like an open file or dirty state would. An > > > otherwise clean and unused inode then gets queued when the last cache > > > entry disappears. This solves the problem without reintroducing the > > > reclaim issues, and generally is a bit more scalable than having to > > > wade through potentially hundreds of thousands of busy inodes. > > > > > > Locking is a bit tricky because the locks protecting the inode state > > > (i_lock) and the inode LRU (lru_list.lock) don't nest inside the > > > irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are > > > serialized through i_lock, taken before the i_pages lock, to make sure > > > depopulated inodes are queued reliably. Additions may race with > > > deletions, but we'll check again in the shrinker. If additions race > > > with the shrinker itself, we're protected by the i_lock: if > > > find_inode() or iput() win, the shrinker will bail on the elevated > > > i_count or I_REFERENCED; if the shrinker wins and goes ahead with the > > > inode, it will set I_FREEING and inhibit further igets(), which will > > > cause the other side to create a new instance of the inode instead. > > > > This makes an awful mess of the inode locking and, to me, is a > > serious layering violation.... > > The direction through the layers is no different than set_page_dirty() > calling mark_inode_dirty() from under the page lock. > > This is simply where the state change of the inode originates. What's > the benefit of having the code pretend otherwise? > > > > index e89df447fae3..c9956fac640e 100644 > > > --- a/include/linux/pagemap.h > > > +++ b/include/linux/pagemap.h > > > @@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping) > > > return xa_empty(&mapping->i_pages); > > > } > > > > > > +/* > > > + * mapping_shrinkable - test if page cache state allows inode reclaim > > > + * @mapping: the page cache mapping > > > + * > > > + * This checks the mapping's cache state for the pupose of inode > > > + * reclaim and LRU management. > > > + * > > > + * The caller is expected to hold the i_lock, but is not required to > > > + * hold the i_pages lock, which usually protects cache state. That's > > > + * because the i_lock and the list_lru lock that protect the inode and > > > + * its LRU state don't nest inside the irq-safe i_pages lock. > > > + * > > > + * Cache deletions are performed under the i_lock, which ensures that > > > + * when an inode goes empty, it will reliably get queued on the LRU. > > > + * > > > + * Cache additions do not acquire the i_lock and may race with this > > > + * check, in which case we'll report the inode as shrinkable when it > > > + * has cache pages. This is okay: the shrinker also checks the > > > + * refcount and the referenced bit, which will be elevated or set in > > > + * the process of adding new cache pages to an inode. > > > + */ > > > > .... because you're expanding the inode->i_lock to now cover the > > page cache additions and removal and that massively increases the > > scope of the i_lock. The ilock is for internal inode serialisation > > purposes, not serialisation with external subsystems that inodes > > interact with like mappings. > > I'm expanding it to cover another state change in the inode that > happens to originate in the mapping. > > Yes, it would have been slightly easier if the inode locks nested > under the page cache locks, like they do for dirty state. This way we > could have updated the inode from the innermost context where the > mapping state changes, which would have been a bit more compact. > > Unfortunately, the i_pages lock is irq-safe - to deal with writeback > completion changing cache state from IRQ context. We'd have to make > the i_lock and the list_lru lock irq-safe as well, and I'm not sure > that is more desirable than ordering the locks the other way. > > Doing it this way means annotating a few more deletion entry points. > But there are a limited number of ways for pages to leave the page > cache - reclaim and truncation - so it's within reason. > > The i_lock isn't expanded to cover additions, if you take a look at > the code. That would have been a much larger surface area indeed. > > > > +static inline bool mapping_shrinkable(struct address_space *mapping) > > > +{ > > > + void *head; > > > + > > > + /* > > > + * On highmem systems, there could be lowmem pressure from the > > > + * inodes before there is highmem pressure from the page > > > + * cache. Make inodes shrinkable regardless of cache state. > > > + */ > > > + if (IS_ENABLED(CONFIG_HIGHMEM)) > > > + return true; > > > + > > > + /* Cache completely empty? Shrink away. */ > > > + head = rcu_access_pointer(mapping->i_pages.xa_head); > > > + if (!head) > > > + return true; > > > + > > > + /* > > > + * The xarray stores single offset-0 entries directly in the > > > + * head pointer, which allows non-resident page cache entries > > > + * to escape the shadow shrinker's list of xarray nodes. The > > > + * inode shrinker needs to pick them up under memory pressure. > > > + */ > > > + if (!xa_is_node(head) && xa_is_value(head)) > > > + return true; > > > + > > > + return false; > > > +} > > > > It just occurred to me: mapping_shrinkable() == available for > > shrinking, not "mapping can be shrunk by having pages freed from > > it". > > > > If this lives, it really needs a better name and API - this isn't > > applying a check of whether the mapping can be shrunk, it indicates > > whether the inode hosting the mapping is considered freeable or not. > > Right, it indicates whether the state of the mapping translates to the > inode being eligigble for shrinker reclaim or not. > > I've used mapping_populated() in a previous version, but this function > here captures additional reclaim policy (highmem setups e.g.), so it's > not a suitable name for it. > > This is the best I could come up with, so I'm open to suggestions. > > > > @@ -260,9 +260,13 @@ void delete_from_page_cache(struct page *page) > > > struct address_space *mapping = page_mapping(page); > > > > > > BUG_ON(!PageLocked(page)); > > > + spin_lock(&mapping->host->i_lock); > > > xa_lock_irq(&mapping->i_pages); > > > __delete_from_page_cache(page, NULL); > > > xa_unlock_irq(&mapping->i_pages); > > > + if (mapping_shrinkable(mapping)) > > > + inode_add_lru(mapping->host); > > > + spin_unlock(&mapping->host->i_lock); > > > > > > > This, to me is an example of the layering problesm here. No mapping > > specific function should be using locks that belong to the mapping > > host for internal mapping serialisation. > > > > Especially considering that inode_add_lru() is defined in > > fs/internal.h - nothing outside the core fs code should really be > > using inode_add_lru(), just lik enothing outside of fs code should > > be using the inode->i_lock for anything. > > Again, it's no different than dirty state propagation. That's simply > the direction in which this information flows through the stack. > > We can encapsulate it all into wrapper/API functions if you prefer, > but it doesn't really change the direction in which the information > travels, nor the lock scoping. > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index cc5d7cd75935..6dd5ef8a11bc 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -1055,6 +1055,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > > BUG_ON(!PageLocked(page)); > > > BUG_ON(mapping != page_mapping(page)); > > > > > > + if (!PageSwapCache(page)) > > > + spin_lock(&mapping->host->i_lock); > > > xa_lock_irq(&mapping->i_pages); > > > /* > > > * The non racy check for a busy page. > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > > shadow = workingset_eviction(page, target_memcg); > > > __delete_from_page_cache(page, shadow); > > > xa_unlock_irq(&mapping->i_pages); > > > + if (mapping_shrinkable(mapping)) > > > + inode_add_lru(mapping->host); > > > + spin_unlock(&mapping->host->i_lock); > > > > > > > No. Inode locks have absolutely no place serialising core vmscan > > algorithms. > > What if, and hear me out on this one, core vmscan algorithms change > the state of the inode? Then the core vmscan algorithm has a layering violation. vmscan operates on pages and address spaces attached to pages. It should not be peering any deeper into the mapping than what is contained in the mapping. If it needs to go any deeper (i.e. needs host level operations) then that is what the address space operations are for. i.e. if emptying a mapping needs action to be taken at the inode level, then it should call out via operation so that the filesysetm can take appropriate action. We have strong abstractions and layering between different subsystems for good reason. If you can't make what you need work without violating those abstractions, you need to have a compelling reason to do so. You haven't given us one yet. > The alternative to propagating this change from where it occurs is > simply that the outer layer now has to do stupid things to infer it - > essentially needing to busy poll. See below. > > > Really, all this complexity because: > > > > | The issue is mostly that shrinker pools attract pressure based on > > | their size, and when objects get skipped the shrinkers remember this > > | as deferred reclaim work. > > > > And so you want inodes with reclaimable mappings to avoid being > > considered deferred work for the inode shrinker. That's what is > > occuring because we are currently returning LRU_RETRY to them, or > > would also happen if we returned LRU_SKIP or LRU_ROTATE just to > > ignore them. > > > > However, it's trivial to change inode_lru_isolate() to do something > > like: > > > > if (inode_has_buffers(inode) || inode->i_data.nrpages) { > > if (!IS_ENABLED(CONFIG_HIGHMEM)) { > > spin_unlock(&inode->i_lock); > > return LRU_ROTATE_NODEFER; > > } > > ..... > > } > > Yeah, that's busy polling. It's simple and inefficient. That's how the dentry and inode cache shrinkers work - the LRUs are lazily maintained to keep LRU list lock contention out of the fast paths. The trade off for that is some level of inefficiency in managing inodes and dentries that shouldn't be or aren't ideally placed on the LRU. IOWs, we push the decision on whether inodes should be on the LRU or whether they should be reclaimed into the shrinker scan logic, not the fast path lookup logic where inodes are referenced or dirtied or cleaned. IOWs, the biggest problem with increasing the rate at which we move inodes on and off the LRU list is lock contention. It's already a major problem and hence the return values from the LRU code to say whether the inode was added/removed or not to the LRU. By increasing the number inode LRU addition sites by an order of magnitude throughout the mm/ subsystem, we greatly increase the difficulty of managing the inode LRU. We no longer have a nice, predictable set of locations where the inodes are added to and removed from the LRUs, and so now we've got a huge increase in the number of different, unpredictable vectors that can lead to lock contention on the LRUs... > We can rotate the busy inodes to the end of the list when we see them, > but all *new* inodes get placed ahead of them and push them out the > back. If you have a large set of populated inodes paired with an > ongoing stream of metadata operations creating one-off inodes, you end > up continuously shoveling through hundreds of thousand of the same > pinned inodes to get to the few reclaimable ones mixed in. IDGI. What has a huge stream of metadata dirty inodes going through the cache have to do with deciding when to reclaim a data inode with a non-zero page cache residency? I mean, this whole patchset is intended to prevent the inode from being reclaimed until the page cache on the inode has been reclaimed by memory pressure, so what does it matter how often those data inodes are rotated through the LRU? They can't be reclaimed until the page cache frees the pages, so the presence or absence of other inodes on the list is entirely irrelevant. For any workload that invloves streaming inodes through the inode cache, the inode shrinker is going to be scanning through the hundreds of thousands of inodes anyway. Once put in that context, the cost rotating on data inodes with page cache attached is going to be noise, regardless of whether it is inefficient or not. And, really, if the inode cache has so many unreferenced inode with attached page cache attached to them that the size of the inode cache becomes an issue, then the page reclaim algorithms are totally failing to reclaim page cache quickly enough to maintian the desired system cache balance. i.e. We are supposed to be keeping a balance between the inode cache size and the page cache size, and if we can't keep that balance because inode reclaim can't make progress because page cache pinning inodes, then we have a page reclaim problem, not an inode reclaim problem. THere's also the problem of lack of visibility. RIght now, we know that all the unreferenced VFS inodes in the system are on the LRU and that they are generally all accounted for either by active references or walking the LRU. This change now creates a situation when inodes with page cache attached are not actively references but now also can't be found by walking the LRU list. i.e. we have inodes that are unaccounted for by various statistics. How do we know how many inodes are pinned by the page cache? We can't see them via inode shrinker count tracing, we can't see them via looking up all the active references to the inode (because there are none), and so on. Essentially we have nothing tracking these inodes at all at this point, so if we ever miss a call in the mm/ to add the inode to the LRU that you've sprinked throughout the mm/ code, we essentially leak the inode. It's hard enough tracking down bugs that result in "VFS: busy inodes after unmount" errors without actively adding infrastructure that intentionally makes inodes disappear from tracking. > Meanwhile, the MM is standing by, going "You know, I could tell you > exactly when those actually become reclaimable..." > > If you think this is the more elegant implementation, simply because > it follows a made-up information hierarchy that doesn't actually map > to the real world, then I don't know what to tell you. > > We take i_count and dirty inodes off the list because it's silly to > rotate them over and over, when we could just re-add them the moment > the pin goes away. It's not a stretch to do the same for populated > inodes, which usually make up a much bigger share of the pool. We don't take dirty inodes off the LRU. We only take referenced inodes off the LRU in the shrinker isolation function because we can guarantee that when the inode reference is dropped via iput() the inode will be added back to the LRU if it needs to be placed there. This is all all internal to the inode cache functionality (fs/inode.c) and that's the way it should remain because anything else will rapidly become unmaintainable. > > And in __list_lru_walk_one() just add: > > > > case LRU_ROTATE_NODEFER: > > isolated++; > > /* fallthrough */ > > case LRU_ROTATE: > > list_move_tail(item, &l->list); > > break; > > > > And now inodes with active page cache rotated to the tail of the > > list and are considered to have had work done on them. Hence they > > don't add to the work accumulation that the shrinker infrastructure > > defers, and so will allow the page reclaim to do it's stuff with > > page reclaim before such inodes will get reclaimed. > > > > That's *much* simpler than your proposed patch and should get you > > pretty much the same result. > > It solves the deferred work buildup, but it's absurdly inefficient. So you keep saying. Show us the numbers. Show us that it's so inefficient that it's completely unworkable. _You_ need to justify why violating modularity and layering is the only viable solution to this problem. Given that there is an alternative simple, straight forward solution to the problem, it's on you to prove it is insufficient to solve your issues. I'm sceptical that the complexity is necessary given that in general workloads, the inode shrinker doesn't even register in kernel profiles and that the problem being avoided generally isn't even hit in most workloads. IOWs, I'll take a simple but inefficient solution for avoiding a corner case behaviour over a solution that is complex, fragile and full of layering violations any day of the weeks. Whether you disagree with me or not is largely irrelevant - nothing you are saying changes the fact that this patchset makes an unmanageable mess of internal filesystem locks and APIs and that's not simply acceptible. There's a simple way to acheive what you want, so do it that way first, then we can work out if further improvement is necessary. Cheers, Dave.
On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote: > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote: > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > > > shadow = workingset_eviction(page, target_memcg); > > > > __delete_from_page_cache(page, shadow); > > > > xa_unlock_irq(&mapping->i_pages); > > > > + if (mapping_shrinkable(mapping)) > > > > + inode_add_lru(mapping->host); > > > > + spin_unlock(&mapping->host->i_lock); > > > > > > > > > > No. Inode locks have absolutely no place serialising core vmscan > > > algorithms. > > > > What if, and hear me out on this one, core vmscan algorithms change > > the state of the inode? > > Then the core vmscan algorithm has a layering violation. You're just playing a word game here. If we want the inode reclaimability to be dependent on the state of the page cache, then reclaim and truncation change the state of the inode. It's really that simple. We can recognize that fact in the code, or we can be obtuse about it. > > The alternative to propagating this change from where it occurs is > > simply that the outer layer now has to do stupid things to infer it - > > essentially needing to busy poll. See below. > > > > > Really, all this complexity because: > > > > > > | The issue is mostly that shrinker pools attract pressure based on > > > | their size, and when objects get skipped the shrinkers remember this > > > | as deferred reclaim work. > > > > > > And so you want inodes with reclaimable mappings to avoid being > > > considered deferred work for the inode shrinker. That's what is > > > occuring because we are currently returning LRU_RETRY to them, or > > > would also happen if we returned LRU_SKIP or LRU_ROTATE just to > > > ignore them. > > > > > > However, it's trivial to change inode_lru_isolate() to do something > > > like: > > > > > > if (inode_has_buffers(inode) || inode->i_data.nrpages) { > > > if (!IS_ENABLED(CONFIG_HIGHMEM)) { > > > spin_unlock(&inode->i_lock); > > > return LRU_ROTATE_NODEFER; > > > } > > > ..... > > > } > > > > Yeah, that's busy polling. It's simple and inefficient. > > That's how the dentry and inode cache shrinkers work - the LRUs are > lazily maintained to keep LRU list lock contention out of the fast > paths. The trade off for that is some level of inefficiency in > managing inodes and dentries that shouldn't be or aren't ideally > placed on the LRU. IOWs, we push the decision on whether inodes > should be on the LRU or whether they should be reclaimed into the > shrinker scan logic, not the fast path lookup logic where inodes are > referenced or dirtied or cleaned. That's just not true. The way the inode shrinker works is that all indefinite pins that make an inode unreclaimable for reasons other than its age are kept off the LRU. Yes, we're lazy about removing them if pins are established after the object is already queued. But we don't queue objects with that state, and sites that clear this pin update the LRU. That's true when the inode is pinned by a refcount or by the dirty state. I'm proposing to handle the page cache state in exactly the same fashion. It's you who is arguing we should deal with those particular pins differently than how we deal with all others, and I find it difficult to follow your reasoning for that. > IOWs, the biggest problem with increasing the rate at which we move > inodes on and off the LRU list is lock contention. It's already a > major problem and hence the return values from the LRU code to say > whether the inode was added/removed or not to the LRU. > > By increasing the number inode LRU addition sites by an order of > magnitude throughout the mm/ subsystem, we greatly increase the > difficulty of managing the inode LRU. We no longer have a nice, > predictable set of locations where the inodes are added to and > removed from the LRUs, and so now we've got a huge increase in the > number of different, unpredictable vectors that can lead to lock > contention on the LRUs... Rotating the same unreclaimable inodes over and over is BETTER for lock/LRU contention than adding them once at the end of their life? > > We can rotate the busy inodes to the end of the list when we see them, > > but all *new* inodes get placed ahead of them and push them out the > > back. If you have a large set of populated inodes paired with an > > ongoing stream of metadata operations creating one-off inodes, you end > > up continuously shoveling through hundreds of thousand of the same > > pinned inodes to get to the few reclaimable ones mixed in. > > IDGI. > > What has a huge stream of metadata dirty inodes going through the > cache have to do with deciding when to reclaim a data inode with a > non-zero page cache residency? It's simply about the work it needs to perform in order to find reclaimable nodes in the presence of a large number of permanently unreclaimable ones. > I mean, this whole patchset is intended to prevent the inode from > being reclaimed until the page cache on the inode has been reclaimed > by memory pressure, so what does it matter how often those data inodes > are rotated through the LRU? They can't be reclaimed until the page > cache frees the pages, so the presence or absence of other inodes on > the list is entirely irrelevant. > > For any workload that invloves streaming inodes through the inode > cache, the inode shrinker is going to be scanning through the > hundreds of thousands of inodes anyway. Once put in that context, > the cost rotating on data inodes with page cache attached is going > to be noise, regardless of whether it is inefficient or not. How do you figure that? If you have 1,000 inodes pinned by page cache, and then start streaming, you have to rotate 1,000 inodes before you get to the first reclaimable one. It depends on memory pressure how many streaming inodes you were able to add before the inode pool is capped and reclaim runs again. At least in theory, the worst case is having to rotate every unreclaimable inode for every one you can reclaim. Whether this is likely to happen in practice or not, I don't understand why we should special case page cache pins from existing pins and build a pathological cornercase into the shrinker behavior. > And, really, if the inode cache has so many unreferenced inode with > attached page cache attached to them that the size of the inode > cache becomes an issue, then the page reclaim algorithms are totally > failing to reclaim page cache quickly enough to maintian the desired > system cache balance. i.e. We are supposed to be keeping a balance > between the inode cache size and the page cache size, and if we > can't keep that balance because inode reclaim can't make progress > because page cache pinning inodes, then we have a page reclaim > problem, not an inode reclaim problem. We may never reclaim that page cache if we have no reason to do so. Say you have a git tree fully cached, and now somebody runs grep on a large subdirectory with lots of files. We don't kick out the git tree cache for stuff that is only touched once - reclaim goes directly for the cache coming off the grep. But the shrinker needs to first rotate through the entire git tree to get to the grep inodes. Hopefully by the time it gets to them, their page cache will have gone. If not, we'll rotate them and go through the entire git tree inodes again. What's the point in doing this when we know the exact point in time where it makes sense to start scanning the git tree inodes? Numbers talk, I get it, and I'll try to get some benchmarking done next week. But do you really need numbers to understand how poorly this algorithm performs? The strongest argument you can make then is that the algorithm simply never really matters. And that seems like a stretch. > THere's also the problem of lack of visibility. RIght now, we know > that all the unreferenced VFS inodes in the system are on the LRU > and that they are generally all accounted for either by active > references or walking the LRU. This change now creates a situation > when inodes with page cache attached are not actively references but > now also can't be found by walking the LRU list. i.e. we have inodes > that are unaccounted for by various statistics. How do we know how > many inodes are pinned by the page cache? We can't see them via > inode shrinker count tracing, we can't see them via looking up all > the active references to the inode (because there are none), and so > on. Can you elaborate on which statistics you are referring to? I can see nr_inodes and nr_unused, but I don't see the delta being broken down further into open and dirty - and if dirty whether it's the inode that's dirty, or the page cache. So it appears we already don't know how many inodes can be pinned today by the page cache. From what I can see there seems to be a general lack of insight into the state of the icache that may need a more general solution that is outside the scope of my patch here. > Essentially we have nothing tracking these inodes at all at this > point, so if we ever miss a call in the mm/ to add the inode to the > LRU that you've sprinked throughout the mm/ code, we essentially > leak the inode. It's hard enough tracking down bugs that result in > "VFS: busy inodes after unmount" errors without actively adding > infrastructure that intentionally makes inodes disappear from > tracking. Well yes, but that's no different than missing an iput(). And there is a lot more of those than there are page cache deletion sides. > > Meanwhile, the MM is standing by, going "You know, I could tell you > > exactly when those actually become reclaimable..." > > > > If you think this is the more elegant implementation, simply because > > it follows a made-up information hierarchy that doesn't actually map > > to the real world, then I don't know what to tell you. > > > > We take i_count and dirty inodes off the list because it's silly to > > rotate them over and over, when we could just re-add them the moment > > the pin goes away. It's not a stretch to do the same for populated > > inodes, which usually make up a much bigger share of the pool. > > We don't take dirty inodes off the LRU. We only take referenced > inodes off the LRU in the shrinker isolation function because we can > guarantee that when the inode reference is dropped via iput() the > inode will be added back to the LRU if it needs to be placed there. > > This is all all internal to the inode cache functionality > (fs/inode.c) and that's the way it should remain because anything > else will rapidly become unmaintainable. You don't seem to understand how this code actually works - this is patently false. I think this conversation could be more productive if you spent some time catching up on that first.
On Wed, Jun 16, 2021 at 12:54:15AM -0400, Johannes Weiner wrote: > On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote: > > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote: > > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > > > > shadow = workingset_eviction(page, target_memcg); > > > > > __delete_from_page_cache(page, shadow); > > > > > xa_unlock_irq(&mapping->i_pages); > > > > > + if (mapping_shrinkable(mapping)) > > > > > + inode_add_lru(mapping->host); > > > > > + spin_unlock(&mapping->host->i_lock); > > > > > > > > > > > > > No. Inode locks have absolutely no place serialising core vmscan > > > > algorithms. > > > > > > What if, and hear me out on this one, core vmscan algorithms change > > > the state of the inode? > > > > Then the core vmscan algorithm has a layering violation. > > You're just playing a word game here. No, I've given you plenty of constructive justification and ways to restructure your patches to acheive what you say needs to be done. You're the one that is rejecting any proposal I make outright and making unjustified claims that "I don't understand this code". Serious, Johannes, I understand how the inode cache shrinkers work better than most people out there, and for you to just reject my comments and suggestions outright with "you don't understand this stuff so go learn about it' is pretty obnoxious and toxic behaviour. I haven't disagreed at all with what you are trying to do, nor do I think that being more selective about how we track inodes on the LRUs is a bad thing. What I'm commenting on is that the proposed changes are *really bad code*. If you can work out a *clean* way to move inodes onto the LRU when they are dirty then I'm all for it. But sprinkling inode->i_lock all over the mm/ subsystem and then adding seemling randomly placed inode lru manipulations isn't the way to do it. You should consider centralising all the work involved marking a mapping clean somewhere inside the mm/ code. Then add a single callout that does the inode LRU work, similar to how the fs-writeback.c code does it when the inode is marked clean by inode_sync_complete(). IOWs, if you can't accept that there are problems with your approach nor accept that it needs to be improved because *I* said there are problems (and it seems that you mostly only react this way when *I* happen to say "needs improvement"), then you need to go take a good hard look at yourself. Fix up your code so that it is acceptible to reviewers - telling reviewers they don't know shit is a sure way to piss people off unnecesarily. It doesn't make you look very good at all, and it makes other people think twice about wanting to review your code or work with you in future. So, again, until you see fit to be stop being obnoxious and toxic and that your patches need work before they are acceptible for merging, I maintain my NACK on this patch as it standsi and I am complete done with this discusion. -Dave.
On Wed, 16 Jun 2021 00:54:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote: > > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote: > > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > > > > shadow = workingset_eviction(page, target_memcg); > > > > > __delete_from_page_cache(page, shadow); > > > > > xa_unlock_irq(&mapping->i_pages); > > > > > + if (mapping_shrinkable(mapping)) > > > > > + inode_add_lru(mapping->host); > > > > > + spin_unlock(&mapping->host->i_lock); > > > > > > > > > > > > > No. Inode locks have absolutely no place serialising core vmscan > > > > algorithms. > > > > > > What if, and hear me out on this one, core vmscan algorithms change > > > the state of the inode? > > > > Then the core vmscan algorithm has a layering violation. > > You're just playing a word game here. Don't think so. David is quite correct in saying that vmscan shouldn't mess with inode state unless it's via address_space_operations?
On Thu, Jun 17, 2021 at 10:49:42AM +1000, Dave Chinner wrote: > On Wed, Jun 16, 2021 at 12:54:15AM -0400, Johannes Weiner wrote: > > On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote: > > > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote: > > > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > > > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > > > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > > > > > shadow = workingset_eviction(page, target_memcg); > > > > > > __delete_from_page_cache(page, shadow); > > > > > > xa_unlock_irq(&mapping->i_pages); > > > > > > + if (mapping_shrinkable(mapping)) > > > > > > + inode_add_lru(mapping->host); > > > > > > + spin_unlock(&mapping->host->i_lock); > > > > > > > > > > > > > > > > No. Inode locks have absolutely no place serialising core vmscan > > > > > algorithms. > > > > > > > > What if, and hear me out on this one, core vmscan algorithms change > > > > the state of the inode? > > > > > > Then the core vmscan algorithm has a layering violation. > > > > You're just playing a word game here. > > No, I've given you plenty of constructive justification and ways to > restructure your patches to acheive what you say needs to be done. > > You're the one that is rejecting any proposal I make outright and > making unjustified claims that "I don't understand this code". Hey, come on now. The argument I was making is that page cache state is already used to update the inode LRU, and you incorrectly claimed that this wasn't the case. My statement was a direct response to this impasse, not a way to weasel out of your feedback. > I haven't disagreed at all with what you are trying to do, nor do I > think that being more selective about how we track inodes on the > LRUs is a bad thing. That's what it sounded like to me, but I'll chalk that up as a misunderstanding then. > What I'm commenting on is that the proposed changes are *really bad > code*. I'm not in love with it either, I can tell you that. But it also depends on the alternatives. I don't want to fix one bug and introduce a scalability issue. Or reintroduce subtle unforeseen shrinker issues that had you revert the previous fix. A revert, I might add, that could have been the incremental fix you proposed here. Instead you glossed over Roman's rootcausing and eintroduced the original bug. Now we're here, almost three years later, still on square one. So yeah, my priority is to get the behavior right first, and then worry about architectural beauty within those constraints. > If you can work out a *clean* way to move inodes onto the LRU when > they are dirty then I'm all for it. But sprinkling inode->i_lock all > over the mm/ subsystem and then adding seemling randomly placed > inode lru manipulations isn't the way to do it. > > You should consider centralising all the work involved marking a > mapping clean somewhere inside the mm/ code. Then add a single > callout that does the inode LRU work, similar to how the > fs-writeback.c code does it when the inode is marked clean by > inode_sync_complete(). Yes, I'd prefer that as well. Let's look at the options. The main source of complication is that the page cache cannot hold a direct reference on the inode; holding the xa_lock or the i_lock is the only thing that keeps the inode alive after we removed the page. So our options are either overlapping the lock sections, or taking the rcu readlock on the page cache side to bridge the deletion and the inode callback - which then has to deal with the possibility that the inode may have already been destroyed by the time it's called. I would put the RCU thing aside for now as it sounds just a bit too hairy, and too low-level an insight into the inode lifetime from the mapping side. The xa_lock is also dropped in several outer entry functions, so while it would clean up the fs side a bit, we wouldn't reduce the blast radius on the MM side. When we overlap lock sections, there are two options: a) This patch, with the page cache lock nesting inside the i_lock. Or, b) the way we handle dirty state: When we call set_page_dirty() -> mark_inode_dirty(), we hold the lock that serializes the page cache state when locking and updating the inode state. The hierarchy is: lock_page(page) # MM spin_lock(&inode->i_lock) # FS The equivalent for this patch would be to have page_cache_delete() call mark_inode_empty() (or whatever name works for everybody), again under the lock that serializes the page cache state: xa_lock_irq(&mapping->i_pages) # MM spin_lock(&inode->i_lock) # FS There would be one central place calling into the fs with an API function, encapsulating i_lock handling in fs/inode.c. Great. The major caveat here is that i_lock would need to become IRQ safe in order to nest inside the xa_lock. It's not that the semantical layering of code here is new in any way, it's simply the lock type. As far as I can see, i_lock hold times are quite short - it's a spinlock after all. But I haven't reviewed all the sites yet, and there are a lot of them. They would all need to be updated. Likewise, list_lru locking needs to be made irq-safe. However, irqsafe spinlock is sort of the inevitable fate of any lock embedded in a data structure API. So I'm less concerned about that. AFAICS nothing else nests under i_lock. If FS folks are fine with that, I would give that conversion a shot. Lock type dependency aside, this would retain full modularity and a clear delineation between mapping and inode property. It would also be a fully locked scheme, so none of the subtleties of the current patch. The end result seems clean and maintanable.
On Wed, Jun 16, 2021 at 06:30:43PM -0700, Andrew Morton wrote: > On Wed, 16 Jun 2021 00:54:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote: > > > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote: > > > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > > > > > On Mon, Jun 14, 2021 at 05:19:04PM -0400, Johannes Weiner wrote: > > > > > > @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, > > > > > > shadow = workingset_eviction(page, target_memcg); > > > > > > __delete_from_page_cache(page, shadow); > > > > > > xa_unlock_irq(&mapping->i_pages); > > > > > > + if (mapping_shrinkable(mapping)) > > > > > > + inode_add_lru(mapping->host); > > > > > > + spin_unlock(&mapping->host->i_lock); > > > > > > > > > > > > > > > > No. Inode locks have absolutely no place serialising core vmscan > > > > > algorithms. > > > > > > > > What if, and hear me out on this one, core vmscan algorithms change > > > > the state of the inode? > > > > > > Then the core vmscan algorithm has a layering violation. > > > > You're just playing a word game here. > > Don't think so. David is quite correct in saying that vmscan shouldn't > mess with inode state unless it's via address_space_operations? It seemed to me the complaint was more about vmscan propagating this state into the inode in general - effecting fs inode acquisitions and LRU manipulations from the page reclaim callstack - regardless of whether they are open-coded or indirect through API functions? Since I mentioned better encapsulation but received no response...
On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote: > On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote: > > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote: > > > And in __list_lru_walk_one() just add: > > > > > > case LRU_ROTATE_NODEFER: > > > isolated++; > > > /* fallthrough */ > > > case LRU_ROTATE: > > > list_move_tail(item, &l->list); > > > break; > > > > > > And now inodes with active page cache rotated to the tail of the > > > list and are considered to have had work done on them. Hence they > > > don't add to the work accumulation that the shrinker infrastructure > > > defers, and so will allow the page reclaim to do it's stuff with > > > page reclaim before such inodes will get reclaimed. > > > > > > That's *much* simpler than your proposed patch and should get you > > > pretty much the same result. > > > > It solves the deferred work buildup, but it's absurdly inefficient. > > So you keep saying. Show us the numbers. Show us that it's so > inefficient that it's completely unworkable. _You_ need to justify > why violating modularity and layering is the only viable solution to > this problem. Given that there is an alternative simple, straight > forward solution to the problem, it's on you to prove it is > insufficient to solve your issues. > > I'm sceptical that the complexity is necessary given that in general > workloads, the inode shrinker doesn't even register in kernel > profiles and that the problem being avoided generally isn't even hit > in most workloads. IOWs, I'll take a simple but inefficient solution > for avoiding a corner case behaviour over a solution that is > complex, fragile and full of layering violations any day of the > weeks. I spent time last week benchmarking both implementations with various combinations of icache and page cache size proportions. You're right that most workloads don't care. But there are workloads that do, and for them the behavior can become pathological during drop-behind reclaim. Page cache reclaim has two modes: 1. Workingset transitions where we flush out the old as quickly as possible and 2. Streaming buffered IO that doesn't benefit from caching, and so gets confined to the smallest possible amount of memory without touching active pages. During 1. we may rotate busy inodes a few times until their page cache disappears. This isn't great, but at least temporary. The issue is 2. We may do drop-behind reclaim for extended periods of time, during which the cache workingset remains completely untouched and the corresponding inodes never become eligible for freeing. Rotating them over and over represents a continuous parasitic drag on reclaim. Depending on the proportions between the icache and the inactive cache list, this drag can make up a sizable portion or even the majority of overall CPU consumed by reclaim. (And if you recall the discussion around RWF_UNCACHED, dropbehind reclaim is already bottlenecked on CPU with decent IO devices.) My test is doing drop-behind reclaim while most memory is filled with a cache workingset that is held by an increasing number of inodes. The first number here is the inodes, the second is the active pages held by each: 1,000 * 3072 pages: 0.39% 0.05% kswapd0 [kernel.kallsyms] [k] shrink_slab 10,000 * 307 pages: 0.39% 0.04% kswapd0 [kernel.kallsyms] [k] shrink_slab 100,000 * 32 pages: 1.29% 0.05% kswapd0 [kernel.kallsyms] [k] shrink_slab 500,000 * 6 pages: 11.36% 0.08% kswapd0 [kernel.kallsyms] [k] shrink_slab 1,000,000 * 3 pages: 26.40% 0.04% kswapd0 [kernel.kallsyms] [k] shrink_slab 1,500,000 * 2 pages: 42.97% 0.00% kswapd0 [kernel.kallsyms] [k] shrink_slab 3,000,000 * 1 page: 45.22% 0.00% kswapd0 [kernel.kallsyms] [k] shrink_slab As we get into higher inode counts, the shrinkers end up burning most of the reclaim cycles to rotate workingset inodes. For perspective, with 3 million inodes, when the shrinkers eat 45% of the cycles to busypoll the workingset inodes, page reclaim only consumes about 10% to actually make forward progress. IMO it goes from suboptimal to being a problem somewhere between 100k and 500k in this table. That's not *that* many inodes - I'm counting ~74k files in my linux git tree alone. North of 500k, it becomes pathological. That's probably less common, but it happens in the real world. I checked the file servers that host our internal source code trees. They have 16 times the memory of my test box, but they routinely deal with 50 million+ inodes. I think the additional complexity of updating the inode LRU according to cache population state is justified in order to avoid these pathological cornercases.
On Fri, Jun 18, 2021 at 12:45:18PM -0400, Johannes Weiner wrote: > On Thu, Jun 17, 2021 at 10:49:42AM +1000, Dave Chinner wrote: > > If you can work out a *clean* way to move inodes onto the LRU when > > they are dirty then I'm all for it. But sprinkling inode->i_lock all > > over the mm/ subsystem and then adding seemling randomly placed > > inode lru manipulations isn't the way to do it. > > > > You should consider centralising all the work involved marking a > > mapping clean somewhere inside the mm/ code. Then add a single > > callout that does the inode LRU work, similar to how the > > fs-writeback.c code does it when the inode is marked clean by > > inode_sync_complete(). > > Yes, I'd prefer that as well. Let's look at the options. > > The main source of complication is that the page cache cannot hold a > direct reference on the inode; holding the xa_lock or the i_lock is > the only thing that keeps the inode alive after we removed the page. > > So our options are either overlapping the lock sections, or taking the > rcu readlock on the page cache side to bridge the deletion and the > inode callback - which then has to deal with the possibility that the > inode may have already been destroyed by the time it's called. > > I would put the RCU thing aside for now as it sounds just a bit too > hairy, and too low-level an insight into the inode lifetime from the > mapping side. The xa_lock is also dropped in several outer entry > functions, so while it would clean up the fs side a bit, we wouldn't > reduce the blast radius on the MM side. > > When we overlap lock sections, there are two options: > > a) This patch, with the page cache lock nesting inside the i_lock. Or, > > b) the way we handle dirty state: When we call set_page_dirty() -> > mark_inode_dirty(), we hold the lock that serializes the page cache > state when locking and updating the inode state. The hierarchy is: > > lock_page(page) # MM > spin_lock(&inode->i_lock) # FS > > The equivalent for this patch would be to have page_cache_delete() > call mark_inode_empty() (or whatever name works for everybody), > again under the lock that serializes the page cache state: > > xa_lock_irq(&mapping->i_pages) # MM > spin_lock(&inode->i_lock) # FS > > There would be one central place calling into the fs with an API > function, encapsulating i_lock handling in fs/inode.c. > > Great. > > The major caveat here is that i_lock would need to become IRQ safe > in order to nest inside the xa_lock. It's not that the semantical > layering of code here is new in any way, it's simply the lock type. > > As far as I can see, i_lock hold times are quite short - it's a > spinlock after all. But I haven't reviewed all the sites yet, and > there are a lot of them. They would all need to be updated. > > Likewise, list_lru locking needs to be made irq-safe. However, > irqsafe spinlock is sort of the inevitable fate of any lock > embedded in a data structure API. So I'm less concerned about that. > > AFAICS nothing else nests under i_lock. > > If FS folks are fine with that, I would give that conversion a > shot. Lock type dependency aside, this would retain full modularity > and a clear delineation between mapping and inode property. It would > also be a fully locked scheme, so none of the subtleties of the > current patch. The end result seems clean and maintanable. I spent some time auditing i_lock and I don't think this is workable in its current scope. The irq disabling would be sprawling, and also require things like d_lock to become irq safe. That said, it might be possible to split the i_lock itself. AFAIU it is used to serialize several independent operations. For example, it seems i_size, i_blocks, i_nlink, i_dentry, i_private, i_*time updates don't need to serialize against changes to i_state or inode destruction beyond just holding a reference, and could presumably use separate locks. Splitting that out could be a good thing in itself. If the lifetime and LRU management of the inode can be moved to a new lock, we could make those operations irq safe with little collateral damage. And then the below cleanup to the current patch would be possible, which would consolidate the MM side somewhat. I don't mind pursuing this if people agree it's a good idea. That said, this is a fairly invasive, non-trivial architectural overhaul, which I don't think is a reasonable prerequisites for addressing a real world problem we have today. I hope we can go ahead with this fix for now and tag on the architectural rework - instead of the other way around, and leaving memory management broken until I get more familiar with vfs locking, let alone stabilizing those changes... Here is the follow-up cleanup this would enable (would obviously deadlock in the current version of i_lock). Honestly a little underwhelming if you ask me, given the size of the prep work... --- fs/inode.c | 7 +++++++ fs/internal.h | 1 + include/linux/fs.h | 3 ++- include/linux/pagemap.h | 13 +------------ mm/filemap.c | 14 ++++++-------- mm/truncate.c | 22 +++++----------------- mm/vmscan.c | 7 ------- mm/workingset.c | 12 ++---------- 8 files changed, 24 insertions(+), 55 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 6b74701c1954..18c0554148a3 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -457,6 +457,13 @@ static void inode_lru_list_del(struct inode *inode) this_cpu_dec(nr_unused); } +void mark_inode_shrinkable(struct inode *inode) +{ + spin_lock(&inode->i_lock); + inode_add_lru(inode); + spin_unlock(&inode->i_lock); +} + /** * inode_sb_list_add - add inode to the superblock list of inodes * @inode: inode to add diff --git a/fs/internal.h b/fs/internal.h index 3eb90dde62bd..6aeae7ef3380 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -146,6 +146,7 @@ extern int vfs_open(const struct path *, struct file *); * inode.c */ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); +extern void inode_add_lru(struct inode *inode); extern int dentry_needs_remove_privs(struct dentry *dentry); /* diff --git a/include/linux/fs.h b/include/linux/fs.h index 301cd0195036..a5fb073f54c2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2409,6 +2409,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode) __mark_inode_dirty(inode, I_DIRTY_SYNC); } +extern void mark_inode_shrinkable(struct inode *inode); + /* * Returns true if the given inode itself only has dirty timestamps (its pages * may still be dirty) and isn't currently being allocated or freed. @@ -3214,7 +3216,6 @@ static inline void remove_inode_hash(struct inode *inode) } extern void inode_sb_list_add(struct inode *inode); -extern void inode_add_lru(struct inode *inode); extern int sb_set_blocksize(struct super_block *, int); extern int sb_min_blocksize(struct super_block *, int); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c9956fac640e..2e7436eba882 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -31,18 +31,7 @@ static inline bool mapping_empty(struct address_space *mapping) * reclaim and LRU management. * * The caller is expected to hold the i_lock, but is not required to - * hold the i_pages lock, which usually protects cache state. That's - * because the i_lock and the list_lru lock that protect the inode and - * its LRU state don't nest inside the irq-safe i_pages lock. - * - * Cache deletions are performed under the i_lock, which ensures that - * when an inode goes empty, it will reliably get queued on the LRU. - * - * Cache additions do not acquire the i_lock and may race with this - * check, in which case we'll report the inode as shrinkable when it - * has cache pages. This is okay: the shrinker also checks the - * refcount and the referenced bit, which will be elevated or set in - * the process of adding new cache pages to an inode. + * hold the i_pages lock, which usually protects cache state. */ static inline bool mapping_shrinkable(struct address_space *mapping) { diff --git a/mm/filemap.c b/mm/filemap.c index 0d0d72ced961..3c10830cd396 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -143,6 +143,9 @@ static void page_cache_delete(struct address_space *mapping, page->mapping = NULL; /* Leave page->index set: truncation lookup relies upon it */ mapping->nrpages -= nr; + + if (mapping_shrinkable(mapping)) + mark_inode_shrinkable(mapping->host); } static void unaccount_page_cache_page(struct address_space *mapping, @@ -260,13 +263,9 @@ void delete_from_page_cache(struct page *page) struct address_space *mapping = page_mapping(page); BUG_ON(!PageLocked(page)); - spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __delete_from_page_cache(page, NULL); xa_unlock_irq(&mapping->i_pages); - if (mapping_shrinkable(mapping)) - inode_add_lru(mapping->host); - spin_unlock(&mapping->host->i_lock); page_cache_free_page(mapping, page); } @@ -332,6 +331,9 @@ static void page_cache_delete_batch(struct address_space *mapping, total_pages++; } mapping->nrpages -= total_pages; + + if (mapping_shrinkable(mapping)) + mark_inode_shrinkable(mapping->host); } void delete_from_page_cache_batch(struct address_space *mapping, @@ -342,7 +344,6 @@ void delete_from_page_cache_batch(struct address_space *mapping, if (!pagevec_count(pvec)) return; - spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); for (i = 0; i < pagevec_count(pvec); i++) { trace_mm_filemap_delete_from_page_cache(pvec->pages[i]); @@ -351,9 +352,6 @@ void delete_from_page_cache_batch(struct address_space *mapping, } page_cache_delete_batch(mapping, pvec); xa_unlock_irq(&mapping->i_pages); - if (mapping_shrinkable(mapping)) - inode_add_lru(mapping->host); - spin_unlock(&mapping->host->i_lock); for (i = 0; i < pagevec_count(pvec); i++) page_cache_free_page(mapping, pvec->pages[i]); diff --git a/mm/truncate.c b/mm/truncate.c index 950d73fa995d..0adce9df9100 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -40,18 +40,17 @@ static inline void __clear_shadow_entry(struct address_space *mapping, if (xas_load(&xas) != entry) return; xas_store(&xas, NULL); + + if (mapping_shrinkable(mapping)) + mark_inode_shrinkable(mapping->host); } static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, void *entry) { - spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __clear_shadow_entry(mapping, index, entry); xa_unlock_irq(&mapping->i_pages); - if (mapping_shrinkable(mapping)) - inode_add_lru(mapping->host); - spin_unlock(&mapping->host->i_lock); } /* @@ -77,10 +76,8 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, return; dax = dax_mapping(mapping); - if (!dax) { - spin_lock(&mapping->host->i_lock); + if (!dax) xa_lock_irq(&mapping->i_pages); - } for (i = j; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; @@ -99,12 +96,8 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, __clear_shadow_entry(mapping, index, page); } - if (!dax) { + if (!dax) xa_unlock_irq(&mapping->i_pages); - if (mapping_shrinkable(mapping)) - inode_add_lru(mapping->host); - spin_unlock(&mapping->host->i_lock); - } pvec->nr = j; } @@ -579,7 +572,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL)) return 0; - spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); if (PageDirty(page)) goto failed; @@ -587,9 +579,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) BUG_ON(page_has_private(page)); __delete_from_page_cache(page, NULL); xa_unlock_irq(&mapping->i_pages); - if (mapping_shrinkable(mapping)) - inode_add_lru(mapping->host); - spin_unlock(&mapping->host->i_lock); if (mapping->a_ops->freepage) mapping->a_ops->freepage(page); @@ -598,7 +587,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) return 1; failed: xa_unlock_irq(&mapping->i_pages); - spin_unlock(&mapping->host->i_lock); return 0; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 6dd5ef8a11bc..cc5d7cd75935 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1055,8 +1055,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); - if (!PageSwapCache(page)) - spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); /* * The non racy check for a busy page. @@ -1125,9 +1123,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, shadow = workingset_eviction(page, target_memcg); __delete_from_page_cache(page, shadow); xa_unlock_irq(&mapping->i_pages); - if (mapping_shrinkable(mapping)) - inode_add_lru(mapping->host); - spin_unlock(&mapping->host->i_lock); if (freepage != NULL) freepage(page); @@ -1137,8 +1132,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, cannot_free: xa_unlock_irq(&mapping->i_pages); - if (!PageSwapCache(page)) - spin_unlock(&mapping->host->i_lock); return 0; } diff --git a/mm/workingset.c b/mm/workingset.c index 9d3d2b4ce44d..1576b7ca3fbb 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -540,13 +540,6 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, goto out; } - if (!spin_trylock(&mapping->host->i_lock)) { - xa_unlock(&mapping->i_pages); - spin_unlock_irq(lru_lock); - ret = LRU_RETRY; - goto out; - } - list_lru_isolate(lru, item); __dec_lruvec_kmem_state(node, WORKINGSET_NODES); @@ -564,11 +557,10 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, xa_delete_node(node, workingset_update_node); __inc_lruvec_kmem_state(node, WORKINGSET_NODERECLAIM); + if (mapping_shrinkable(mapping)) + mark_inode_shrinkable(mapping->host); out_invalid: xa_unlock_irq(&mapping->i_pages); - if (mapping_shrinkable(mapping)) - inode_add_lru(mapping->host); - spin_unlock(&mapping->host->i_lock); ret = LRU_REMOVED_RETRY; out: cond_resched();
diff --git a/fs/inode.c b/fs/inode.c index 8830a727b0af..6b74701c1954 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -424,11 +424,20 @@ void ihold(struct inode *inode) } EXPORT_SYMBOL(ihold); -static void inode_lru_list_add(struct inode *inode) +static void __inode_add_lru(struct inode *inode, bool rotate) { + if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE)) + return; + if (atomic_read(&inode->i_count)) + return; + if (!(inode->i_sb->s_flags & SB_ACTIVE)) + return; + if (!mapping_shrinkable(&inode->i_data)) + return; + if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru)) this_cpu_inc(nr_unused); - else + else if (rotate) inode->i_state |= I_REFERENCED; } @@ -439,16 +448,11 @@ static void inode_lru_list_add(struct inode *inode) */ void inode_add_lru(struct inode *inode) { - if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC | - I_FREEING | I_WILL_FREE)) && - !atomic_read(&inode->i_count) && inode->i_sb->s_flags & SB_ACTIVE) - inode_lru_list_add(inode); + __inode_add_lru(inode, false); } - static void inode_lru_list_del(struct inode *inode) { - if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru)) this_cpu_dec(nr_unused); } @@ -724,10 +728,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty) /* * Isolate the inode from the LRU in preparation for freeing it. * - * Any inodes which are pinned purely because of attached pagecache have their - * pagecache removed. If the inode has metadata buffers attached to - * mapping->private_list then try to remove them. - * * If the inode has the I_REFERENCED flag set, then it means that it has been * used recently - the flag is set in iput_final(). When we encounter such an * inode, clear the flag and move it to the back of the LRU so it gets another @@ -743,31 +743,39 @@ static enum lru_status inode_lru_isolate(struct list_head *item, struct inode *inode = container_of(item, struct inode, i_lru); /* - * we are inverting the lru lock/inode->i_lock here, so use a trylock. - * If we fail to get the lock, just skip it. + * We are inverting the lru lock/inode->i_lock here, so use a + * trylock. If we fail to get the lock, just skip it. */ if (!spin_trylock(&inode->i_lock)) return LRU_SKIP; /* - * Referenced or dirty inodes are still in use. Give them another pass - * through the LRU as we canot reclaim them now. + * Inodes can get referenced, redirtied, or repopulated while + * they're already on the LRU, and this can make them + * unreclaimable for a while. Remove them lazily here; iput, + * sync, or the last page cache deletion will requeue them. */ if (atomic_read(&inode->i_count) || - (inode->i_state & ~I_REFERENCED)) { + (inode->i_state & ~I_REFERENCED) || + !mapping_shrinkable(&inode->i_data)) { list_lru_isolate(lru, &inode->i_lru); spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); return LRU_REMOVED; } - /* recently referenced inodes get one more pass */ + /* Recently referenced inodes get one more pass */ if (inode->i_state & I_REFERENCED) { inode->i_state &= ~I_REFERENCED; spin_unlock(&inode->i_lock); return LRU_ROTATE; } + /* + * On highmem systems, mapping_shrinkable() permits dropping + * page cache in order to free up struct inodes: lowmem might + * be under pressure before the cache inside the highmem zone. + */ if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { __iget(inode); spin_unlock(&inode->i_lock); @@ -1634,7 +1642,7 @@ static void iput_final(struct inode *inode) if (!drop && !(inode->i_state & I_DONTCACHE) && (sb->s_flags & SB_ACTIVE)) { - inode_add_lru(inode); + __inode_add_lru(inode, true); spin_unlock(&inode->i_lock); return; } diff --git a/fs/internal.h b/fs/internal.h index 6aeae7ef3380..3eb90dde62bd 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -146,7 +146,6 @@ extern int vfs_open(const struct path *, struct file *); * inode.c */ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); -extern void inode_add_lru(struct inode *inode); extern int dentry_needs_remove_privs(struct dentry *dentry); /* diff --git a/include/linux/fs.h b/include/linux/fs.h index 8652ed7cdce8..301cd0195036 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3214,6 +3214,7 @@ static inline void remove_inode_hash(struct inode *inode) } extern void inode_sb_list_add(struct inode *inode); +extern void inode_add_lru(struct inode *inode); extern int sb_set_blocksize(struct super_block *, int); extern int sb_min_blocksize(struct super_block *, int); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index e89df447fae3..c9956fac640e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -23,6 +23,56 @@ static inline bool mapping_empty(struct address_space *mapping) return xa_empty(&mapping->i_pages); } +/* + * mapping_shrinkable - test if page cache state allows inode reclaim + * @mapping: the page cache mapping + * + * This checks the mapping's cache state for the pupose of inode + * reclaim and LRU management. + * + * The caller is expected to hold the i_lock, but is not required to + * hold the i_pages lock, which usually protects cache state. That's + * because the i_lock and the list_lru lock that protect the inode and + * its LRU state don't nest inside the irq-safe i_pages lock. + * + * Cache deletions are performed under the i_lock, which ensures that + * when an inode goes empty, it will reliably get queued on the LRU. + * + * Cache additions do not acquire the i_lock and may race with this + * check, in which case we'll report the inode as shrinkable when it + * has cache pages. This is okay: the shrinker also checks the + * refcount and the referenced bit, which will be elevated or set in + * the process of adding new cache pages to an inode. + */ +static inline bool mapping_shrinkable(struct address_space *mapping) +{ + void *head; + + /* + * On highmem systems, there could be lowmem pressure from the + * inodes before there is highmem pressure from the page + * cache. Make inodes shrinkable regardless of cache state. + */ + if (IS_ENABLED(CONFIG_HIGHMEM)) + return true; + + /* Cache completely empty? Shrink away. */ + head = rcu_access_pointer(mapping->i_pages.xa_head); + if (!head) + return true; + + /* + * The xarray stores single offset-0 entries directly in the + * head pointer, which allows non-resident page cache entries + * to escape the shadow shrinker's list of xarray nodes. The + * inode shrinker needs to pick them up under memory pressure. + */ + if (!xa_is_node(head) && xa_is_value(head)) + return true; + + return false; +} + /* * Bits in mapping->flags. */ diff --git a/mm/filemap.c b/mm/filemap.c index 819d2589abef..0d0d72ced961 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -260,9 +260,13 @@ void delete_from_page_cache(struct page *page) struct address_space *mapping = page_mapping(page); BUG_ON(!PageLocked(page)); + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __delete_from_page_cache(page, NULL); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); page_cache_free_page(mapping, page); } @@ -338,6 +342,7 @@ void delete_from_page_cache_batch(struct address_space *mapping, if (!pagevec_count(pvec)) return; + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); for (i = 0; i < pagevec_count(pvec); i++) { trace_mm_filemap_delete_from_page_cache(pvec->pages[i]); @@ -346,6 +351,9 @@ void delete_from_page_cache_batch(struct address_space *mapping, } page_cache_delete_batch(mapping, pvec); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); for (i = 0; i < pagevec_count(pvec); i++) page_cache_free_page(mapping, pvec->pages[i]); diff --git a/mm/truncate.c b/mm/truncate.c index 95934c98259a..950d73fa995d 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -45,9 +45,13 @@ static inline void __clear_shadow_entry(struct address_space *mapping, static void clear_shadow_entry(struct address_space *mapping, pgoff_t index, void *entry) { + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); __clear_shadow_entry(mapping, index, entry); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); } /* @@ -73,8 +77,10 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, return; dax = dax_mapping(mapping); - if (!dax) + if (!dax) { + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); + } for (i = j; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; @@ -93,8 +99,12 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping, __clear_shadow_entry(mapping, index, page); } - if (!dax) + if (!dax) { xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); + } pvec->nr = j; } @@ -569,6 +579,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL)) return 0; + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); if (PageDirty(page)) goto failed; @@ -576,6 +587,9 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) BUG_ON(page_has_private(page)); __delete_from_page_cache(page, NULL); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); if (mapping->a_ops->freepage) mapping->a_ops->freepage(page); @@ -584,6 +598,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) return 1; failed: xa_unlock_irq(&mapping->i_pages); + spin_unlock(&mapping->host->i_lock); return 0; } diff --git a/mm/vmscan.c b/mm/vmscan.c index cc5d7cd75935..6dd5ef8a11bc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1055,6 +1055,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); + if (!PageSwapCache(page)) + spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); /* * The non racy check for a busy page. @@ -1123,6 +1125,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, shadow = workingset_eviction(page, target_memcg); __delete_from_page_cache(page, shadow); xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); if (freepage != NULL) freepage(page); @@ -1132,6 +1137,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, cannot_free: xa_unlock_irq(&mapping->i_pages); + if (!PageSwapCache(page)) + spin_unlock(&mapping->host->i_lock); return 0; } diff --git a/mm/workingset.c b/mm/workingset.c index 4f7a306ce75a..9d3d2b4ce44d 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -540,6 +540,13 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, goto out; } + if (!spin_trylock(&mapping->host->i_lock)) { + xa_unlock(&mapping->i_pages); + spin_unlock_irq(lru_lock); + ret = LRU_RETRY; + goto out; + } + list_lru_isolate(lru, item); __dec_lruvec_kmem_state(node, WORKINGSET_NODES); @@ -559,6 +566,9 @@ static enum lru_status shadow_lru_isolate(struct list_head *item, out_invalid: xa_unlock_irq(&mapping->i_pages); + if (mapping_shrinkable(mapping)) + inode_add_lru(mapping->host); + spin_unlock(&mapping->host->i_lock); ret = LRU_REMOVED_RETRY; out: cond_resched();
Historically (pre-2.5), the inode shrinker used to reclaim only empty inodes and skip over those that still contained page cache. This caused problems on highmem hosts: struct inode could put fill lowmem zones before the cache was getting reclaimed in the highmem zones. To address this, the inode shrinker started to strip page cache to facilitate reclaiming lowmem. However, this comes with its own set of problems: the shrinkers may drop actively used page cache just because the inodes are not currently open or dirty - think working with a large git tree. It further doesn't respect cgroup memory protection settings and can cause priority inversions between containers. Nowadays, the page cache also holds non-resident info for evicted cache pages in order to detect refaults. We've come to rely heavily on this data inside reclaim for protecting the cache workingset and driving swap behavior. We also use it to quantify and report workload health through psi. The latter in turn is used for fleet health monitoring, as well as driving automated memory sizing of workloads and containers, proactive reclaim and memory offloading schemes. The consequences of dropping page cache prematurely is that we're seeing subtle and not-so-subtle failures in all of the above-mentioned scenarios, with the workload generally entering unexpected thrashing states while losing the ability to reliably detect it. To fix this on non-highmem systems at least, going back to rotating inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7 ("mm: don't reclaim inodes with many attached pages")) and failed (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many attached pages"")). The issue is mostly that shrinker pools attract pressure based on their size, and when objects get skipped the shrinkers remember this as deferred reclaim work. This accumulates excessive pressure on the remaining inodes, and we can quickly eat into heavily used ones, or dirty ones that require IO to reclaim, when there potentially is plenty of cold, clean cache around still. Instead, this patch keeps populated inodes off the inode LRU in the first place - just like an open file or dirty state would. An otherwise clean and unused inode then gets queued when the last cache entry disappears. This solves the problem without reintroducing the reclaim issues, and generally is a bit more scalable than having to wade through potentially hundreds of thousands of busy inodes. Locking is a bit tricky because the locks protecting the inode state (i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are serialized through i_lock, taken before the i_pages lock, to make sure depopulated inodes are queued reliably. Additions may race with deletions, but we'll check again in the shrinker. If additions race with the shrinker itself, we're protected by the i_lock: if find_inode() or iput() win, the shrinker will bail on the elevated i_count or I_REFERENCED; if the shrinker wins and goes ahead with the inode, it will set I_FREEING and inhibit further igets(), which will cause the other side to create a new instance of the inode instead. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- fs/inode.c | 46 +++++++++++++++++++++---------------- fs/internal.h | 1 - include/linux/fs.h | 1 + include/linux/pagemap.h | 50 +++++++++++++++++++++++++++++++++++++++++ mm/filemap.c | 8 +++++++ mm/truncate.c | 19 ++++++++++++++-- mm/vmscan.c | 7 ++++++ mm/workingset.c | 10 +++++++++ 8 files changed, 120 insertions(+), 22 deletions(-)