Message ID: 20200226161404.14136-1-longman@redhat.com (mailing list archive)
Series: fs/dcache: Limit # of negative dentries
On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > A new sysctl parameter "dentry-dir-max" is introduced which accepts a > value of 0 (default) for no limit or a positive integer 256 and up. Small > dentry-dir-max numbers are forbidden to avoid excessive dentry count > checking which can impact system performance. This is always the wrong approach. A sysctl is just a way of blaming the sysadmin for us not being very good at programming. I agree that we need a way to limit the number of negative dentries. But that limit needs to be dynamic and depend on how the system is being used, not on how some overworked sysadmin has configured it. So we need an initial estimate for the number of negative dentries that we need for good performance. Maybe it's 1000. It doesn't really matter; it's going to change dynamically. Then we need a metric to let us know whether it needs to be increased. Perhaps that's "number of new negative dentries created in the last second". And we need to decide how much to increase it; maybe it's by 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on how high the recent rate of negative dentry creation has been. We also need a metric to let us know whether it needs to be decreased. I'm reluctant to say that memory pressure should be that metric because very large systems can let the number of dentries grow in an unbounded way. Perhaps that metric is "number of hits in the negative dentry cache in the last ten seconds". Again, we'll need to decide how much to shrink the target number by. If the number of negative dentries is at or above the target, then creating a new negative dentry means evicting an existing negative dentry. If the number of negative dentries is lower than the target, then we can just create a new one. Of course, memory pressure (and shrinking the target number) should cause negative dentries to be evicted from the old end of the LRU list. But memory pressure shouldn't cause us to change the target number; the target number is what we think we need to keep the system running smoothly.
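A minimal userspace sketch of the feedback loop described above, assuming hypothetical per-interval counters ("negative dentries created in the last second", "hits on negative dentries in the last ten seconds") fed in from the dcache. The 1000 starting point and the 10-100% growth range come from the email; the shrink step and every other constant are invented for illustration.

```c
/* Illustrative only: not kernel code.  The counters are assumed to be
 * maintained elsewhere; all constants are placeholders. */
#include <stdio.h>

static unsigned long target = 1000;	/* initial estimate; it self-corrects */

/* Run once per sampling interval. */
static void retune_negative_target(unsigned long created_last_sec,
				   unsigned long hits_last_10s)
{
	/* Heavy creation pressure: grow the target by 10%..100%. */
	if (created_last_sec > target / 10) {
		unsigned long pct = 10 + 90 * created_last_sec / (target + 1);

		if (pct > 100)
			pct = 100;
		target += target * pct / 100;
	}

	/* Few hits on negative dentries lately: shrink the target. */
	if (hits_last_10s < target / 100 && target > 1000)
		target -= target / 4;
}

/* Creation path: at or above the target, evict the oldest negative dentry
 * from the LRU instead of letting the count grow. */
static int must_evict_before_create(unsigned long nr_negative)
{
	return nr_negative >= target;
}

int main(void)
{
	retune_negative_target(600, 200);	/* a busy second */
	printf("target is now %lu\n", target);
	printf("evict before creating? %d\n", must_evict_before_create(1800));
	return 0;
}
```

Memory pressure would still walk the LRU from the old end as usual; only the creation path consults the target.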
On 2/26/20 11:29 AM, Matthew Wilcox wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a >> value of 0 (default) for no limit or a positive integer 256 and up. Small >> dentry-dir-max numbers are forbidden to avoid excessive dentry count >> checking which can impact system performance. > This is always the wrong approach. A sysctl is just a way of blaming > the sysadmin for us not being very good at programming. > > I agree that we need a way to limit the number of negative dentries. > But that limit needs to be dynamic and depend on how the system is being > used, not on how some overworked sysadmin has configured it. > > So we need an initial estimate for the number of negative dentries that > we need for good performance. Maybe it's 1000. It doesn't really matter; > it's going to change dynamically. > > Then we need a metric to let us know whether it needs to be increased. > Perhaps that's "number of new negative dentries created in the last > second". And we need to decide how much to increase it; maybe it's by > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > how high the recent rate of negative dentry creation has been. > > We also need a metric to let us know whether it needs to be decreased. > I'm reluctant to say that memory pressure should be that metric because > very large systems can let the number of dentries grow in an unbounded > way. Perhaps that metric is "number of hits in the negative dentry > cache in the last ten seconds". Again, we'll need to decide how much > to shrink the target number by. > > If the number of negative dentries is at or above the target, then > creating a new negative dentry means evicting an existing negative dentry. > If the number of negative dentries is lower than the target, then we > can just create a new one. > > Of course, memory pressure (and shrinking the target number) should > cause negative dentries to be evicted from the old end of the LRU list. > But memory pressure shouldn't cause us to change the target number; > the target number is what we think we need to keep the system running > smoothly. > Thanks for the quick response. I agree that auto-tuning, so that the system administrator doesn't have to worry about it, will be the best approach if it is implemented in the right way. However, it is hard to do it right. How about letting users specify a cap on the amount of total system memory allowed for negative dentries, like one of my previous patches? Internally, there is a predefined minimum and maximum for dentry-dir-max. We sample the total negative dentry count periodically and adjust the dentry-dir-max accordingly. Specifying a percentage of total system memory is more intuitive than just specifying a hard number for dentry-dir-max. Still, some user input is required. What do you think? Cheers, Longman
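A rough sketch of what that could look like, assuming a hypothetical percentage knob (`neg_dentry_pct`), an approximate per-dentry slab size, and invented floor/ceiling values for dentry-dir-max; the sampling interval and step sizes are likewise made up.

```c
#include <stdio.h>

#define DENTRY_SIZE	192UL		/* rough slab object size, bytes */
#define DIR_MAX_FLOOR	256UL		/* matches the patch's minimum */
#define DIR_MAX_CEIL	(1UL << 20)	/* arbitrary upper clamp */

/* Turn "X percent of RAM" into a global negative-dentry budget. */
static unsigned long neg_dentry_budget(unsigned long totalram_bytes,
				       unsigned int neg_dentry_pct)
{
	return totalram_bytes / 100 * neg_dentry_pct / DENTRY_SIZE;
}

/* Periodic sampling: nudge dentry-dir-max toward staying within budget. */
static unsigned long resample_dir_max(unsigned long dir_max,
				      unsigned long nr_negative,
				      unsigned long budget)
{
	if (nr_negative > budget)
		dir_max -= dir_max / 4;
	else if (nr_negative < budget / 2)
		dir_max += dir_max / 4;

	if (dir_max < DIR_MAX_FLOOR)
		dir_max = DIR_MAX_FLOOR;
	if (dir_max > DIR_MAX_CEIL)
		dir_max = DIR_MAX_CEIL;
	return dir_max;
}

int main(void)
{
	unsigned long budget = neg_dentry_budget(16UL << 30, 2);  /* 2% of 16 GiB */

	printf("budget=%lu new dir_max=%lu\n", budget,
	       resample_dir_max(4096, budget + 1, budget));
	return 0;
}
```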
On Wed, Feb 26, 2020 at 02:19:59PM -0500, Waiman Long wrote: > On 2/26/20 11:29 AM, Matthew Wilcox wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a > >> value of 0 (default) for no limit or a positive integer 256 and up. Small > >> dentry-dir-max numbers are forbidden to avoid excessive dentry count > >> checking which can impact system performance. > > This is always the wrong approach. A sysctl is just a way of blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative dentries. > > But that limit needs to be dynamic and depend on how the system is being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries that > > we need for good performance. Maybe it's 1000. It doesn't really matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > > how high the recent rate of negative dentry creation has been. > > > > We also need a metric to let us know whether it needs to be decreased. > > I'm reluctant to say that memory pressure should be that metric because > > very large systems can let the number of dentries grow in an unbounded > > way. Perhaps that metric is "number of hits in the negative dentry > > cache in the last ten seconds". Again, we'll need to decide how much > > to shrink the target number by. > > > > If the number of negative dentries is at or above the target, then > > creating a new negative dentry means evicting an existing negative dentry. > > If the number of negative dentries is lower than the target, then we > > can just create a new one. > > > > Of course, memory pressure (and shrinking the target number) should > > cause negative dentries to be evicted from the old end of the LRU list. > > But memory pressure shouldn't cause us to change the target number; > > the target number is what we think we need to keep the system running > > smoothly. > > > Thanks for the quick response. > > I agree that auto-tuning so that the system administrator don't have to > worry about it will be the best approach if it is implemented in the > right way. However, it is hard to do it right. > > How about letting users specify a cap on the amount of total system > memory allowed for negative dentries like one of my previous patchs. > Internally, there is a predefined minimum and maximum for > dentry-dir-max. We sample the total negative dentry counts periodically > and adjust the dentry-dir-max accordingly. > > Specifying a percentage of total system memory is more intuitive than > just specifying a hard number for dentry-dir-max. Still some user input > is required. If you want to base the whole thing on a per-directory target number, or a percentage of the system memory target (rather than my suggestion of a total # of negative dentries), that seems reasonable. What's not reasonable is expecting the sysadmin to be able to either predict the workload, or react to a changing workload in sufficient time. The system has to be self-tuning. Just look how long stale information stays around about how to tune your Linux system. 
Here's an article from 2018 suggesting using the 'intr' option for NFS mounts: https://kb.netapp.com/app/answers/answer_view/a_id/1004893/~/hard-mount-vs-soft-mount- I made that a no-op in 2007. Any tunable you add to Linux immediately becomes a cargo-cult solution to any problem people are having.
On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a >> value of 0 (default) for no limit or a positive integer 256 and up. Small >> dentry-dir-max numbers are forbidden to avoid excessive dentry count >> checking which can impact system performance. > > This is always the wrong approach. A sysctl is just a way of blaming > the sysadmin for us not being very good at programming. > > I agree that we need a way to limit the number of negative dentries. > But that limit needs to be dynamic and depend on how the system is being > used, not on how some overworked sysadmin has configured it. > > So we need an initial estimate for the number of negative dentries that > we need for good performance. Maybe it's 1000. It doesn't really matter; > it's going to change dynamically. > > Then we need a metric to let us know whether it needs to be increased. > Perhaps that's "number of new negative dentries created in the last > second". And we need to decide how much to increase it; maybe it's by > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > how high the recent rate of negative dentry creation has been. > > We also need a metric to let us know whether it needs to be decreased. > I'm reluctant to say that memory pressure should be that metric because > very large systems can let the number of dentries grow in an unbounded > way. Perhaps that metric is "number of hits in the negative dentry > cache in the last ten seconds". Again, we'll need to decide how much > to shrink the target number by. OK, so now instead of a single tunable parameter we need three, because these numbers are totally made up and nobody knows the right values. :-) Defaulting the limit to "disabled/no limit" also has the problem that 99.99% of users won't even know this tunable exists, let alone how to set it correctly, so they will continue to see these problems, and the code may as well not exist (i.e. pure overhead), while Waiman has a better idea today of what would be reasonable defaults. I definitely agree that a single fixed value will be wrong for every system except the original developer's. Making the maximum default to some reasonable fraction of the system size, rather than a fixed value, is probably best to start. Something like this as a starting point: /* Allow a reasonable minimum number of negative entries, * but proportionately more if the directory/dcache is large. */ dir_negative_max = max(num_dir_entries / 16, 1024); total_negative_max = max(totalram_pages / 32, total_dentries / 8); (Waiman should decide actual values based on where the problem was hit previously), and include tunables to change the limits for testing. Ideally there would also be a dir ioctl that allows fetching the current positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64, /usr/share/man/man*) to see what these values are. Otherwise there is no way to determine whether the limits used are any good or not. Dynamic limits are hard to get right, and incorrect state machines can lead to wild swings in behaviour due to unexpected feedback. It isn't clear to me that adjusting the limit based on the current rate of negative dentry creation even makes sense. If there are a lot of negative entries being created, that is when you'd want to _stop_ allowing more to be added. 
We don't have any limit today, so imposing some large-but-still-reasonable upper limit on negative entries will catch the runaway negative dcache case that was the original need of this functionality without adding a lot of complexity that we may not need at all. > If the number of negative dentries is at or above the target, then > creating a new negative dentry means evicting an existing negative dentry. > If the number of negative dentries is lower than the target, then we > can just create a new one. > > Of course, memory pressure (and shrinking the target number) should > cause negative dentries to be evicted from the old end of the LRU list. > But memory pressure shouldn't cause us to change the target number; > the target number is what we think we need to keep the system running > smoothly. Cheers, Andreas
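Back-of-envelope numbers for the total_negative_max heuristic quoted above, assuming 4 KiB pages and the ~192-byte dentry slab objects seen later in the thread; they illustrate the scale that the follow-up objects to on terabyte-class machines.

```c
#include <stdio.h>

int main(void)
{
	unsigned long ram_bytes = 1UL << 40;			/* 1 TiB box */
	unsigned long totalram_pages = ram_bytes / 4096;	/* 4 KiB pages */
	unsigned long total_negative_max = totalram_pages / 32;

	/* 1 TiB -> ~8.4 million negative dentries, ~1.5 GiB of dentry slab. */
	printf("pages=%lu max=%lu (~%lu MiB of slab)\n",
	       totalram_pages, total_negative_max,
	       total_negative_max * 192 / (1024 * 1024));
	return 0;
}
```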
On Wed, Feb 26, 2020 at 02:28:50PM -0700, Andreas Dilger wrote: > On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> wrote: > > This is always the wrong approach. A sysctl is just a way of blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative dentries. > > But that limit needs to be dynamic and depend on how the system is being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries that > > we need for good performance. Maybe it's 1000. It doesn't really matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > > how high the recent rate of negative dentry creation has been. > > > > We also need a metric to let us know whether it needs to be decreased. > > I'm reluctant to say that memory pressure should be that metric because > > very large systems can let the number of dentries grow in an unbounded > > way. Perhaps that metric is "number of hits in the negative dentry > > cache in the last ten seconds". Again, we'll need to decide how much > > to shrink the target number by. > > OK, so now instead of a single tunable parameter we need three, because > these numbers are totally made up and nobody knows the right values. :-) > Defaulting the limit to "disabled/no limit" also has the problem that > 99.99% of users won't even know this tunable exists, let alone how to > set it correctly, so they will continue to see these problems, and the > code may as well not exist (i.e. pure overhead), while Waiman has a > better idea today of what would be reasonable defaults. I never said "no limit". I just said to start at some fairly random value and not worry about where you start because it'll correct to where this system needs it to be. As long as it converges like loadavg does, it'll be fine. It needs a fairly large "don't change the target" area, and it needs to react quickly to real changes in a system's workload. > I definitely agree that a single fixed value will be wrong for every > system except the original developer's. Making the maximum default to > some reasonable fraction of the system size, rather than a fixed value, > is probably best to start. Something like this as a starting point: > > /* Allow a reasonable minimum number of negative entries, > * but proportionately more if the directory/dcache is large. > */ > dir_negative_max = max(num_dir_entries / 16, 1024); > total_negative_max = max(totalram_pages / 32, total_dentries / 8); Those kinds of things are garbage on large machines. With a terabyte of RAM, you can end up with tens of millions of dentries clogging up the system. There _is_ an upper limit on the useful number of dentries to keep around. > (Waiman should decide actual values based on where the problem was hit > previously), and include tunables to change the limits for testing. > > Ideally there would also be a dir ioctl that allows fetching the current > positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64, > /usr/share/man/man*) to see what these values are. Otherwise there is > no way to determine whether the limits used are any good or not. 
It definitely needs to be instrumented for testing, but no, not ioctls. tracepoints, perhaps. > Dynamic limits are hard to get right, and incorrect state machines can lead > to wild swings in behaviour due to unexpected feedback. It isn't clear to > me that adjusting the limit based on the current rate of negative dentry > creation even makes sense. If there are a lot of negative entries being > created, that is when you'd want to _stop_ allowing more to be added. That doesn't make sense. What you really want to know is "If my dcache had twice as many entries in it, would that significantly reduce the thrash of new entries being created". In the page cache, we end up with a double LRU where once-used entries fall off the list quickly but twice-or-more used entries get to stay around for a bit longer. Maybe we could do something like that; keep a victim cache for recently evicted dentries, and if we get a large hit rate in the victim cache, expand the size of the primary cache.
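A sketch of the victim-cache idea, assuming it is enough to remember the name hashes of recently evicted negative dentries in a small fixed ring; the ring size, the sampling window and the 10% hit-rate trigger are invented, and a real implementation would not do a linear scan.

```c
#include <stdio.h>

#define VICTIM_SLOTS	4096		/* small ring of evicted-name hashes */

static unsigned long victim[VICTIM_SLOTS];
static unsigned int victim_head;
static unsigned long victim_hits, misses;
static unsigned long target = 1000;	/* primary-cache size target */

/* Remember a negative dentry we just evicted from the primary cache. */
static void victim_remember(unsigned long name_hash)
{
	victim[victim_head++ % VICTIM_SLOTS] = name_hash;
}

/* On a dcache miss: did we throw this name away recently? */
static void victim_check(unsigned long name_hash)
{
	misses++;
	for (unsigned int i = 0; i < VICTIM_SLOTS; i++) {
		if (victim[i] == name_hash) {
			victim_hits++;
			break;
		}
	}

	/* Many misses would have been hits with a bigger cache: grow it. */
	if (misses >= 1024) {
		if (victim_hits * 10 > misses)	/* >10% victim hit rate */
			target += target / 2;
		victim_hits = misses = 0;
	}
}

int main(void)
{
	victim_remember(42);			/* pretend this name was evicted */
	for (int i = 0; i < 1024; i++)
		victim_check(42);		/* every miss re-finds the victim */
	printf("target grew to %lu\n", target);
	return 0;
}
```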
On Wed, Feb 26, 2020 at 01:45:07PM -0800, Matthew Wilcox wrote: > had twice as many entries in it, would that significantly reduce the > thrash of new entries being created". In the page cache, we end up > with a double LRU where once-used entries fall off the list quickly > but twice-or-more used entries get to stay around for a bit longer. > Maybe we could do something like that; keep a victim cache for recently > evicted dentries, and if we get a large hit rate in the victim cache, > expand the size of the primary cache. You know, I've been saying exactly the same thing about the inode LRU in response to people trying to hack behaviour out of the shrinker that is triggered by the working set getting trashed by excessive creation of single use inodes (i.e. large scale directory traversal). IOWs, both have the same problem with working set retention in the face of excessive growth pressure. So, you know, perhaps two caches with the same problem, that use the same LRU implementation, could solve the same problem by enhancing the generic LRU code they use to an active/inactive style clocking LRU like the page LRUs? Cheers, Dave.
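A toy second-chance ("clock") model of the active/inactive behaviour being discussed: entries referenced again before the hand comes around survive another lap, while single-use entries are reclaimed first. Everything here is illustrative userspace C, not list_lru or page-LRU code.

```c
#include <stdbool.h>
#include <stdio.h>

#define SLOTS 8				/* tiny cache for demonstration */

struct slot {
	int id;				/* stand-in for a dentry or inode */
	bool in_use;
	bool referenced;		/* set on each cache hit */
};

static struct slot ring[SLOTS];
static unsigned int hand;

/* Cache hit: give the entry a second chance at the next sweep. */
static void touch(int id)
{
	for (unsigned int i = 0; i < SLOTS; i++)
		if (ring[i].in_use && ring[i].id == id)
			ring[i].referenced = true;
}

/* Claim a slot, evicting the first entry the hand finds unreferenced. */
static struct slot *claim_slot(void)
{
	for (;;) {
		struct slot *s = &ring[hand++ % SLOTS];

		if (!s->in_use)
			return s;
		if (s->referenced) {
			s->referenced = false;	/* demote; survives one more lap */
			continue;
		}
		return s;			/* single-use entry: reclaim it */
	}
}

static void insert(int id)
{
	struct slot *s = claim_slot();

	s->id = id;
	s->in_use = true;
	s->referenced = false;
}

int main(void)
{
	for (int i = 0; i < 20; i++)
		insert(i);		/* streaming single-use entries */
	touch(19);			/* 19 gets used a second time */
	insert(100);			/* evicts an unreferenced entry, not 19 */

	for (unsigned int i = 0; i < SLOTS; i++)
		printf("%d ", ring[i].id);
	printf("\n");
	return 0;
}
```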
On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > As there is no limit for negative dentries, it is possible that a sizeable > portion of system memory can be tied up in dentry cache slabs. Dentry slabs > are generally recalimable if the dentries are in the LRUs. Still having > too much memory used up by dentries can be problematic: I don't get it. Why isn't the solution simply "constrain the application generating unbound numbers of dentries to a memcg"? Then when the memcg runs out of memory, it will start reclaiming the dentries that were allocated inside the memcg that are using all its resources, thereby preventing unbound growth of the dentry cache. I mean, this sort of resource control is exactly what memcgs are supposed to be used for and are already used for. I don't see why we need all this complexity for global dentry resource management when memcgs should already provide an effective means of managing and placing bounds on the amount of memory any specific application can use... Cheers, Dave.
On Wed, 2020-02-26 at 14:28 -0700, Andreas Dilger wrote: > On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> > wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > > > A new sysctl parameter "dentry-dir-max" is introduced which > > > accepts a > > > value of 0 (default) for no limit or a positive integer 256 and > > > up. Small > > > dentry-dir-max numbers are forbidden to avoid excessive dentry > > > count > > > checking which can impact system performance. > > > > This is always the wrong approach. A sysctl is just a way of > > blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative > > dentries. > > But that limit needs to be dynamic and depend on how the system is > > being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries > > that > > we need for good performance. Maybe it's 1000. It doesn't really > > matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be > > increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's > > by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending > > on > > how high the recent rate of negative dentry creation has been. > > > > We also need a metric to let us know whether it needs to be > > decreased. > > I'm reluctant to say that memory pressure should be that metric > > because > > very large systems can let the number of dentries grow in an > > unbounded > > way. Perhaps that metric is "number of hits in the negative dentry > > cache in the last ten seconds". Again, we'll need to decide how > > much > > to shrink the target number by. > > OK, so now instead of a single tunable parameter we need three, > because > these numbers are totally made up and nobody knows the right values. > :-) > Defaulting the limit to "disabled/no limit" also has the problem that > 99.99% of users won't even know this tunable exists, let alone how to > set it correctly, so they will continue to see these problems, and > the > code may as well not exist (i.e. pure overhead), while Waiman has a > better idea today of what would be reasonable defaults. Why have these at all. Not all file systems even produce negative hashed dentries. The most beneficial use of them is to improve performance of rapid fire lookups for non-existent names. Longer lived negative hashed dentries don't give much benefit at all unless they suddenly have lots of hits and that would cost a single allocation on the first lookup if the dentry ttl expired and the dentry discarded. A ttl (say jiffies) set at appropriate times could be a better choice all round, no sysctl values at all. Ian
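A sketch of the TTL idea, using wall-clock time and an arbitrary 30-second window purely for illustration; in the kernel this would presumably be a jiffies stamp carried on the dentry (something like the existing d_time field) and checked from lookup and the pruning path.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define NEG_DENTRY_TTL	30	/* seconds; the exact value should not be critical */

struct neg_dentry {
	time_t stamp;		/* set at creation, refreshed on each hit */
};

static void neg_dentry_touch(struct neg_dentry *d)
{
	d->stamp = time(NULL);
}

/* Called from lookup or from pruning: discard once the stamp is stale. */
static bool neg_dentry_expired(const struct neg_dentry *d)
{
	return time(NULL) - d->stamp > NEG_DENTRY_TTL;
}

int main(void)
{
	struct neg_dentry d;

	neg_dentry_touch(&d);
	printf("expired: %d\n", neg_dentry_expired(&d));	/* 0: still fresh */
	return 0;
}
```

A frequently hit negative dentry keeps getting its stamp refreshed and stays; a one-shot lookup costs at most one re-allocation after the TTL lapses, which is the trade-off described above.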
On 2/26/20 8:29 AM, Matthew Wilcox wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a >> value of 0 (default) for no limit or a positive integer 256 and up. Small >> dentry-dir-max numbers are forbidden to avoid excessive dentry count >> checking which can impact system performance. > > This is always the wrong approach. A sysctl is just a way of blaming > the sysadmin for us not being very good at programming. > > I agree that we need a way to limit the number of negative dentries. > But that limit needs to be dynamic and depend on how the system is being > used, not on how some overworked sysadmin has configured it. > > So we need an initial estimate for the number of negative dentries that > we need for good performance. Maybe it's 1000. It doesn't really matter; > it's going to change dynamically. > > Then we need a metric to let us know whether it needs to be increased. > Perhaps that's "number of new negative dentries created in the last > second". And we need to decide how much to increase it; maybe it's by > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > how high the recent rate of negative dentry creation has been. There are pitfalls to this approach as well. Consider what libnss does every time it starts up (via curl in this case) # cat /proc/sys/fs/dentry-state 3154271 3131421 45 0 2863333 0 # for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done # cat /proc/sys/fs/dentry-state 3170738 3147844 45 0 2879882 0 voila, 16k more negative dcache entries, thanks to: https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390 i.e. each time it inits, it will intentionally create up to 10,000 negative dentries which will never be looked up again. I /think/ the original intent of this work was to limit such rogue applications, so scaling with use probably isn't the way to go. -Eric
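A small helper for watching an experiment like the one above: on kernels new enough to report it, the fifth field of /proc/sys/fs/dentry-state is the negative dentry count (2,863,333 growing to 2,879,882 in the run above, i.e. the ~16k new entries mentioned).

```c
#include <stdio.h>

int main(void)
{
	long nr_dentry, nr_unused, age_limit, want_pages, nr_negative, pad;
	FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

	if (!f || fscanf(f, "%ld %ld %ld %ld %ld %ld", &nr_dentry, &nr_unused,
			 &age_limit, &want_pages, &nr_negative, &pad) != 6) {
		perror("dentry-state");
		return 1;
	}
	fclose(f);
	printf("dentries=%ld unused=%ld negative=%ld\n",
	       nr_dentry, nr_unused, nr_negative);
	return 0;
}
```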
On Thu, Feb 27, 2020 at 11:04:40AM -0800, Eric Sandeen wrote: > On 2/26/20 8:29 AM, Matthew Wilcox wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a > >> value of 0 (default) for no limit or a positive integer 256 and up. Small > >> dentry-dir-max numbers are forbidden to avoid excessive dentry count > >> checking which can impact system performance. > > > > This is always the wrong approach. A sysctl is just a way of blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative dentries. > > But that limit needs to be dynamic and depend on how the system is being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries that > > we need for good performance. Maybe it's 1000. It doesn't really matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > > how high the recent rate of negative dentry creation has been. > > There are pitfalls to this approach as well. Consider what libnss > does every time it starts up (via curl in this case) > > # cat /proc/sys/fs/dentry-state > 3154271 3131421 45 0 2863333 0 > # for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done > # cat /proc/sys/fs/dentry-state > 3170738 3147844 45 0 2879882 0 > > voila, 16k more negative dcache entries, thanks to: > > https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390 > > i.e. each time it inits, it will intentionally create up to 10,000 negative > dentries which will never be looked up again. Sandboxing via memcg restricted cgroups means users and/or applications cannot create unbound numbers of negative dentries, and that largely solves this problem. For a system daemons whose environment is controlled by a systemd unit file, this should be pretty trivial to do, and memcg directed memory reclaim will control negative dentry buildup. For short-lived applications, teardown of the cgroup will free all the negative dentries it created - they don't hang around forever. For long lived applications, negative dentries are bound by the application memcg limits, and buildup will only affect the applications own performance, not that of the whole system. IOWs, I'd expect this sort of resource control problem to be solved at the user, application and/or distro level, not with a huge kernel hammer. > I /think/ the original intent of this work was to limit such rogue > applications, so scaling with use probably isn't the way to go. The original intent was to prevent problems on old kernels that supported terabytes of memory but could not use cgroup/memcg infrastructure to isolate and contain negative dentry growth. That was a much simpler, targeted negative dentry limiting solution, not the ... craziness that can be found in this patchset. Cheers, Dave.
On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > Not all file systems even produce negative hashed dentries. > > The most beneficial use of them is to improve performance of rapid > fire lookups for non-existent names. Longer lived negative hashed > dentries don't give much benefit at all unless they suddenly have > lots of hits and that would cost a single allocation on the first > lookup if the dentry ttl expired and the dentry discarded. > > A ttl (say jiffies) set at appropriate times could be a better > choice all round, no sysctl values at all. The canonical argument in favour of negative dentries is to improve application startup time as every application searches the library path for the same libraries. Only they don't do that any more: $ strace -e file cat /dev/null execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 So, are they still useful? Or should we, say, keep at most 100 around?
On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote: > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > Not all file systems even produce negative hashed dentries. > > > > The most beneficial use of them is to improve performance of rapid > > fire lookups for non-existent names. Longer lived negative hashed > > dentries don't give much benefit at all unless they suddenly have > > lots of hits and that would cost a single allocation on the first > > lookup if the dentry ttl expired and the dentry discarded. > > > > A ttl (say jiffies) set at appropriate times could be a better > > choice all round, no sysctl values at all. > > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library > path > for the same libraries. Only they don't do that any more: > > $ strace -e file cat /dev/null > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars > */) = 0 > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or > directory) > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", > O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/usr/lib/locale/locale-archive", > O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 > > So, are they still useful? Or should we, say, keep at most 100 > around? Who knows how old apps will be on distros., ;) But I don't think it matters. The VFS will (should) work fine without a minimum negative hashed dentry count (and hashed since unhashed negative dentries are summarily executed on final dput()) and a ttl should keep frequently used ones around long enough to satisfy this sort of thing should it be needed. Even the ttl value should be resilient to a large range of values, just not so much very small ones. Ian
On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote: > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > Not all file systems even produce negative hashed dentries. > > > > The most beneficial use of them is to improve performance of rapid > > fire lookups for non-existent names. Longer lived negative hashed > > dentries don't give much benefit at all unless they suddenly have > > lots of hits and that would cost a single allocation on the first > > lookup if the dentry ttl expired and the dentry discarded. > > > > A ttl (say jiffies) set at appropriate times could be a better > > choice all round, no sysctl values at all. > > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library path > for the same libraries. Only they don't do that any more: Tell that to scripts that keep looking through $PATH for binaries each time they are run. Tell that to cc(1) looking through include path, etc. Ian, autofs is deeply pathological in that respect; that's OK, since it has very unusual needs, but please don't use it as a model for anything else - its needs *are* unusual.
On Fri, 2020-02-28 at 12:16 +0800, Ian Kent wrote: > On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote: > > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > > Not all file systems even produce negative hashed dentries. > > > > > > The most beneficial use of them is to improve performance of > > > rapid > > > fire lookups for non-existent names. Longer lived negative hashed > > > dentries don't give much benefit at all unless they suddenly have > > > lots of hits and that would cost a single allocation on the first > > > lookup if the dentry ttl expired and the dentry discarded. > > > > > > A ttl (say jiffies) set at appropriate times could be a better > > > choice all round, no sysctl values at all. > > > > The canonical argument in favour of negative dentries is to improve > > application startup time as every application searches the library > > path > > for the same libraries. Only they don't do that any more: > > > > $ strace -e file cat /dev/null > > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars > > */) = 0 > > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file > > or > > directory) > > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 > > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", > > O_RDONLY|O_CLOEXEC) = 3 > > openat(AT_FDCWD, "/usr/lib/locale/locale-archive", > > O_RDONLY|O_CLOEXEC) = 3 > > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 > > > > So, are they still useful? Or should we, say, keep at most 100 > > around? > > Who knows how old apps will be on distros., ;) Or what admins put in the PATH, I've seen oddness in that a lot. > > But I don't think it matters. And I don't think I made my answer to the question clear. I don't think setting a minimum matters but there are other sources of a possibly significant number of lookups on paths that don't exist. I've seen evidence recently (although I suspect unfounded) that systemd can generate lots of these lookups at times. And let's not forget that file systems are the primary source of these and not all create them on lookups. I may be mistaken, but I think ext4 does not while xfs definitely does. The more important metric I think is calculating a sensible maximum to be pruned to prevent getting bogged down as there could be times when a lot of these are present. After all this is meant to be an iterative pruning measure. Ian
On Fri, 2020-02-28 at 04:22 +0000, Al Viro wrote: > On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote: > > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > > Not all file systems even produce negative hashed dentries. > > > > > > The most beneficial use of them is to improve performance of > > > rapid > > > fire lookups for non-existent names. Longer lived negative hashed > > > dentries don't give much benefit at all unless they suddenly have > > > lots of hits and that would cost a single allocation on the first > > > lookup if the dentry ttl expired and the dentry discarded. > > > > > > A ttl (say jiffies) set at appropriate times could be a better > > > choice all round, no sysctl values at all. > > > > The canonical argument in favour of negative dentries is to improve > > application startup time as every application searches the library > > path > > for the same libraries. Only they don't do that any more: > > Tell that to scripts that keep looking through $PATH for > binaries each time they are run. Tell that to cc(1) looking through > include path, etc. > > Ian, autofs is deeply pathological in that respect; that's OK, > since it has very unusual needs, but please don't use it as a model > for anything else - its needs *are* unusual. Ok, but my thoughts aren't based on autofs behaviours. But it sounds like you don't believe this is a sensible suggestion and you would know best so ... Ian
On Fri, Feb 28, 2020 at 12:36:09PM +0800, Ian Kent wrote: > And let's not forget that file systems are the primary > source of these and not all create them on lookups. > I may be mistaken, but I think ext4 does not while xfs > definitely does. Both ext4 and xfs bloody well *DO* create hashed negative dentries on lookups. There is a pathological case when they are trying to be case-insensitive (and in that situation we are SOL - if somebody insists upon mounting with -o make-it-suck, that's what they bloody well get). Casefondling idiocy aside, negative lookups are hashed. On all normal filesystems. Look for d_splice_alias() getting passed NULL inode - that's where ->lookup() instances normally create those.
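A schematic ->lookup() in the pattern Al points at; foo_find_ino() and foo_iget() are hypothetical stand-ins for a real filesystem's directory search and inode read, and error handling is omitted. The point is only the last line: when the name is not found, d_splice_alias() is handed a NULL inode and hashes the dentry as a negative entry.

```c
#include <linux/fs.h>
#include <linux/dcache.h>

/* Hypothetical helpers: search the on-disk directory, read an inode. */
static unsigned long foo_find_ino(struct inode *dir, const struct qstr *name);
static struct inode *foo_iget(struct super_block *sb, unsigned long ino);

static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
				 unsigned int flags)
{
	struct inode *inode = NULL;
	unsigned long ino = foo_find_ino(dir, &dentry->d_name);

	if (ino)
		inode = foo_iget(dir->i_sb, ino);

	/* inode == NULL here => the dentry is hashed as a negative entry */
	return d_splice_alias(inode, dentry);
}
```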
On 2/27/20 10:34 PM, Matthew Wilcox wrote: > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: >> Not all file systems even produce negative hashed dentries. >> >> The most beneficial use of them is to improve performance of rapid >> fire lookups for non-existent names. Longer lived negative hashed >> dentries don't give much benefit at all unless they suddenly have >> lots of hits and that would cost a single allocation on the first >> lookup if the dentry ttl expired and the dentry discarded. >> >> A ttl (say jiffies) set at appropriate times could be a better >> choice all round, no sysctl values at all. > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library path > for the same libraries. Only they don't do that any more: > > $ strace -e file cat /dev/null > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0 > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 > > So, are they still useful? Or should we, say, keep at most 100 around? > It is the shell that does the path search, not the command itself. Cheers, Longman
On Fri, Feb 28, 2020 at 10:32:02AM -0500, Waiman Long wrote:
> On 2/27/20 10:34 PM, Matthew Wilcox wrote:
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library path
                                                               ^^^^^^^
> > for the same libraries. Only they don't do that any more:
                 ^^^^^^^^^
>
> It is the shell that does the path search, not the command itself.
On 2/27/20 3:30 AM, Dave Chinner wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> As there is no limit for negative dentries, it is possible that a sizeable >> portion of system memory can be tied up in dentry cache slabs. Dentry slabs >> are generally recalimable if the dentries are in the LRUs. Still having >> too much memory used up by dentries can be problematic: > I don't get it. > > Why isn't the solution simply "constrain the application generating > unbound numbers of dentries to a memcg"? > > Then when the memcg runs out of memory, it will start reclaiming the > dentries that were allocated inside the memcg that are using all > it's resources, thereby preventing unbound growth of the dentry > cache. > > I mean, this sort of resource control is exactly what memcgs are > supposed to be used for and are already used for. I don't see why we > need all this complexity for global dentry resource management when > memcgs should already provide an effective means of managing and > placing bounds on the amount of memory any specific application can > use... Using memcg is one way to limit the damage. The argument that excessive negative dentries can push out existing memory objects that could be more useful if left alone still applies. Daemons that run in the root memcg have no limitation on how much memory they can use. There can also be memcgs with high memory limits and long running applications. memcg is certainly a useful tool in this regard, but it doesn't solve all the problems. Cheers, Longman
On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote: > > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library path > for the same libraries. The other canonical example is C compilers that need to search for header files along the include search path: % strace -o /tmp/st -f gcc -o /tmp/hello /tmp/hello.c -I.. -I../.. % grep open /tmp/st | grep stdio.h | grep ENOENT | wc -l 6 - Ted
On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > As there is no limit for negative dentries, it is possible that a sizeable > portion of system memory can be tied up in dentry cache slabs. Dentry slabs > are generally recalimable if the dentries are in the LRUs. Still having > too much memory used up by dentries can be problematic: > > 1) When a filesystem with too many negative dentries is being unmounted, > the process of draining the dentries associated with the filesystem > can take some time. To users, the system may seem to hang for > a while. The long wait may also cause unexpected timeout error or > other warnings. This can happen when a long running container with > many negative dentries is being destroyed, for instance. > > 2) Tying up too much memory in unused negative dentries means there > are less memory available for other use. Even though the kernel is > able to reclaim unused dentries when running out of free memory, > it will still introduce additional latency to the application > reducing its performance. There's a third problem, which is that having a lot of negative dentries can clog the hash chains. I tried to quantify this, and found a weird result: root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.402s user 0m4.361s sys 0m1.230s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.572s user 0m4.337s sys 0m1.407s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.607s user 0m4.522s sys 0m1.342s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.599s user 0m4.472s sys 0m1.369s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.574s user 0m4.498s sys 0m1.300s Pretty consistent system time, between about 1.3 and 1.4 seconds. root@bobo-kvm:~# grep dentry /proc/slabinfo dentry 20394 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.515s user 0m4.353s sys 0m1.359s At this point, we have 20k dentries allocated. Now, pollute the dcache with names that don't exist: root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null root@bobo-kvm:~# grep dentry /proc/slabinfo dentry 20605 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 Huh. We've kept the number of dentries pretty constant. Still, maybe the bad dentries have pushed out the good ones. root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.644s user 0m4.921s sys 0m1.946s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.676s user 0m5.004s sys 0m1.909s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.662s user 0m4.980s sys 0m1.916s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.714s user 0m4.973s sys 0m1.986s Well, we certainly made it suck. Up to a pretty consistent 1.9-2.0 seconds of kernel time, or 50% worse. We've also made user time worse, somehow. Anyhow, I should write a proper C program to measure this. But I thought I'd share this raw data with you now to demonstrate that dcache pollution is a real problem today, even on a machine with 2GB.
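A rough sketch of the "proper C program" mentioned at the end: pollute the dcache with names that don't exist, then time repeated lookups of a name that does. The counts and paths mirror the shell experiment above but are otherwise arbitrary.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct stat st;
	char name[64];
	double t;
	int i;

	/* Pollute: each failed lookup can leave a hashed negative dentry. */
	for (i = 0; i < 100000; i++) {
		snprintf(name, sizeof(name), "/dev/null%d", i);
		stat(name, &st);
	}

	/* Measure hot lookups of a name that exists in the same directory. */
	t = now();
	for (i = 0; i < 1000000; i++)
		stat("/dev/null", &st);
	printf("1M stat(\"/dev/null\"): %.3f s\n", now() - t);
	return 0;
}
```

Run it once with the pollution loop commented out and once as-is to compare hot-lookup times on a clean versus polluted dcache.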
On 15/03/2020 06.46, Matthew Wilcox wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> As there is no limit for negative dentries, it is possible that a sizeable >> portion of system memory can be tied up in dentry cache slabs. Dentry slabs >> are generally recalimable if the dentries are in the LRUs. Still having >> too much memory used up by dentries can be problematic: >> >> 1) When a filesystem with too many negative dentries is being unmounted, >> the process of draining the dentries associated with the filesystem >> can take some time. To users, the system may seem to hang for >> a while. The long wait may also cause unexpected timeout error or >> other warnings. This can happen when a long running container with >> many negative dentries is being destroyed, for instance. >> >> 2) Tying up too much memory in unused negative dentries means there >> are less memory available for other use. Even though the kernel is >> able to reclaim unused dentries when running out of free memory, >> it will still introduce additional latency to the application >> reducing its performance. > > There's a third problem, which is that having a lot of negative dentries > can clog the hash chains. I tried to quantify this, and found a weird result:

Yep. I've seen this in the wild. Server had too much unused memory.
https://lore.kernel.org/lkml/ff0993a2-9825-304c-6a5b-2e9d4b940032@yandex-team.ru/T/#u

---quote---
I've seen this problem on a large server where a horde of negative dentries slowed down all lookups significantly:

watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [atop:968884] at __d_lookup_rcu+0x6f/0x190

slabtop:
    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
85118166 85116916   0%    0.19K 2026623       42  16212984K dentry
16577106 16371723   0%    0.10K  425054       39   1700216K buffer_head
  935850   934379   0%    1.05K   31195       30    998240K ext4_inode_cache
  663740   654967   0%    0.57K   23705       28    379280K radix_tree_node
  399987   380055   0%    0.65K    8163       49    261216K proc_inode_cache
  226380   168813   0%    0.19K    5390       42     43120K cred_jar
   70345    65721   0%    0.58K    1279       55     40928K inode_cache
  105927    43314   0%    0.31K    2077       51     33232K filp
  630972   601503   0%    0.04K    6186      102     24744K ext4_extent_status
    5848     4269   0%    3.56K     731        8     23392K task_struct
   16224    11531   0%    1.00K     507       32     16224K kmalloc-1024
    6752     5833   0%    2.00K     422       16     13504K kmalloc-2048
  199680   158086   0%    0.06K    3120       64     12480K anon_vma_chain
  156128   154751   0%    0.07K    2788       56     11152K Acpi-Operand

Total RAM is 256 GB

These dentries came from temporary files created and deleted by postgres. But this could be easily reproduced by lookups of non-existent files. Of course, memory pressure easily washes them away.

A similar problem happened before around proc sysctl entries: https://lkml.org/lkml/2017/2/10/47 This one does not concentrate in one bucket and needs much more memory.

Looks like the dcache needs some kind of background shrinker, started when the dcache size or the fraction of negative dentries exceeds some threshold.
---end--- > > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.402s > user 0m4.361s > sys 0m1.230s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.572s > user 0m4.337s > sys 0m1.407s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.607s > user 0m4.522s > sys 0m1.342s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.599s > user 0m4.472s > sys 0m1.369s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.574s > user 0m4.498s > sys 0m1.300s > > Pretty consistent system time, between about 1.3 and 1.4 seconds. > > root@bobo-kvm:~# grep dentry /proc/slabinfo > dentry 20394 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.515s > user 0m4.353s > sys 0m1.359s > > At this point, we have 20k dentries allocated. > > Now, pollute the dcache with names that don't exist: > > root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null > root@bobo-kvm:~# grep dentry /proc/slabinfo > dentry 20605 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 > > Huh. We've kept the number of dentries pretty constant. Still, maybe the > bad dentries have pushed out the good ones. > > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.644s > user 0m4.921s > sys 0m1.946s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.676s > user 0m5.004s > sys 0m1.909s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.662s > user 0m4.980s > sys 0m1.916s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.714s > user 0m4.973s > sys 0m1.986s > > Well, we certainly made it suck. Up to a pretty consistent 1.9-2.0 seconds > of kernel time, or 50% worse. We've also made user time worse, somehow. > > Anyhow, I should write a proper C program to measure this. But I thought > I'd share this raw data with you now to demonstrate that dcache pollution > is a real problem today, even on a machine with 2GB. >
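A sketch of the background shrinker suggested at the end of the quoted report, kicking in once negative dentries exceed some fraction of the whole dcache; the 3/4 threshold, the batch size and the prune callback are all invented here.

```c
#include <stdbool.h>
#include <stdio.h>

#define NEG_NUM		3		/* start pruning above 3/4 negative */
#define NEG_DEN		4
#define PRUNE_BATCH	1024

static bool need_background_prune(unsigned long nr_dentry,
				  unsigned long nr_negative)
{
	return nr_negative * NEG_DEN > nr_dentry * NEG_NUM;
}

/* Run from a periodic worker; prune bounded batches so no single pass
 * stalls the way unmount-time draining of millions of dentries can. */
static unsigned long background_prune(unsigned long nr_dentry,
				      unsigned long nr_negative,
				      unsigned long (*prune)(unsigned long))
{
	unsigned long freed = 0;

	while (need_background_prune(nr_dentry - freed, nr_negative - freed)) {
		unsigned long n = prune(PRUNE_BATCH);

		if (!n)
			break;
		freed += n;
	}
	return freed;
}

static unsigned long fake_prune(unsigned long batch)
{
	return batch;			/* stand-in for real LRU pruning */
}

int main(void)
{
	/* Counts of the order seen in the slabtop output quoted above. */
	printf("would prune ~%lu dentries\n",
	       background_prune(85118166UL, 85000000UL, fake_prune));
	return 0;
}
```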