[1/4] fs/dcache: Limit numbers of negative dentries

Message ID 1500298773-7510-2-git-send-email-longman@redhat.com (mailing list archive)
State New, archived

Commit Message

Waiman Long July 17, 2017, 1:39 p.m. UTC
The number of positive dentries is limited by the number of files
in the filesystems. The number of negative dentries, however,
has no limit other than the total amount of memory available in
the system. So a rogue application that generates a lot of negative
dentries can potentially exhaust most of the memory available in
the system, impacting the performance of other running applications.

To prevent this from happening, the dcache code is now updated to
limit the number of negative dentries that can be kept in the LRU
lists, as a percentage of total available system memory. The default
is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
command line option.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   7 ++
 fs/dcache.c                                     | 154 +++++++++++++++++++++++-
 include/linux/dcache.h                          |   1 +
 3 files changed, 161 insertions(+), 1 deletion(-)
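
To put the 5% default in perspective, here is the limit arithmetic of
neg_dentry_init() (see the patch below) worked through with concrete
numbers; this is an illustration only, assuming 4 KB pages, a struct
dentry of roughly 192 bytes on x86-64, 16 GB of RAM and 8 possible
CPUs:

/*
 * Worked example (illustrative numbers, not from the patch):
 *
 *   nr_dentry_page     = 4096 / 192 - 1          ~  20 dentries/page
 *   totalram_pages     = 16 GB / 4 KB            =  4,194,304 pages
 *   total limit (5%)   = 4194304 * 20 * 5 / 100  ~  4.2M dentries
 *   ndblk.nfree (20%)  = 4194304 * 20 * 5 / 500  ~  839K dentries
 *   percpu limit (80%) = nfree * 4 / 8 CPUs      ~  419K per CPU
 */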

Comments

Matthew Wilcox July 17, 2017, 5:49 p.m. UTC | #1
On Mon, Jul 17, 2017 at 09:39:30AM -0400, Waiman Long wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however,
> has no limit other than the total amount of memory available in
> the system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in
> the system, impacting the performance of other running applications.
> 
> To prevent this from happening, the dcache code is now updated to
> limit the number of negative dentries that can be kept in the LRU
> lists, as a percentage of total available system memory. The default
> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
> command line option.

I see the problem, but rather than restricting the number of negative
dentries to be a fraction of the total amount of memory in the machine,
wouldn't it make more sense to limit the number of negative dentries to be
some multiple of the number of positive dentries currently in the system?

Or make negative dentries more easily prunable.  For example, we could
allocate them from a separate slab and use the existing reclaim mechanism
to just throw them away.  Since they can't be pinned by an inode, they're
much easier to get rid of than positive dentries.  Might make changing
a dentry from positive to negative or vice versa a bit more expensive ...
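
For a rough illustration of the separate-slab idea (a sketch only;
neg_dentry_cache and its init helper are hypothetical names, not code
from this series), the cache could be created with SLAB_RECLAIM_ACCOUNT
so that existing slab reclaim can target it:

static struct kmem_cache *neg_dentry_cache;	/* hypothetical */

static void __init neg_dentry_cache_init(void)
{
	/* a reclaimable cache dedicated to negative dentries */
	neg_dentry_cache = kmem_cache_create("neg_dentry",
			sizeof(struct dentry), __alignof__(struct dentry),
			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
			NULL);
}

Turning such a dentry positive would then mean allocating a replacement
from the regular dentry_cache and copying the fields over, which is the
extra conversion cost mentioned above.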
Waiman Long July 17, 2017, 6:31 p.m. UTC | #2
On 07/17/2017 01:49 PM, Matthew Wilcox wrote:
> On Mon, Jul 17, 2017 at 09:39:30AM -0400, Waiman Long wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however,
>> has no limit other than the total amount of memory available in
>> the system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in
>> the system, impacting the performance of other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to
>> limit the number of negative dentries that can be kept in the LRU
>> lists, as a percentage of total available system memory. The default
>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>> command line option.
> I see the problem, but rather than restricting the number of negative
> dentries to be a fraction of the total amount of memory in the machine,
> wouldn't it make more sense to limit the number of negative dentries to be
> some multiple of the number of positive dentries currently in the system?

The number of positive dentries will be rapidly changing, so we
can't use a __read_mostly variable for the limits. That may have a
certain performance impact. I chose to use a fixed number because of
simplicity and performance. I can compromise on simplicity, but not
on performance. I am open to maybe adjusting the free pool count in
some ways as long as the performance impact is negligible.
 
> Or make negative dentries more easily prunable.  For example, we could
> allocate them from a separate slab and use the existing reclaim mechanism
> to just throw them away.  Since they can't be pinned by an inode, they're
> much easier to get rid of than positive dentries.  Might make changing
> a dentry from positive to negative or vice versa a bit more expensive ...

I don't quite understand what you mean by having two separate slabs. The
current reclaim mechanism is through scanning the LRU lists.
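
(For reference, the LRU-scan reclaim referred to here is essentially
the following in fs/dcache.c of that era: the shrinker walks each
super_block's dentry LRU and disposes of whatever dentry_lru_isolate
picks, positive or negative alike.)

long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
{
	LIST_HEAD(dispose);
	long freed;

	/* isolate up to sc->nr_to_scan dentries onto a dispose list */
	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
				     dentry_lru_isolate, &dispose);
	shrink_dentry_list(&dispose);
	return freed;
}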

I had been thinking about having a separate LRU list for negative
dentries. Given the complexity of the current per-node/per-memcg LRU
list, maintaining 2 separate LRU lists in each super_block may be
error-prone.

It is true that positive dentries will also be pruned in the process.
By the time automatic pruning happens, there should already be a lot
of negative dentries in the LRU lists. We can skip over positive
dentries in the scanning, but we then have to either allow scanning
more entries in each pass, prolonging the interruption, or do no
pruning at all if the LRU lists are front-loaded with a bunch of
positive dentries.

BTW, you reminded me that I should have accounted for
positive-to-negative dentry transitions, which are missing in the
current patch.

Cheers,
Longman
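
For reference, the missing accounting would be a hook along these
lines at the point where a positive dentry turns negative (a sketch;
placing it in dentry_unlink_inode() is an assumption, not part of the
posted patch):

	/*
	 * Sketch: the dentry has just lost its inode while sitting on
	 * the LRU list, so count it as negative now (the mirror image
	 * of the __d_instantiate() hunk in this patch).
	 */
	if (dentry->d_flags & DCACHE_LRU_LIST)
		__neg_dentry_inc(dentry);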
Miklos Szeredi July 19, 2017, 2:39 p.m. UTC | #3
On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however,
> has no limit other than the total amount of memory available in
> the system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in
> the system, impacting the performance of other running applications.
>
> To prevent this from happening, the dcache code is now updated to
> limit the number of negative dentries that can be kept in the LRU
> lists, as a percentage of total available system memory. The default
> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
> command line option.

AFAICS the implementation is counter to the concept of LRU, since it
will get rid of the most recently used negative dentry after passing
the limit.  That is in itself a source of DoS (keep rogue negative
dentries at just about the limit, so normal applications are prevented
from getting their negatives cached).

Thanks,
Miklos

>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   7 ++
>  fs/dcache.c                                     | 154 +++++++++++++++++++++++-
>  include/linux/dcache.h                          |   1 +
>  3 files changed, 161 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f701430..fc3c937 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2372,6 +2372,13 @@
>
>         n2=             [NET] SDL Inc. RISCom/N2 synchronous serial card
>
> +       neg_dentry_pc=  [KNL]
> +                       Range: 1-50
> +                       Default: 5
> +                       This parameter specifies the amount of negative
> +                       dentries allowed in the system as a percentage of
> +                       total system memory.
> +
>         netdev=         [NET] Network devices parameters
>                         Format: <irq>,<io>,<mem_start>,<mem_end>,<name>
>                         Note that mem_start is often overloaded to mean
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f901413..6a0a844 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -130,8 +130,19 @@ struct dentry_stat_t dentry_stat = {
>         .age_limit = 45,
>  };
>
> +/*
> + * Macros and variables to manage and count negative dentries.
> + */
> +#define NEG_DENTRY_BATCH       (1 << 8)
> +static long neg_dentry_percpu_limit __read_mostly;
> +static struct {
> +       raw_spinlock_t nfree_lock;
> +       long nfree;                     /* Negative dentry free pool */
> +} ndblk ____cacheline_aligned_in_smp;
> +
>  static DEFINE_PER_CPU(long, nr_dentry);
>  static DEFINE_PER_CPU(long, nr_dentry_unused);
> +static DEFINE_PER_CPU(long, nr_dentry_neg);
>
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
>
> @@ -227,6 +238,86 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
>
>  #endif
>
> +/*
> + * There is a system-wide limit to the amount of negative dentries allowed
> + * in the super blocks' LRU lists. The default limit is 5% of the total
> + * system memory. This limit can be changed by using the kernel command line
> + * option "neg_dentry_pc=" to specify the percentage of the total memory
> + * that can be used for negative dentries. That percentage must be in the
> + * 1-50% range.
> + *
> + * To avoid performance problem with a global counter on an SMP system,
> + * the tracking is done mostly on a per-cpu basis. The total limit is
> + * distributed in a 80/20 ratio to per-cpu counters and a global free pool.
> + *
> + * If a per-cpu counter runs out of negative dentries, it can borrow extra
> + * ones from the global free pool. If it has more than its percpu limit,
> + * the extra ones will be returned back to the global pool.
> + */
> +
> +/*
> + * Decrement negative dentry count if applicable.
> + */
> +static void __neg_dentry_dec(struct dentry *dentry)
> +{
> +       if (unlikely(this_cpu_dec_return(nr_dentry_neg) < 0)) {
> +               long *pcnt = get_cpu_ptr(&nr_dentry_neg);
> +
> +               if ((*pcnt < 0) && raw_spin_trylock(&ndblk.nfree_lock)) {
> +                       ACCESS_ONCE(ndblk.nfree) += NEG_DENTRY_BATCH;
> +                       *pcnt += NEG_DENTRY_BATCH;
> +                       raw_spin_unlock(&ndblk.nfree_lock);
> +               }
> +               put_cpu_ptr(&nr_dentry_neg);
> +       }
> +}
> +
> +static inline void neg_dentry_dec(struct dentry *dentry)
> +{
> +       if (unlikely(d_is_negative(dentry)))
> +               __neg_dentry_dec(dentry);
> +}
> +
> +/*
> + * Increment negative dentry count if applicable.
> + */
> +static void __neg_dentry_inc(struct dentry *dentry)
> +{
> +       long cnt, *pcnt;
> +
> +       if (this_cpu_inc_return(nr_dentry_neg) <= neg_dentry_percpu_limit)
> +               return;
> +
> +       pcnt = get_cpu_ptr(&nr_dentry_neg);
> +       cnt  = (READ_ONCE(ndblk.nfree) &&
> +              (*pcnt > neg_dentry_percpu_limit)) ? NEG_DENTRY_BATCH : 0;
> +
> +       if (cnt && raw_spin_trylock(&ndblk.nfree_lock)) {
> +               long val = READ_ONCE(ndblk.nfree);
> +
> +               if (val < cnt)
> +                       cnt = val;
> +               ACCESS_ONCE(ndblk.nfree) -= cnt;
> +               *pcnt -= cnt;
> +               raw_spin_unlock(&ndblk.nfree_lock);
> +       } else {
> +               cnt = 0;
> +       }
> +       put_cpu_ptr(&nr_dentry_neg);
> +       /*
> +        * If there are too many negative dentries, set DCACHE_KILL_NEGATIVE
> +        * flag to indicate that the dentry should be killed.
> +        */
> +       if (!cnt)
> +               dentry->d_flags |= DCACHE_KILL_NEGATIVE;
> +}
> +
> +static inline void neg_dentry_inc(struct dentry *dentry)
> +{
> +       if (unlikely(d_is_negative(dentry)))
> +               __neg_dentry_inc(dentry);
> +}
> +
>  static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *ct, unsigned tcount)
>  {
>         /*
> @@ -396,6 +487,7 @@ static void d_lru_add(struct dentry *dentry)
>         dentry->d_flags |= DCACHE_LRU_LIST;
>         this_cpu_inc(nr_dentry_unused);
>         WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
> +       neg_dentry_inc(dentry);
>  }
>
>  static void d_lru_del(struct dentry *dentry)
> @@ -404,6 +496,7 @@ static void d_lru_del(struct dentry *dentry)
>         dentry->d_flags &= ~DCACHE_LRU_LIST;
>         this_cpu_dec(nr_dentry_unused);
>         WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
> +       neg_dentry_dec(dentry);
>  }
>
>  static void d_shrink_del(struct dentry *dentry)
> @@ -434,6 +527,7 @@ static void d_lru_isolate(struct list_lru_one *lru, struct dentry *dentry)
>         dentry->d_flags &= ~DCACHE_LRU_LIST;
>         this_cpu_dec(nr_dentry_unused);
>         list_lru_isolate(lru, &dentry->d_lru);
> +       neg_dentry_dec(dentry);
>  }
>
>  static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
> @@ -442,6 +536,7 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
>         D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
>         dentry->d_flags |= DCACHE_SHRINK_LIST;
>         list_lru_isolate_move(lru, &dentry->d_lru, list);
> +       neg_dentry_dec(dentry);
>  }
>
>  /*
> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>
>         if (!IS_ROOT(dentry)) {
>                 parent = dentry->d_parent;
> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
> +               /*
> +                * Force the killing of this negative dentry when
> +                * DCACHE_KILL_NEGATIVE flag is set.
> +                */
> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
> +                       spin_lock(&parent->d_lock);
> +               } else if (unlikely(!spin_trylock(&parent->d_lock))) {
>                         if (inode)
>                                 spin_unlock(&inode->i_lock);
>                         goto failed;
> @@ -815,6 +916,9 @@ void dput(struct dentry *dentry)
>
>         dentry_lru_add(dentry);
>
> +       if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE))
> +               goto kill_it;
> +
>         dentry->d_lockref.count--;
>         spin_unlock(&dentry->d_lock);
>         return;
> @@ -1820,6 +1924,11 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
>         WARN_ON(d_in_lookup(dentry));
>
>         spin_lock(&dentry->d_lock);
> +       /*
> +        * Decrement negative dentry count if it was in the LRU list.
> +        */
> +       if (dentry->d_flags & DCACHE_LRU_LIST)
> +               neg_dentry_dec(dentry);
>         hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
>         raw_write_seqcount_begin(&dentry->d_seq);
>         __d_set_inode_and_type(dentry, inode, add_flags);
> @@ -3566,6 +3675,47 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
>  }
>  EXPORT_SYMBOL(d_tmpfile);
>
> +static long neg_dentry_pc __initdata = 5;
> +static bool neg_dentry_warn __initdata;
> +static int __init set_neg_dentry_pc(char *str)
> +{
> +       ssize_t ret;
> +       long new_pc = neg_dentry_pc;
> +
> +       if (!str)
> +               return 0;
> +       ret = kstrtol(str, 0, &new_pc);
> +       if (ret || (new_pc < 1) || (new_pc > 50))
> +               ret = 1;
> +       else
> +               neg_dentry_pc = new_pc;
> +       if (ret)
> +               neg_dentry_warn = true;
> +       return ret ? 0 : 1;
> +}
> +__setup("neg_dentry_pc=", set_neg_dentry_pc);
> +
> +static void __init neg_dentry_init(void)
> +{
> +       /* Rough estimate of # of dentries allocated per page */
> +       unsigned int nr_dentry_page = PAGE_SIZE/sizeof(struct dentry) - 1;
> +       unsigned long cnt;
> +
> +       raw_spin_lock_init(&ndblk.nfree_lock);
> +
> +       /* 20% in global pool & 80% in percpu free */
> +       ndblk.nfree = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
> +       cnt = ndblk.nfree * 4 / num_possible_cpus();
> +       if (unlikely(cnt < 2 * NEG_DENTRY_BATCH))
> +               cnt = 2 * NEG_DENTRY_BATCH;
> +       neg_dentry_percpu_limit = cnt;
> +
> +       if (neg_dentry_warn)
> +               pr_warn("Warning: neg_dentry_pc must be within 1-50 range.\n");
> +       pr_info("Negative dentry: percpu limit = %ld, free pool = %ld\n",
> +               neg_dentry_percpu_limit, ndblk.nfree);
> +}
> +
>  static __initdata unsigned long dhash_entries;
>  static int __init set_dhash_entries(char *str)
>  {
> @@ -3606,6 +3756,8 @@ static void __init dcache_init(void)
>         dentry_cache = KMEM_CACHE(dentry,
>                 SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT);
>
> +       neg_dentry_init();
> +
>         /* Hash may have been set up in dcache_init_early */
>         if (!hashdist)
>                 return;
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 3f3ff4c..498233b 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -218,6 +218,7 @@ struct dentry_operations {
>
>  #define DCACHE_PAR_LOOKUP              0x10000000 /* being looked up (with parent locked shared) */
>  #define DCACHE_DENTRY_CURSOR           0x20000000
> +#define DCACHE_KILL_NEGATIVE           0x40000000 /* Kill negative dentry */
>
>  extern seqlock_t rename_lock;
>
> --
> 1.8.3.1
>
Waiman Long July 19, 2017, 3:02 p.m. UTC | #4
On 07/19/2017 10:39 AM, Miklos Szeredi wrote:
> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however,
>> has no limit other than the total amount of memory available in
>> the system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in
>> the system, impacting the performance of other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to
>> limit the number of negative dentries that can be kept in the LRU
>> lists, as a percentage of total available system memory. The default
>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>> command line option.
> AFAICS the implementation is counter to the concept of LRU, since it
> will get rid of the most recently used negative dentry after passing
> the limit.  That is in itself a source of DoS (keep rogue negative
> dentries at just about the limit, so normal applications are prevented
> from getting their negatives cached).
>
> Thanks,
> Miklos

Yes, you are right. That is exactly the problem with patch 1 alone.
That is why I have patches 3 & 4 to enable automatic trimming to
decrease the number of negative dentries before the limit is reached,
assuming the rate of increase of negative dentries isn't faster than
the reduction rate of the automatic trimming process.

Cheers,
Longman
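
In rough outline, the automatic trimming of patches 3 & 4 is a
self-rearming delayed work item; the sketch below uses illustrative
names (prune_some_negative_dentries(), neg_dentry_above_watermark())
and is not the actual patch code:

static void prune_negative_dentries(struct work_struct *work);
static DECLARE_DELAYED_WORK(prune_neg_dentry_work, prune_negative_dentries);

static void prune_negative_dentries(struct work_struct *work)
{
	/* walk the super_block LRU lists, disposing of negative dentries */
	prune_some_negative_dentries();

	/* re-arm while the count is still above the pruning watermark */
	if (neg_dentry_above_watermark())
		schedule_delayed_work(&prune_neg_dentry_work, HZ);
}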
Miklos Szeredi July 19, 2017, 8:24 p.m. UTC | #5
On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however,
> has no limit other than the total amount of memory available in
> the system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in
> the system, impacting the performance of other running applications.
>
> To prevent this from happening, the dcache code is now updated to
> limit the number of negative dentries that can be kept in the LRU
> lists, as a percentage of total available system memory. The default
> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
> command line option.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---

[...]

> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>
>         if (!IS_ROOT(dentry)) {
>                 parent = dentry->d_parent;
> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
> +               /*
> +                * Force the killing of this negative dentry when
> +                * DCACHE_KILL_NEGATIVE flag is set.
> +                */
> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
> +                       spin_lock(&parent->d_lock);

This looks like d_lock ordering problem (should be parent first, child
second).  Why is this needed, anyway?

> +               } else if (unlikely(!spin_trylock(&parent->d_lock))) {
>                         if (inode)
>                                 spin_unlock(&inode->i_lock);
>                         goto failed;

Thanks,
Miklos
Waiman Long July 19, 2017, 8:42 p.m. UTC | #6
On 07/19/2017 04:24 PM, Miklos Szeredi wrote:
> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however,
>> has no limit other than the total amount of memory available in
>> the system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in
>> the system, impacting the performance of other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to
>> limit the number of negative dentries that can be kept in the LRU
>> lists, as a percentage of total available system memory. The default
>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>> command line option.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
> [...]
>
>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>
>>         if (!IS_ROOT(dentry)) {
>>                 parent = dentry->d_parent;
>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>> +               /*
>> +                * Force the killing of this negative dentry when
>> +                * DCACHE_KILL_NEGATIVE flag is set.
>> +                */
>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>> +                       spin_lock(&parent->d_lock);
> This looks like d_lock ordering problem (should be parent first, child
> second).  Why is this needed, anyway?
>

Yes, that is a bug. I should have used lock_parent() instead.

I have a test program that generates a lot of negative dentries
continuously. With spin_trylock(), the lock attempt failed most of
the time while that test program was running. So I need to actually
acquire the parent's d_lock to make sure that the offending negative
dentry was really killed. It was there to protect against the
worst-case situation. I will update the patch to correct that.

Thanks for spotting this.

Cheers,
Longman
Miklos Szeredi July 20, 2017, 7:20 a.m. UTC | #7
On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
> On 07/19/2017 04:24 PM, Miklos Szeredi wrote:
>> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
>>> The number of positive dentries is limited by the number of files
>>> in the filesystems. The number of negative dentries, however,
>>> has no limit other than the total amount of memory available in
>>> the system. So a rogue application that generates a lot of negative
>>> dentries can potentially exhaust most of the memory available in
>>> the system, impacting the performance of other running applications.
>>>
>>> To prevent this from happening, the dcache code is now updated to
>>> limit the number of negative dentries that can be kept in the LRU
>>> lists, as a percentage of total available system memory. The default
>>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>>> command line option.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>> [...]
>>
>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>
>>>         if (!IS_ROOT(dentry)) {
>>>                 parent = dentry->d_parent;
>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>> +               /*
>>> +                * Force the killing of this negative dentry when
>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>> +                */
>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>> +                       spin_lock(&parent->d_lock);
>> This looks like d_lock ordering problem (should be parent first, child
>> second).  Why is this needed, anyway?
>>
>
> Yes, that is a bug. I should have used lock_parent() instead.

lock_parent() can release dentry->d_lock, which means it's perfectly
useless for this.

I still feel force-freeing is wrong here.  Why not just block until
the number of negatives goes below the limit (start reclaim if not
already doing so, etc.)?

Thanks,
Miklos
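
The blocking alternative suggested here could look roughly like the
following (a sketch under stated assumptions: neg_dentry_over_limit()
and neg_dentry_start_reclaim() are hypothetical helpers). The
wait_event() is exactly the kind of unpredictable lookup latency
Waiman objects to below:

static DECLARE_WAIT_QUEUE_HEAD(neg_dentry_reclaim_wq);

static void neg_dentry_throttle(void)
{
	if (!neg_dentry_over_limit())
		return;
	/* kick asynchronous pruning if it is not already running */
	neg_dentry_start_reclaim();
	/* throttle the producer until reclaim catches up */
	wait_event(neg_dentry_reclaim_wq, !neg_dentry_over_limit());
}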
Waiman Long July 20, 2017, 2:21 p.m. UTC | #8
On 07/20/2017 03:20 AM, Miklos Szeredi wrote:
> On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
>>
>>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>>
>>>>         if (!IS_ROOT(dentry)) {
>>>>                 parent = dentry->d_parent;
>>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>>> +               /*
>>>> +                * Force the killing of this negative dentry when
>>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>>> +                */
>>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>>> +                       spin_lock(&parent->d_lock);
>>> This looks like d_lock ordering problem (should be parent first, child
>>> second).  Why is this needed, anyway?
>>>
>> Yes, that is a bug. I should have used lock_parent() instead.
> lock_parent() can release dentry->d_lock, which means it's perfectly
> useless for this.

As the reference count is kept at 1 in dentry_kill(), the dentry won't
go away even if the dentry lock is temporarily released.

> I still feel force-freeing is wrong here.  Why not just block until
> the number of negatives goes below the limit (start reclaim if not
> already doing so, etc.)?

Force-freeing is the simplest. Any other way would require adding
more code and increasing code complexity.

One reason why I prefer this is to avoid adding unpredictable latency
to the regular directory lookup and other dentry-related operations.
We can always change the code later on if there is a better way of
doing it.

Cheers,
Longman
Miklos Szeredi July 20, 2017, 3:08 p.m. UTC | #9
On Thu, Jul 20, 2017 at 4:21 PM, Waiman Long <longman@redhat.com> wrote:
> On 07/20/2017 03:20 AM, Miklos Szeredi wrote:
>> On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
>>>
>>>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>>>
>>>>>         if (!IS_ROOT(dentry)) {
>>>>>                 parent = dentry->d_parent;
>>>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>>>> +               /*
>>>>> +                * Force the killing of this negative dentry when
>>>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>>>> +                */
>>>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>>>> +                       spin_lock(&parent->d_lock);
>>>> This looks like d_lock ordering problem (should be parent first, child
>>>> second).  Why is this needed, anyway?
>>>>
>>> Yes, that is a bug. I should have used lock_parent() instead.
>> lock_parent() can release dentry->d_lock, which means it's perfectly
>> useless for this.
>
> As the reference count is kept at 1 in dentry_kill(), the dentry won't
> go away even if the dentry lock is temporarily released.

It won't go away, but anything else might happen to it (ref grabbed by
somebody else, instantiated, etc).  Don't see how it's going to be
better than the existing trylock.

Thanks,
Miklos
Waiman Long July 20, 2017, 3:46 p.m. UTC | #10
On 07/20/2017 11:08 AM, Miklos Szeredi wrote:
> On Thu, Jul 20, 2017 at 4:21 PM, Waiman Long <longman@redhat.com> wrote:
>> On 07/20/2017 03:20 AM, Miklos Szeredi wrote:
>>> On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
>>>>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>>>>
>>>>>>         if (!IS_ROOT(dentry)) {
>>>>>>                 parent = dentry->d_parent;
>>>>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>>>>> +               /*
>>>>>> +                * Force the killing of this negative dentry when
>>>>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>>>>> +                */
>>>>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>>>>> +                       spin_lock(&parent->d_lock);
>>>>> This looks like d_lock ordering problem (should be parent first, child
>>>>> second).  Why is this needed, anyway?
>>>>>
>>>> Yes, that is a bug. I should have used lock_parent() instead.
>>> lock_parent() can release dentry->d_lock, which means it's perfectly
>>> useless for this.
>> As the reference count is kept at 1 in dentry_kill(), the dentry won't
>> go away even if the dentry lock is temporarily released.
> It won't go away, but anything else might happen to it (ref grabbed by
> somebody else, instantiated, etc).  Don't see how it's going to be
> better than the existing trylock.
>
> Thanks,
> Miklos

In the unlikely event that the reference count or the d_flags changes,
we can abort the killing.

Cheers,
Longman
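
Concretely, the abort-on-change recheck described above could be
sketched as follows inside dentry_kill(), replacing the unconditional
spin_lock() (illustrative only; the updated patch is not part of this
thread):

	parent = lock_parent(dentry);	/* may drop/retake dentry->d_lock */

	/*
	 * While d_lock was released, somebody else may have grabbed a
	 * reference or instantiated the dentry.  Recheck and abort the
	 * forced kill if anything changed.
	 */
	if (dentry->d_lockref.count != 1 ||
	    !(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
		if (parent)
			spin_unlock(&parent->d_lock);
		if (inode)
			spin_unlock(&inode->i_lock);
		goto failed;	/* let dput() retry normally */
	}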

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f701430..fc3c937 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2372,6 +2372,13 @@ 
 
 	n2=		[NET] SDL Inc. RISCom/N2 synchronous serial card
 
+	neg_dentry_pc=	[KNL]
+			Range: 1-50
+			Default: 5
+			This parameter specifies the amount of negative
+			dentries allowed in the system as a percentage of
+			total system memory.
+
 	netdev=		[NET] Network devices parameters
 			Format: <irq>,<io>,<mem_start>,<mem_end>,<name>
 			Note that mem_start is often overloaded to mean
diff --git a/fs/dcache.c b/fs/dcache.c
index f901413..6a0a844 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -130,8 +130,19 @@  struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+/*
+ * Macros and variables to manage and count negative dentries.
+ */
+#define NEG_DENTRY_BATCH	(1 << 8)
+static long neg_dentry_percpu_limit __read_mostly;
+static struct {
+	raw_spinlock_t nfree_lock;
+	long nfree;			/* Negative dentry free pool */
+} ndblk ____cacheline_aligned_in_smp;
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
+static DEFINE_PER_CPU(long, nr_dentry_neg);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 
@@ -227,6 +238,86 @@  static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 
 #endif
 
+/*
+ * There is a system-wide limit to the amount of negative dentries allowed
+ * in the super blocks' LRU lists. The default limit is 5% of the total
+ * system memory. This limit can be changed by using the kernel command line
+ * option "neg_dentry_pc=" to specify the percentage of the total memory
+ * that can be used for negative dentries. That percentage must be in the
+ * 1-50% range.
+ *
+ * To avoid performance problem with a global counter on an SMP system,
+ * the tracking is done mostly on a per-cpu basis. The total limit is
+ * distributed in a 80/20 ratio to per-cpu counters and a global free pool.
+ *
+ * If a per-cpu counter runs out of negative dentries, it can borrow extra
+ * ones from the global free pool. If it has more than its percpu limit,
+ * the extra ones will be returned back to the global pool.
+ */
+
+/*
+ * Decrement negative dentry count if applicable.
+ */
+static void __neg_dentry_dec(struct dentry *dentry)
+{
+	if (unlikely(this_cpu_dec_return(nr_dentry_neg) < 0)) {
+		long *pcnt = get_cpu_ptr(&nr_dentry_neg);
+
+		if ((*pcnt < 0) && raw_spin_trylock(&ndblk.nfree_lock)) {
+			ACCESS_ONCE(ndblk.nfree) += NEG_DENTRY_BATCH;
+			*pcnt += NEG_DENTRY_BATCH;
+			raw_spin_unlock(&ndblk.nfree_lock);
+		}
+		put_cpu_ptr(&nr_dentry_neg);
+	}
+}
+
+static inline void neg_dentry_dec(struct dentry *dentry)
+{
+	if (unlikely(d_is_negative(dentry)))
+		__neg_dentry_dec(dentry);
+}
+
+/*
+ * Increment negative dentry count if applicable.
+ */
+static void __neg_dentry_inc(struct dentry *dentry)
+{
+	long cnt, *pcnt;
+
+	if (this_cpu_inc_return(nr_dentry_neg) <= neg_dentry_percpu_limit)
+		return;
+
+	pcnt = get_cpu_ptr(&nr_dentry_neg);
+	cnt  = (READ_ONCE(ndblk.nfree) &&
+	       (*pcnt > neg_dentry_percpu_limit)) ? NEG_DENTRY_BATCH : 0;
+
+	if (cnt && raw_spin_trylock(&ndblk.nfree_lock)) {
+		long val = READ_ONCE(ndblk.nfree);
+
+		if (val < cnt)
+			cnt = val;
+		ACCESS_ONCE(ndblk.nfree) -= cnt;
+		*pcnt -= cnt;
+		raw_spin_unlock(&ndblk.nfree_lock);
+	} else {
+		cnt = 0;
+	}
+	put_cpu_ptr(&nr_dentry_neg);
+	/*
+	 * If there are too many negative dentries, set DCACHE_KILL_NEGATIVE
+	 * flag to indicate that the dentry should be killed.
+	 */
+	if (!cnt)
+		dentry->d_flags |= DCACHE_KILL_NEGATIVE;
+}
+
+static inline void neg_dentry_inc(struct dentry *dentry)
+{
+	if (unlikely(d_is_negative(dentry)))
+		__neg_dentry_inc(dentry);
+}
+
 static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *ct, unsigned tcount)
 {
 	/*
@@ -396,6 +487,7 @@  static void d_lru_add(struct dentry *dentry)
 	dentry->d_flags |= DCACHE_LRU_LIST;
 	this_cpu_inc(nr_dentry_unused);
 	WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	neg_dentry_inc(dentry);
 }
 
 static void d_lru_del(struct dentry *dentry)
@@ -404,6 +496,7 @@  static void d_lru_del(struct dentry *dentry)
 	dentry->d_flags &= ~DCACHE_LRU_LIST;
 	this_cpu_dec(nr_dentry_unused);
 	WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	neg_dentry_dec(dentry);
 }
 
 static void d_shrink_del(struct dentry *dentry)
@@ -434,6 +527,7 @@  static void d_lru_isolate(struct list_lru_one *lru, struct dentry *dentry)
 	dentry->d_flags &= ~DCACHE_LRU_LIST;
 	this_cpu_dec(nr_dentry_unused);
 	list_lru_isolate(lru, &dentry->d_lru);
+	neg_dentry_dec(dentry);
 }
 
 static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
@@ -442,6 +536,7 @@  static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
 	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
 	dentry->d_flags |= DCACHE_SHRINK_LIST;
 	list_lru_isolate_move(lru, &dentry->d_lru, list);
+	neg_dentry_dec(dentry);
 }
 
 /*
@@ -603,7 +698,13 @@  static struct dentry *dentry_kill(struct dentry *dentry)
 
 	if (!IS_ROOT(dentry)) {
 		parent = dentry->d_parent;
-		if (unlikely(!spin_trylock(&parent->d_lock))) {
+		/*
+		 * Force the killing of this negative dentry when
+		 * DCACHE_KILL_NEGATIVE flag is set.
+		 */
+		if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
+			spin_lock(&parent->d_lock);
+		} else if (unlikely(!spin_trylock(&parent->d_lock))) {
 			if (inode)
 				spin_unlock(&inode->i_lock);
 			goto failed;
@@ -815,6 +916,9 @@  void dput(struct dentry *dentry)
 
 	dentry_lru_add(dentry);
 
+	if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE))
+		goto kill_it;
+
 	dentry->d_lockref.count--;
 	spin_unlock(&dentry->d_lock);
 	return;
@@ -1820,6 +1924,11 @@  static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 	WARN_ON(d_in_lookup(dentry));
 
 	spin_lock(&dentry->d_lock);
+	/*
+	 * Decrement negative dentry count if it was in the LRU list.
+	 */
+	if (dentry->d_flags & DCACHE_LRU_LIST)
+		neg_dentry_dec(dentry);
 	hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
 	raw_write_seqcount_begin(&dentry->d_seq);
 	__d_set_inode_and_type(dentry, inode, add_flags);
@@ -3566,6 +3675,47 @@  void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+static long neg_dentry_pc __initdata = 5;
+static bool neg_dentry_warn __initdata;
+static int __init set_neg_dentry_pc(char *str)
+{
+	ssize_t ret;
+	long new_pc = neg_dentry_pc;
+
+	if (!str)
+		return 0;
+	ret = kstrtol(str, 0, &new_pc);
+	if (ret || (new_pc < 1) || (new_pc > 50))
+		ret = 1;
+	else
+		neg_dentry_pc = new_pc;
+	if (ret)
+		neg_dentry_warn = true;
+	return ret ? 0 : 1;
+}
+__setup("neg_dentry_pc=", set_neg_dentry_pc);
+
+static void __init neg_dentry_init(void)
+{
+	/* Rough estimate of # of dentries allocated per page */
+	unsigned int nr_dentry_page = PAGE_SIZE/sizeof(struct dentry) - 1;
+	unsigned long cnt;
+
+	raw_spin_lock_init(&ndblk.nfree_lock);
+
+	/* 20% in global pool & 80% in percpu free */
+	ndblk.nfree = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
+	cnt = ndblk.nfree * 4 / num_possible_cpus();
+	if (unlikely(cnt < 2 * NEG_DENTRY_BATCH))
+		cnt = 2 * NEG_DENTRY_BATCH;
+	neg_dentry_percpu_limit = cnt;
+
+	if (neg_dentry_warn)
+		pr_warn("Warning: neg_dentry_pc must be within 1-50 range.\n");
+	pr_info("Negative dentry: percpu limit = %ld, free pool = %ld\n",
+		neg_dentry_percpu_limit, ndblk.nfree);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3606,6 +3756,8 @@  static void __init dcache_init(void)
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT);
 
+	neg_dentry_init();
+
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
 		return;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3f3ff4c..498233b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -218,6 +218,7 @@  struct dentry_operations {
 
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
+#define DCACHE_KILL_NEGATIVE		0x40000000 /* Kill negative dentry */
 
 extern seqlock_t rename_lock;