Message ID | 1500298773-7510-2-git-send-email-longman@redhat.com (mailing list archive)
---|---
State | New, archived
On Mon, Jul 17, 2017 at 09:39:30AM -0400, Waiman Long wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however, has
> no limit other than the total amount of memory available in the
> system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in
> the system, impacting the performance of other running applications.
>
> To prevent this from happening, the dcache code is now updated to
> limit the number of negative dentries that can be kept in the LRU
> lists, as a percentage of total available system memory. The default
> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
> command line option.

I see the problem, but rather than restricting the number of negative
dentries to be a fraction of the total amount of memory in the machine,
wouldn't it make more sense to limit the number of negative dentries to
be some multiple of the number of positive dentries currently in the
system?

Or make negative dentries more easily prunable. For example, we could
allocate them from a separate slab and use the existing reclaim mechanism
to just throw them away. Since they can't be pinned by an inode, they're
much easier to get rid of than positive dentries. Might make changing
a dentry from positive to negative or vice versa a bit more expensive ...
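[A minimal sketch of the separate-slab idea suggested above — purely
illustrative, not code from any posted patch; neg_dentry_cache and
neg_dentry_cache_init() are hypothetical names:]

	/*
	 * A second slab dedicated to negative dentries, so a shrinker
	 * could reclaim them independently of positive dentries.
	 */
	static struct kmem_cache *neg_dentry_cache __read_mostly;

	static void __init neg_dentry_cache_init(void)
	{
		neg_dentry_cache = kmem_cache_create("neg_dentry",
					sizeof(struct dentry), 0,
					SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT,
					NULL);
	}

	/*
	 * Dentries start out negative, so they would be allocated from
	 * neg_dentry_cache; d_instantiate() would then have to migrate
	 * the dentry to the positive slab (copy + rehash), which is the
	 * extra positive<->negative transition cost mentioned above.
	 */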
On 07/17/2017 01:49 PM, Matthew Wilcox wrote:
> On Mon, Jul 17, 2017 at 09:39:30AM -0400, Waiman Long wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however, has
>> no limit other than the total amount of memory available in the
>> system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in
>> the system, impacting the performance of other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to
>> limit the number of negative dentries that can be kept in the LRU
>> lists, as a percentage of total available system memory. The default
>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>> command line option.
> I see the problem, but rather than restricting the number of negative
> dentries to be a fraction of the total amount of memory in the machine,
> wouldn't it make more sense to limit the number of negative dentries to
> be some multiple of the number of positive dentries currently in the
> system?

The number of positive dentries is a rapidly changing number, so we can't
use a __read_mostly variable for the limits, and that may have a certain
performance impact. I chose a fixed number for simplicity and performance.
I can compromise on simplicity, but not on performance. I am open to
adjusting the free pool count in some way as long as the performance
impact is negligible.

> Or make negative dentries more easily prunable. For example, we could
> allocate them from a separate slab and use the existing reclaim mechanism
> to just throw them away. Since they can't be pinned by an inode, they're
> much easier to get rid of than positive dentries. Might make changing
> a dentry from positive to negative or vice versa a bit more expensive ...

I don't quite understand what you mean by having two separate slabs. The
current reclaim mechanism works by scanning the LRU lists. I had been
thinking about keeping a separate LRU list for negative dentries, but
given the complexity of the current per-node/per-memcg LRU lists,
maintaining two separate LRU lists in each super_block may be error-prone.

It is true that positive dentries will also be pruned in the process. By
the time automatic pruning happens, there should already be a lot of
negative dentries in the LRU lists. We can skip over positive dentries in
the scanning, but then we have to either allow scanning more entries in
each pass, prolonging the interruption, or do no pruning at all if the
LRU lists are front-loaded with a bunch of positive dentries.

BTW, you reminded me that I should account for positive-to-negative
dentry transitions, which is missing in the current patch.

Cheers,
Longman
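[The "mostly per-cpu, fold into a global count in batches" scheme defended
above is close to what the kernel's existing percpu_counter provides. A
hedged sketch of counting negative dentries that way instead, with the
limit itself still __read_mostly — the names below are hypothetical:]

	#include <linux/percpu_counter.h>

	static struct percpu_counter nr_dentry_neg;
	static long neg_dentry_limit __read_mostly;	/* fixed at boot */

	/* Call with +1/-1 when a negative dentry enters/leaves the LRU. */
	static inline void neg_dentry_count(long delta)
	{
		percpu_counter_add(&nr_dentry_neg, delta);
	}

	static inline bool neg_dentry_over_limit(void)
	{
		/* Cheap, slightly stale read; exact counts are not needed. */
		return percpu_counter_read_positive(&nr_dentry_neg) >
		       neg_dentry_limit;
	}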
On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however, has
> no limit other than the total amount of memory available in the
> system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in
> the system, impacting the performance of other running applications.
>
> To prevent this from happening, the dcache code is now updated to
> limit the number of negative dentries that can be kept in the LRU
> lists, as a percentage of total available system memory. The default
> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
> command line option.

AFAICS the implementation is counter to the concept of LRU, since it
will get rid of the most recently used negative dentry after passing
the limit. That in itself is a source of DoS (keep rogue negative
dentries at just about the limit, so normal applications are prevented
from getting their negatives cached).

Thanks,
Miklos

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---

[...]
On 07/19/2017 10:39 AM, Miklos Szeredi wrote:
> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however, has
>> no limit other than the total amount of memory available in the
>> system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in
>> the system, impacting the performance of other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to
>> limit the number of negative dentries that can be kept in the LRU
>> lists, as a percentage of total available system memory. The default
>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>> command line option.
> AFAICS the implementation is counter to the concept of LRU, since it
> will get rid of the most recently used negative dentry after passing
> the limit. That in itself is a source of DoS (keep rogue negative
> dentries at just about the limit, so normal applications are prevented
> from getting their negatives cached).
>
> Thanks,
> Miklos

Yes, you are right. That is exactly the problem with patch 1 alone. That
is why I have patches 3 & 4, which enable automatic trimming to decrease
the number of negative dentries before the limit is reached, assuming the
rate of increase of negative dentries isn't faster than the reduction
rate of the automatic trimming process.

Cheers,
Longman
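[A hedged sketch of the automatic-trimming direction referenced above.
Patches 3 & 4 are not shown in this thread, so this is an illustration
rather than their actual code; prune_negative_dentries(), neg_dentry_count()
and the watermark are hypothetical helpers:]

	#include <linux/workqueue.h>

	static void neg_dentry_prune_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(neg_dentry_prune_work, neg_dentry_prune_fn);

	static void neg_dentry_prune_fn(struct work_struct *work)
	{
		/*
		 * Start trimming at a low watermark, well before the hard
		 * limit, so dput() never has to force-kill a dentry.
		 */
		if (neg_dentry_count() > neg_dentry_low_watermark)
			prune_negative_dentries(NEG_DENTRY_BATCH);

		/*
		 * Re-arm: trimming only helps if its rate keeps up with
		 * the rate at which negative dentries are created.
		 */
		schedule_delayed_work(&neg_dentry_prune_work, HZ);
	}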
On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
> The number of positive dentries is limited by the number of files
> in the filesystems. The number of negative dentries, however, has
> no limit other than the total amount of memory available in the
> system. So a rogue application that generates a lot of negative
> dentries can potentially exhaust most of the memory available in
> the system, impacting the performance of other running applications.
>
> To prevent this from happening, the dcache code is now updated to
> limit the number of negative dentries that can be kept in the LRU
> lists, as a percentage of total available system memory. The default
> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
> command line option.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---

[...]

> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>
>         if (!IS_ROOT(dentry)) {
>                 parent = dentry->d_parent;
> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
> +               /*
> +                * Force the killing of this negative dentry when
> +                * DCACHE_KILL_NEGATIVE flag is set.
> +                */
> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
> +                       spin_lock(&parent->d_lock);

This looks like a d_lock ordering problem (should be parent first, child
second). Why is this needed, anyway?

> +               } else if (unlikely(!spin_trylock(&parent->d_lock))) {
>                         if (inode)
>                                 spin_unlock(&inode->i_lock);
>                         goto failed;

Thanks,
Miklos
On 07/19/2017 04:24 PM, Miklos Szeredi wrote:
> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
>> The number of positive dentries is limited by the number of files
>> in the filesystems. The number of negative dentries, however, has
>> no limit other than the total amount of memory available in the
>> system. So a rogue application that generates a lot of negative
>> dentries can potentially exhaust most of the memory available in
>> the system, impacting the performance of other running applications.
>>
>> To prevent this from happening, the dcache code is now updated to
>> limit the number of negative dentries that can be kept in the LRU
>> lists, as a percentage of total available system memory. The default
>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>> command line option.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
> [...]
>
>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>
>>         if (!IS_ROOT(dentry)) {
>>                 parent = dentry->d_parent;
>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>> +               /*
>> +                * Force the killing of this negative dentry when
>> +                * DCACHE_KILL_NEGATIVE flag is set.
>> +                */
>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>> +                       spin_lock(&parent->d_lock);
> This looks like a d_lock ordering problem (should be parent first, child
> second). Why is this needed, anyway?

Yes, that is a bug. I should have used lock_parent() instead.

I have a test program that generates a lot of negative dentries
continuously. Using spin_trylock(), the lock attempt failed most of the
time while that test program was running. So I need to actually acquire
the parent's d_lock to make sure that the offending negative dentry is
really killed. It is there to protect against the worst-case situation.

I will update the patch to correct that. Thanks for spotting this.

Cheers,
Longman
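[The kind of test program described above can be reconstructed in a few
lines of userspace C — a hypothetical stand-in, not Longman's actual test.
Every failed lookup of a nonexistent name leaves a negative dentry behind:]

	#include <stdio.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct stat st;
		char name[64];
		unsigned long i;

		/*
		 * Each stat() fails with ENOENT and caches a negative
		 * dentry for the missing name.
		 */
		for (i = 0; ; i++) {
			snprintf(name, sizeof(name), "/tmp/nofile-%lu", i);
			stat(name, &st);
		}
		return 0;
	}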
On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
> On 07/19/2017 04:24 PM, Miklos Szeredi wrote:
>> On Mon, Jul 17, 2017 at 3:39 PM, Waiman Long <longman@redhat.com> wrote:
>>> The number of positive dentries is limited by the number of files
>>> in the filesystems. The number of negative dentries, however, has
>>> no limit other than the total amount of memory available in the
>>> system. So a rogue application that generates a lot of negative
>>> dentries can potentially exhaust most of the memory available in
>>> the system, impacting the performance of other running applications.
>>>
>>> To prevent this from happening, the dcache code is now updated to
>>> limit the number of negative dentries that can be kept in the LRU
>>> lists, as a percentage of total available system memory. The default
>>> is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
>>> command line option.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>> [...]
>>
>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>
>>>         if (!IS_ROOT(dentry)) {
>>>                 parent = dentry->d_parent;
>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>> +               /*
>>> +                * Force the killing of this negative dentry when
>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>> +                */
>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>> +                       spin_lock(&parent->d_lock);
>> This looks like a d_lock ordering problem (should be parent first, child
>> second). Why is this needed, anyway?
>>
> Yes, that is a bug. I should have used lock_parent() instead.

lock_parent() can release dentry->d_lock, which means it's perfectly
useless for this.

I still feel forcing the free is wrong here. Why not just block until
the number of negatives goes below the limit (start reclaim if not
already doing so, etc.)?

Thanks,
Miklos
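[For context, a condensed rendition of lock_parent() as it looked in
fs/dcache.c around this time — simplified here, consult the tree for the
authoritative version — showing exactly where dentry->d_lock is dropped:]

	static inline struct dentry *lock_parent(struct dentry *dentry)
	{
		struct dentry *parent = dentry->d_parent;

		if (IS_ROOT(dentry))
			return NULL;
		if (likely(spin_trylock(&parent->d_lock)))
			return parent;
		/*
		 * Slow path: drop the child lock to respect the
		 * parent-before-child lock order, then retake both.
		 */
		rcu_read_lock();
		spin_unlock(&dentry->d_lock);
	again:
		parent = READ_ONCE(dentry->d_parent);
		spin_lock(&parent->d_lock);
		if (unlikely(parent != dentry->d_parent)) {
			spin_unlock(&parent->d_lock);
			goto again;
		}
		rcu_read_unlock();
		if (parent != dentry)
			spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
		else
			parent = NULL;
		return parent;
	}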
On 07/20/2017 03:20 AM, Miklos Szeredi wrote:
> On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
>>
>>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>>
>>>>         if (!IS_ROOT(dentry)) {
>>>>                 parent = dentry->d_parent;
>>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>>> +               /*
>>>> +                * Force the killing of this negative dentry when
>>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>>> +                */
>>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>>> +                       spin_lock(&parent->d_lock);
>>> This looks like a d_lock ordering problem (should be parent first, child
>>> second). Why is this needed, anyway?
>>>
>> Yes, that is a bug. I should have used lock_parent() instead.
> lock_parent() can release dentry->d_lock, which means it's perfectly
> useless for this.

As the reference count is kept at 1 in dentry_kill(), the dentry won't
go away even if the dentry lock is temporarily released.

> I still feel forcing the free is wrong here. Why not just block until
> the number of negatives goes below the limit (start reclaim if not
> already doing so, etc.)?

Force freeing is the simplest approach. Any other way would require
adding more code and increasing complexity. One reason I prefer this is
to avoid adding unpredictable latency to regular directory lookups and
other dentry-related operations. We can always change the code later on
if there is a better way of doing it.

Cheers,
Longman
On Thu, Jul 20, 2017 at 4:21 PM, Waiman Long <longman@redhat.com> wrote:
> On 07/20/2017 03:20 AM, Miklos Szeredi wrote:
>> On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
>>>
>>>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>>>
>>>>>         if (!IS_ROOT(dentry)) {
>>>>>                 parent = dentry->d_parent;
>>>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>>>> +               /*
>>>>> +                * Force the killing of this negative dentry when
>>>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>>>> +                */
>>>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>>>> +                       spin_lock(&parent->d_lock);
>>>> This looks like a d_lock ordering problem (should be parent first, child
>>>> second). Why is this needed, anyway?
>>>>
>>> Yes, that is a bug. I should have used lock_parent() instead.
>> lock_parent() can release dentry->d_lock, which means it's perfectly
>> useless for this.
>
> As the reference count is kept at 1 in dentry_kill(), the dentry won't
> go away even if the dentry lock is temporarily released.

It won't go away, but anything else might happen to it (ref grabbed by
somebody else, instantiated, etc.). Don't see how it's going to be
better than the existing trylock.

Thanks,
Miklos
On 07/20/2017 11:08 AM, Miklos Szeredi wrote:
> On Thu, Jul 20, 2017 at 4:21 PM, Waiman Long <longman@redhat.com> wrote:
>> On 07/20/2017 03:20 AM, Miklos Szeredi wrote:
>>> On Wed, Jul 19, 2017 at 10:42 PM, Waiman Long <longman@redhat.com> wrote:
>>>>>> @@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>>>>>
>>>>>>         if (!IS_ROOT(dentry)) {
>>>>>>                 parent = dentry->d_parent;
>>>>>> -               if (unlikely(!spin_trylock(&parent->d_lock))) {
>>>>>> +               /*
>>>>>> +                * Force the killing of this negative dentry when
>>>>>> +                * DCACHE_KILL_NEGATIVE flag is set.
>>>>>> +                */
>>>>>> +               if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
>>>>>> +                       spin_lock(&parent->d_lock);
>>>>> This looks like a d_lock ordering problem (should be parent first, child
>>>>> second). Why is this needed, anyway?
>>>>>
>>>> Yes, that is a bug. I should have used lock_parent() instead.
>>> lock_parent() can release dentry->d_lock, which means it's perfectly
>>> useless for this.
>> As the reference count is kept at 1 in dentry_kill(), the dentry won't
>> go away even if the dentry lock is temporarily released.
> It won't go away, but anything else might happen to it (ref grabbed by
> somebody else, instantiated, etc.). Don't see how it's going to be
> better than the existing trylock.
>
> Thanks,
> Miklos

In the unlikely event that the reference count or the d_flags changes,
we can abort the killing.

Cheers,
Longman
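[A sketch of the recheck-and-abort approach being described, slotted into
the dentry_kill() path from the patch below. This is hypothetical — the
reposted patch may differ:]

	if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
		parent = lock_parent(dentry);	/* may cycle dentry->d_lock */
		/*
		 * The dentry lock may have been dropped: abort the forced
		 * kill if the dentry was re-referenced or instantiated in
		 * the meantime.
		 */
		if (dentry->d_lockref.count != 1 || !d_is_negative(dentry)) {
			dentry->d_flags &= ~DCACHE_KILL_NEGATIVE;
			if (parent)
				spin_unlock(&parent->d_lock);
			goto failed;	/* existing label in dentry_kill() */
		}
	}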
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f701430..fc3c937 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2372,6 +2372,13 @@
 
 	n2=		[NET] SDL Inc. RISCom/N2 synchronous serial card
 
+	neg_dentry_pc=	[KNL]
+			Range: 1-50
+			Default: 5
+			This parameter specifies the amount of negative
+			dentries allowed in the system as a percentage of
+			total system memory.
+
 	netdev=		[NET] Network devices parameters
 			Format: <irq>,<io>,<mem_start>,<mem_end>,<name>
 			Note that mem_start is often overloaded to mean
diff --git a/fs/dcache.c b/fs/dcache.c
index f901413..6a0a844 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -130,8 +130,19 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+/*
+ * Macros and variables to manage and count negative dentries.
+ */
+#define NEG_DENTRY_BATCH	(1 << 8)
+static long neg_dentry_percpu_limit __read_mostly;
+static struct {
+	raw_spinlock_t nfree_lock;
+	long nfree;			/* Negative dentry free pool */
+} ndblk ____cacheline_aligned_in_smp;
+
 static DEFINE_PER_CPU(long, nr_dentry);
 static DEFINE_PER_CPU(long, nr_dentry_unused);
+static DEFINE_PER_CPU(long, nr_dentry_neg);
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
 
@@ -227,6 +238,86 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 
 #endif
 
+/*
+ * There is a system-wide limit to the amount of negative dentries allowed
+ * in the super blocks' LRU lists. The default limit is 5% of the total
+ * system memory. This limit can be changed by using the kernel command line
+ * option "neg_dentry_pc=" to specify the percentage of the total memory
+ * that can be used for negative dentries. That percentage must be in the
+ * 1-50% range.
+ *
+ * To avoid performance problem with a global counter on an SMP system,
+ * the tracking is done mostly on a per-cpu basis. The total limit is
+ * distributed in a 80/20 ratio to per-cpu counters and a global free pool.
+ *
+ * If a per-cpu counter runs out of negative dentries, it can borrow extra
+ * ones from the global free pool. If it has more than its percpu limit,
+ * the extra ones will be returned back to the global pool.
+ */
+
+/*
+ * Decrement negative dentry count if applicable.
+ */
+static void __neg_dentry_dec(struct dentry *dentry)
+{
+	if (unlikely(this_cpu_dec_return(nr_dentry_neg) < 0)) {
+		long *pcnt = get_cpu_ptr(&nr_dentry_neg);
+
+		if ((*pcnt < 0) && raw_spin_trylock(&ndblk.nfree_lock)) {
+			ACCESS_ONCE(ndblk.nfree) += NEG_DENTRY_BATCH;
+			*pcnt += NEG_DENTRY_BATCH;
+			raw_spin_unlock(&ndblk.nfree_lock);
+		}
+		put_cpu_ptr(&nr_dentry_neg);
+	}
+}
+
+static inline void neg_dentry_dec(struct dentry *dentry)
+{
+	if (unlikely(d_is_negative(dentry)))
+		__neg_dentry_dec(dentry);
+}
+
+/*
+ * Increment negative dentry count if applicable.
+ */
+static void __neg_dentry_inc(struct dentry *dentry)
+{
+	long cnt, *pcnt;
+
+	if (this_cpu_inc_return(nr_dentry_neg) <= neg_dentry_percpu_limit)
+		return;
+
+	pcnt = get_cpu_ptr(&nr_dentry_neg);
+	cnt = (READ_ONCE(ndblk.nfree) &&
+	       (*pcnt > neg_dentry_percpu_limit)) ? NEG_DENTRY_BATCH : 0;
+
+	if (cnt && raw_spin_trylock(&ndblk.nfree_lock)) {
+		long val = READ_ONCE(ndblk.nfree);
+
+		if (val < cnt)
+			cnt = val;
+		ACCESS_ONCE(ndblk.nfree) -= cnt;
+		*pcnt -= cnt;
+		raw_spin_unlock(&ndblk.nfree_lock);
+	} else {
+		cnt = 0;
+	}
+	put_cpu_ptr(&nr_dentry_neg);
+	/*
+	 * If there are too many negative dentries, set DCACHE_KILL_NEGATIVE
+	 * flag to indicate that the dentry should be killed.
+	 */
+	if (!cnt)
+		dentry->d_flags |= DCACHE_KILL_NEGATIVE;
+}
+
+static inline void neg_dentry_inc(struct dentry *dentry)
+{
+	if (unlikely(d_is_negative(dentry)))
+		__neg_dentry_inc(dentry);
+}
+
 static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *ct, unsigned tcount)
 {
 	/*
@@ -396,6 +487,7 @@ static void d_lru_add(struct dentry *dentry)
 	dentry->d_flags |= DCACHE_LRU_LIST;
 	this_cpu_inc(nr_dentry_unused);
 	WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	neg_dentry_inc(dentry);
 }
 
 static void d_lru_del(struct dentry *dentry)
@@ -404,6 +496,7 @@ static void d_lru_del(struct dentry *dentry)
 	dentry->d_flags &= ~DCACHE_LRU_LIST;
 	this_cpu_dec(nr_dentry_unused);
 	WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	neg_dentry_dec(dentry);
 }
 
 static void d_shrink_del(struct dentry *dentry)
@@ -434,6 +527,7 @@ static void d_lru_isolate(struct list_lru_one *lru, struct dentry *dentry)
 	dentry->d_flags &= ~DCACHE_LRU_LIST;
 	this_cpu_dec(nr_dentry_unused);
 	list_lru_isolate(lru, &dentry->d_lru);
+	neg_dentry_dec(dentry);
 }
 
 static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
@@ -442,6 +536,7 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
 	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
 	dentry->d_flags |= DCACHE_SHRINK_LIST;
 	list_lru_isolate_move(lru, &dentry->d_lru, list);
+	neg_dentry_dec(dentry);
 }
 
 /*
@@ -603,7 +698,13 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 
 	if (!IS_ROOT(dentry)) {
 		parent = dentry->d_parent;
-		if (unlikely(!spin_trylock(&parent->d_lock))) {
+		/*
+		 * Force the killing of this negative dentry when
+		 * DCACHE_KILL_NEGATIVE flag is set.
+		 */
+		if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE)) {
+			spin_lock(&parent->d_lock);
+		} else if (unlikely(!spin_trylock(&parent->d_lock))) {
 			if (inode)
 				spin_unlock(&inode->i_lock);
 			goto failed;
@@ -815,6 +916,9 @@ void dput(struct dentry *dentry)
 
 	dentry_lru_add(dentry);
 
+	if (unlikely(dentry->d_flags & DCACHE_KILL_NEGATIVE))
+		goto kill_it;
+
 	dentry->d_lockref.count--;
 	spin_unlock(&dentry->d_lock);
 	return;
@@ -1820,6 +1924,11 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 	WARN_ON(d_in_lookup(dentry));
 
 	spin_lock(&dentry->d_lock);
+	/*
+	 * Decrement negative dentry count if it was in the LRU list.
+	 */
+	if (dentry->d_flags & DCACHE_LRU_LIST)
+		neg_dentry_dec(dentry);
 	hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
 	raw_write_seqcount_begin(&dentry->d_seq);
 	__d_set_inode_and_type(dentry, inode, add_flags);
@@ -3566,6 +3675,47 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+static long neg_dentry_pc __initdata = 5;
+static bool neg_dentry_warn __initdata;
+static int __init set_neg_dentry_pc(char *str)
+{
+	ssize_t ret;
+	long new_pc = neg_dentry_pc;
+
+	if (!str)
+		return 0;
+	ret = kstrtol(str, 0, &new_pc);
+	if (ret || (new_pc < 1) || (new_pc > 50))
+		ret = 1;
+	else
+		neg_dentry_pc = new_pc;
+	if (ret)
+		neg_dentry_warn = true;
+	return ret ? 0 : 1;
+}
+__setup("neg_dentry_pc=", set_neg_dentry_pc);
+
+static void __init neg_dentry_init(void)
+{
+	/* Rough estimate of # of dentries allocated per page */
+	unsigned int nr_dentry_page = PAGE_SIZE/sizeof(struct dentry) - 1;
+	unsigned long cnt;
+
+	raw_spin_lock_init(&ndblk.nfree_lock);
+
+	/* 20% in global pool & 80% in percpu free */
+	ndblk.nfree = totalram_pages * nr_dentry_page * neg_dentry_pc / 500;
+	cnt = ndblk.nfree * 4 / num_possible_cpus();
+	if (unlikely(cnt < 2 * NEG_DENTRY_BATCH))
+		cnt = 2 * NEG_DENTRY_BATCH;
+	neg_dentry_percpu_limit = cnt;
+
+	if (neg_dentry_warn)
+		pr_warn("Warning: neg_dentry_pc must be within 1-50 range.\n");
+	pr_info("Negative dentry: percpu limit = %ld, free pool = %ld\n",
+		neg_dentry_percpu_limit, ndblk.nfree);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3606,6 +3756,8 @@ static void __init dcache_init(void)
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT);
 
+	neg_dentry_init();
+
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
 		return;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3f3ff4c..498233b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -218,6 +218,7 @@ struct dentry_operations {
 
 #define DCACHE_PAR_LOOKUP	0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR	0x20000000
+#define DCACHE_KILL_NEGATIVE	0x40000000 /* Kill negative dentry */
 
 extern seqlock_t rename_lock;
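[A worked example of the neg_dentry_init() arithmetic above. The figures
are illustrative, not from the patch: 4 KB pages, sizeof(struct dentry)
of roughly 192 bytes, 16 GB of RAM, 8 possible CPUs:]

	nr_dentry_page          = 4096 / 192 - 1         = 20
	totalram_pages          = 16 GB / 4 KB           = 4194304
	ndblk.nfree             = 4194304 * 20 * 5 / 500 = 838860   (20% pool)
	neg_dentry_percpu_limit = 838860 * 4 / 8         = 419430   (80% split)

	capacity = 8 * 419430 + 838860 ~= 4.19M negative dentries,
	i.e. 5% of the ~83.9M dentries that would fill all of RAM.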
The number of positive dentries is limited by the number of files
in the filesystems. The number of negative dentries, however, has
no limit other than the total amount of memory available in the
system. So a rogue application that generates a lot of negative
dentries can potentially exhaust most of the memory available in
the system, impacting the performance of other running applications.

To prevent this from happening, the dcache code is now updated to
limit the number of negative dentries that can be kept in the LRU
lists, as a percentage of total available system memory. The default
is 5% and can be changed by specifying the "neg_dentry_pc=" kernel
command line option.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   7 ++
 fs/dcache.c                                     | 154 +++++++++++++++++++++++-
 include/linux/dcache.h                          |   1 +
 3 files changed, 161 insertions(+), 1 deletion(-)