diff mbox series

[RFC,v2,14/14] dcache: Implement object migration

Message ID 20190403042127.18755-15-tobin@kernel.org (mailing list archive)
State New, archived
Headers show
Series Slab Movable Objects (SMO) | expand

Commit Message

Tobin C. Harding April 3, 2019, 4:21 a.m. UTC
The dentry slab cache is susceptible to internal fragmentation.  Now
that we have Slab Movable Objects we can defragment the dcache.  Object
migration is only possible for dentry objects that are not currently
referenced by anyone, i.e. we are using the object migration
infrastructure to free unused dentries.

Implement isolate and migrate functions for the dentry slab cache.

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 fs/dcache.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

Comments

Al Viro April 3, 2019, 5:08 p.m. UTC | #1
On Wed, Apr 03, 2019 at 03:21:27PM +1100, Tobin C. Harding wrote:
> The dentry slab cache is susceptible to internal fragmentation.  Now
> that we have Slab Movable Objects we can defragment the dcache.  Object
> migration is only possible for dentry objects that are not currently
> referenced by anyone, i.e. we are using the object migration
> infrastructure to free unused dentries.
> 
> Implement isolate and migrate functions for the dentry slab cache.

> +		/*
> +		 * Three sorts of dentries cannot be reclaimed:
> +		 *
> +		 * 1. dentries that are in the process of being allocated
> +		 *    or being freed. In that case the dentry is neither
> +		 *    on the LRU nor hashed.
> +		 *
> +		 * 2. Fake hashed entries as used for anonymous dentries
> +		 *    and pipe I/O. The fake hashed entries have d_flags
> +		 *    set to indicate a hashed entry. However, the
> +		 *    d_hash field indicates that the entry is not hashed.
> +		 *
> +		 * 3. dentries that have a backing store that is not
> +		 *    writable. This is true for tmpsfs and other in
> +		 *    memory filesystems. Removing dentries from them
> +		 *    would loose dentries for good.
> +		 */
> +		if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
> +		    (!d_unhashed(dentry) && hlist_bl_unhashed(&dentry->d_hash)) ||
> +		    (dentry->d_inode &&
> +		     !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping))) {
> +			/* Ignore this dentry */
> +			v[i] = NULL;
> +		} else {
> +			__dget_dlock(dentry);
> +		}
> +		spin_unlock(&dentry->d_lock);
> +	}
> +	return NULL;		/* No need for private data */
> +}
> +
> +/*
> + * d_migrate() - Dentry migration callback function.
> + * @s: The dentry cache.
> + * @v: Vector of pointers to the objects to migrate.
> + * @nr: Number of objects in @v.
> + * @node: The NUMA node where new object should be allocated.
> + * @private: Returned by d_isolate() (currently %NULL).
> + *
> + * Slab has dropped all the locks. Get rid of the refcount obtained
> + * earlier and also free the object.
> + */
> +static void d_migrate(struct kmem_cache *s, void **v, int nr,
> +		      int node, void *_unused)
> +{
> +	struct dentry *dentry;
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		dentry = v[i];
> +		if (dentry)
> +			d_invalidate(dentry);

Oh, *brilliant*

Let's do d_invalidate() on random dentries and hope they go away.
With convoluted and brittle logics for deciding which ones to
spare, which is actually wrong.  This will pick mountpoints
and tear them out, to start with.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

And this is a NAK for the entire approach; if it has a positive refcount,
LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
d_invalidate() is not something that can be done to an arbitrary dentry.
Al Viro April 3, 2019, 5:19 p.m. UTC | #2
On Wed, Apr 03, 2019 at 06:08:11PM +0100, Al Viro wrote:

> Oh, *brilliant*
> 
> Let's do d_invalidate() on random dentries and hope they go away.
> With convoluted and brittle logics for deciding which ones to
> spare, which is actually wrong.  This will pick mountpoints
> and tear them out, to start with.
> 
> NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> And this is a NAK for the entire approach; if it has a positive refcount,
> LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
> d_invalidate() is not something that can be done to an arbitrary dentry.

PS: "try to evict what can be evicted out of this set" can be done, but
you want something like
	start with empty list
	go through your array of references
		grab dentry->d_lock
		if dentry->d_lockref.count is not zero
			unlock and continue
		if dentry->d_flags & DCACHE_SHRINK_LIST
			ditto, it's not for us to play with
                if (dentry->d_flags & DCACHE_LRU_LIST)
                        d_lru_del(dentry);
		d_shrink_add(dentry, &list);
		unlock

on the collection phase and
	if the list is not empty by the end of that loop
		shrink_dentry_list(&list);
on the disposal.
Al Viro April 3, 2019, 5:48 p.m. UTC | #3
On Wed, Apr 03, 2019 at 06:19:21PM +0100, Al Viro wrote:
> On Wed, Apr 03, 2019 at 06:08:11PM +0100, Al Viro wrote:
> 
> > Oh, *brilliant*
> > 
> > Let's do d_invalidate() on random dentries and hope they go away.
> > With convoluted and brittle logics for deciding which ones to
> > spare, which is actually wrong.  This will pick mountpoints
> > and tear them out, to start with.
> > 
> > NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
> > 
> > And this is a NAK for the entire approach; if it has a positive refcount,
> > LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
> > d_invalidate() is not something that can be done to an arbitrary dentry.
> 
> PS: "try to evict what can be evicted out of this set" can be done, but
> you want something like
> 	start with empty list
> 	go through your array of references
> 		grab dentry->d_lock
> 		if dentry->d_lockref.count is not zero
> 			unlock and continue
> 		if dentry->d_flags & DCACHE_SHRINK_LIST
> 			ditto, it's not for us to play with
>                 if (dentry->d_flags & DCACHE_LRU_LIST)
>                         d_lru_del(dentry);
> 		d_shrink_add(dentry, &list);
> 		unlock
> 
> on the collection phase and
> 	if the list is not empty by the end of that loop
> 		shrink_dentry_list(&list);
> on the disposal.

Note, BTW, that your constructor is wrong - all it really needs to do
is spin_lock_init() and setting ->d_lockref.count same as lockref_mark_dead()
does, to match the state of dentries being torn down.

__d_alloc() is not holding ->d_lock, since the object is not visible to
anybody else yet; with your changes it *is* visible.  However, if the
assignment to ->d_lockref.count in __d_alloc() is guaranteed to be
non-zero to non-zero, the above should be safe.
Christoph Lameter (Ampere) April 3, 2019, 5:56 p.m. UTC | #4
On Wed, 3 Apr 2019, Al Viro wrote:

> Let's do d_invalidate() on random dentries and hope they go away.
> With convoluted and brittle logics for deciding which ones to
> spare, which is actually wrong.  This will pick mountpoints
> and tear them out, to start with.
>
> NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
>
> And this is a NAK for the entire approach; if it has a positive refcount,
> LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
> d_invalidate() is not something that can be done to an arbitrary dentry.

Well could you help us figure out how to do it the right way? We (the MM
guys) are having a hard time not being familiar with the filesystem stuff.

This is an RFC and we want to know how to do this right.
Al Viro April 3, 2019, 6:24 p.m. UTC | #5
On Wed, Apr 03, 2019 at 05:56:27PM +0000, Christopher Lameter wrote:
> On Wed, 3 Apr 2019, Al Viro wrote:
> 
> > Let's do d_invalidate() on random dentries and hope they go away.
> > With convoluted and brittle logics for deciding which ones to
> > spare, which is actually wrong.  This will pick mountpoints
> > and tear them out, to start with.
> >
> > NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
> >
> > And this is a NAK for the entire approach; if it has a positive refcount,
> > LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
> > d_invalidate() is not something that can be done to an arbitrary dentry.
> 
> Well could you help us figure out how to do it the right way? We (the MM
> guys) are having a hard time not being familiar with the filesystem stuff.
> 
> This is an RFC and we want to know how to do this right.

If by "how to do it right" you mean "expedit kicking out something with
non-zero refcount" - there's no way to do that.  Nothing even remotely
sane.

If you mean "kick out everything in this page with zero refcount" - that
can be done (see further in the thread).

Look, dentries and inodes are really, really not relocatable.  If they
can be evicted by memory pressure - sure, we can do that for a given
set (e.g. "everything in that page").  But that's it - if memory
pressure would _not_ get rid of that one, there's nothing to be done.
Again, all VM can do is to simulate shrinker hitting hard on given
bunch (rather than buggering the entire cache).  If filesystem (or
something in VFS) says "it's busy", it bloody well _is_ busy and
won't be going away until it ceases to be such.
Al Viro April 3, 2019, 7:05 p.m. UTC | #6
On Wed, Apr 03, 2019 at 07:24:54PM +0100, Al Viro wrote:

> If by "how to do it right" you mean "expedit kicking out something with
> non-zero refcount" - there's no way to do that.  Nothing even remotely
> sane.
> 
> If you mean "kick out everything in this page with zero refcount" - that
> can be done (see further in the thread).
> 
> Look, dentries and inodes are really, really not relocatable.  If they
> can be evicted by memory pressure - sure, we can do that for a given
> set (e.g. "everything in that page").  But that's it - if memory
> pressure would _not_ get rid of that one, there's nothing to be done.
> Again, all VM can do is to simulate shrinker hitting hard on given
> bunch (rather than buggering the entire cache).  If filesystem (or
> something in VFS) says "it's busy", it bloody well _is_ busy and
> won't be going away until it ceases to be such.

FWIW, some theory: the only kind of long-term reference that can
be killed off by memory pressure is that from child to parent.
Anything else (e.g. an opened file, current directory, mountpoint,
etc.) is out of limits - it either won't be going away until
the thing is not pinned anymore (close, chdir, etc.) *or*
it really shouldn't be ("VM wants this mountpoint dentry freed,
so just dissolve the mount" is a bloody bad idea for obvious
reasons).

Stuff in somebody's shrink list is none of our business - somebody
else is going to try and evict it anyway; if it can be evicted,
it will be.

Anything with zero refcount that isn't in somebody else's
shrink list is fair game.

Directories with children could, in principle, be helped along -
we could try shrink_dcache_parent() on them, which might end
up leaving them with zero refcount.  However, it's not cheap
and if you pick the root dentry of a filesystem, it'll try to
evict everything on it that can be evicted, be it in this page
or not.  And there's no promise that it will end up evictable
after that.

So from the correctness POV
	* you can kick out everything with zero refcount not
on shrink lists.
	* you _might_ try shrink_dcache_parent() on directory
dentries, in hope to drive their refcount to zero.  However,
that's almost certainly going to hit too hard and be too costly.
	* d_invalidate() is no-go; if anything, you want something
weaker than shrink_dcache_parent(), not stronger.

For anything beyond "just kick out everything in that page that
happens to have zero refcount" I would really like to see the
stats - how much does it help, how costly it is _and_ how much
of the cache does it throw away (see above re running into a root
dentry of some filesystem and essentially trimming dcache for
that fs down to the unevictable stuff).
Miklos Szeredi April 4, 2019, 8:01 a.m. UTC | #7
On Wed, Apr 3, 2019 at 9:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Apr 03, 2019 at 07:24:54PM +0100, Al Viro wrote:
>
> > If by "how to do it right" you mean "expedit kicking out something with
> > non-zero refcount" - there's no way to do that.  Nothing even remotely
> > sane.
> >
> > If you mean "kick out everything in this page with zero refcount" - that
> > can be done (see further in the thread).
> >
> > Look, dentries and inodes are really, really not relocatable.  If they
> > can be evicted by memory pressure - sure, we can do that for a given
> > set (e.g. "everything in that page").  But that's it - if memory
> > pressure would _not_ get rid of that one, there's nothing to be done.
> > Again, all VM can do is to simulate shrinker hitting hard on given
> > bunch (rather than buggering the entire cache).  If filesystem (or
> > something in VFS) says "it's busy", it bloody well _is_ busy and
> > won't be going away until it ceases to be such.
>
> FWIW, some theory: the only kind of long-term reference that can
> be killed off by memory pressure is that from child to parent.
> Anything else (e.g. an opened file, current directory, mountpoint,
> etc.) is out of limits - it either won't be going away until
> the thing is not pinned anymore (close, chdir, etc.) *or*
> it really shouldn't be ("VM wants this mountpoint dentry freed,
> so just dissolve the mount" is a bloody bad idea for obvious
> reasons).

Well, theoretically we could do two levels of references, where the
long term reference is stable and contains an rcu protected unstable
reference to the real object.   In the likely case when only read-only
access to the object is needed (d_lookup) then the cost is an extra
dereference and the associated additional cache usage.  If read-write
access is needed to object, then extra locking is needed to protect
against concurrent migration.  So there's non-trivial cost in addition
to the added complexity, and I don't see it actually making sense in
practice.   But maybe someone can expand this idea to something
practicable...

Thanks,
Miklos
Christoph Lameter (Ampere) April 4, 2019, 3:46 p.m. UTC | #8
On Wed, 3 Apr 2019, Al Viro wrote:

> > This is an RFC and we want to know how to do this right.
>
> If by "how to do it right" you mean "expedit kicking out something with
> non-zero refcount" - there's no way to do that.  Nothing even remotely
> sane.

Sure we know that.

> If you mean "kick out everything in this page with zero refcount" - that
> can be done (see further in the thread).

Ok that would already be progress. If we can use this to liberate some
slab pages with just a few dentry object then it may be worthwhile.

> Look, dentries and inodes are really, really not relocatable.  If they
> can be evicted by memory pressure - sure, we can do that for a given
> set (e.g. "everything in that page").  But that's it - if memory
> pressure would _not_ get rid of that one, there's nothing to be done.
> Again, all VM can do is to simulate shrinker hitting hard on given
> bunch (rather than buggering the entire cache).  If filesystem (or
> something in VFS) says "it's busy", it bloody well _is_ busy and
> won't be going away until it ceases to be such.

Right. Thats why the patch attempted to check for these things to avoid
touching such objects.
Tobin Harding April 4, 2019, 8:29 p.m. UTC | #9
On Wed, Apr 03, 2019 at 06:48:55PM +0100, Al Viro wrote:
> On Wed, Apr 03, 2019 at 06:19:21PM +0100, Al Viro wrote:
> > On Wed, Apr 03, 2019 at 06:08:11PM +0100, Al Viro wrote:
> > 
> > > Oh, *brilliant*
> > > 
> > > Let's do d_invalidate() on random dentries and hope they go away.
> > > With convoluted and brittle logics for deciding which ones to
> > > spare, which is actually wrong.  This will pick mountpoints
> > > and tear them out, to start with.
> > > 
> > > NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
> > > 
> > > And this is a NAK for the entire approach; if it has a positive refcount,
> > > LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
> > > d_invalidate() is not something that can be done to an arbitrary dentry.
> > 
> > PS: "try to evict what can be evicted out of this set" can be done, but
> > you want something like
> > 	start with empty list
> > 	go through your array of references
> > 		grab dentry->d_lock
> > 		if dentry->d_lockref.count is not zero
> > 			unlock and continue
> > 		if dentry->d_flags & DCACHE_SHRINK_LIST
> > 			ditto, it's not for us to play with
> >                 if (dentry->d_flags & DCACHE_LRU_LIST)
> >                         d_lru_del(dentry);
> > 		d_shrink_add(dentry, &list);
> > 		unlock
> > 
> > on the collection phase and
> > 	if the list is not empty by the end of that loop
> > 		shrink_dentry_list(&list);
> > on the disposal.
> 
> Note, BTW, that your constructor is wrong - all it really needs to do
> is spin_lock_init() and setting ->d_lockref.count same as lockref_mark_dead()
> does, to match the state of dentries being torn down.

Thanks for looking at this Al.

> __d_alloc() is not holding ->d_lock, since the object is not visible to
> anybody else yet; with your changes it *is* visible.

I don't quite understand this comment.  How is the object visible?  The
constructor is only called when allocating a new page to the slab and
this is done with interrupts disabled.

>  However, if the
> assignment to ->d_lockref.count in __d_alloc() is guaranteed to be
> non-zero to non-zero, the above should be safe.

I've done as you suggest and set it to -128

Thanks for schooling me on the VFS stuff.


	Tobin
Tobin Harding April 4, 2019, 9:18 p.m. UTC | #10
On Wed, Apr 03, 2019 at 06:19:21PM +0100, Al Viro wrote:
> On Wed, Apr 03, 2019 at 06:08:11PM +0100, Al Viro wrote:
> 
> > Oh, *brilliant*
> > 
> > Let's do d_invalidate() on random dentries and hope they go away.
> > With convoluted and brittle logics for deciding which ones to
> > spare, which is actually wrong.  This will pick mountpoints
> > and tear them out, to start with.
> > 
> > NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
> > 
> > And this is a NAK for the entire approach; if it has a positive refcount,
> > LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
> > d_invalidate() is not something that can be done to an arbitrary dentry.
> 
> PS: "try to evict what can be evicted out of this set" can be done, but
> you want something like
> 	start with empty list
> 	go through your array of references
> 		grab dentry->d_lock
> 		if dentry->d_lockref.count is not zero
> 			unlock and continue
> 		if dentry->d_flags & DCACHE_SHRINK_LIST
> 			ditto, it's not for us to play with
>                 if (dentry->d_flags & DCACHE_LRU_LIST)
>                         d_lru_del(dentry);
> 		d_shrink_add(dentry, &list);
> 		unlock
> 
> on the collection phase and
> 	if the list is not empty by the end of that loop
> 		shrink_dentry_list(&list);
> on the disposal.

Implemented as suggested, thanks.  RFCv3 to come when we have some stats
for you :)

thanks,
Tobin.
Al Viro April 4, 2019, 9:58 p.m. UTC | #11
On Fri, Apr 05, 2019 at 07:29:14AM +1100, Tobin C. Harding wrote:
> > __d_alloc() is not holding ->d_lock, since the object is not visible to
> > anybody else yet; with your changes it *is* visible.
> 
> I don't quite understand this comment.  How is the object visible?  The
> constructor is only called when allocating a new page to the slab and
> this is done with interrupts disabled.

Not to constructor, to your try-to-evict code...
diff mbox series

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 606844ad5171..4387715b7ebb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -30,6 +30,7 @@ 
 #include <linux/bit_spinlock.h>
 #include <linux/rculist_bl.h>
 #include <linux/list_lru.h>
+#include <linux/backing-dev.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -3074,6 +3075,90 @@  void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to migrate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+	struct dentry *dentry;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		spin_lock(&dentry->d_lock);
+		/*
+		 * Three sorts of dentries cannot be reclaimed:
+		 *
+		 * 1. dentries that are in the process of being allocated
+		 *    or being freed. In that case the dentry is neither
+		 *    on the LRU nor hashed.
+		 *
+		 * 2. Fake hashed entries as used for anonymous dentries
+		 *    and pipe I/O. The fake hashed entries have d_flags
+		 *    set to indicate a hashed entry. However, the
+		 *    d_hash field indicates that the entry is not hashed.
+		 *
+		 * 3. dentries that have a backing store that is not
+		 *    writable. This is true for tmpsfs and other in
+		 *    memory filesystems. Removing dentries from them
+		 *    would loose dentries for good.
+		 */
+		if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+		    (!d_unhashed(dentry) && hlist_bl_unhashed(&dentry->d_hash)) ||
+		    (dentry->d_inode &&
+		     !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping))) {
+			/* Ignore this dentry */
+			v[i] = NULL;
+		} else {
+			__dget_dlock(dentry);
+		}
+		spin_unlock(&dentry->d_lock);
+	}
+	return NULL;		/* No need for private data */
+}
+
+/*
+ * d_migrate() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to migrate.
+ * @nr: Number of objects in @v.
+ * @node: The NUMA node where new object should be allocated.
+ * @private: Returned by d_isolate() (currently %NULL).
+ *
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void d_migrate(struct kmem_cache *s, void **v, int nr,
+		      int node, void *_unused)
+{
+	struct dentry *dentry;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		if (dentry)
+			d_invalidate(dentry);
+	}
+
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		if (dentry)
+			dput(dentry);
+	}
+
+	/*
+	 * dentries are freed using RCU so we need to wait until RCU
+	 * operations are complete.
+	 */
+	synchronize_rcu();
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3119,6 +3204,8 @@  static void __init dcache_init(void)
 					   sizeof_field(struct dentry, d_iname),
 					   dcache_ctor);
 
+	kmem_cache_setup_mobility(dentry_cache, d_isolate, d_migrate);
+
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
 		return;