
vfs: move dentry shrinking outside the inode lock in 'rmdir()'

Message ID: 20240511182625.6717-2-torvalds@linux-foundation.org
State: New, archived
Series: vfs: move dentry shrinking outside the inode lock in 'rmdir()'

Commit Message

Linus Torvalds May 11, 2024, 6:26 p.m. UTC
Yafang Shao reports that he has seen loads that generate billions of
negative dentries in a directory, which then cause excessive latencies
for other users when the directory is removed, because the dentry
shrinking is done under the directory inode lock.

There seems to be no actual reason to still hold the inode lock by the
time we get rid of the now uninteresting negative dentries; it's just an
effect of code layout (the shared error path).

Reorganize the code trivially to just have a separate success path,
which simplifies the code (since 'd_delete_notify()' is only called in
the success path anyway) and makes it trivial to just move the dentry
shrinking outside the inode lock.

Reported-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/all/20240511022729.35144-1-laoar.shao@gmail.com/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, this is the same patch, just with a commit message.  And I actually
am running a kernel with that patch, so it compiles and boots.

Very limited testing: I created a directory with ten million negative
dentries, and then did a "rmdir" in one terminal window while doing a
"ls" inside that directory in another to kind of reproduce (on a smaller
scale) what Yafang was reporting. 

The "ls" was not affected at all.  But honestly, this was *ONE* single
trivial test, so it's almost completely worthless. 

 fs/namei.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Comments

Linus Torvalds May 11, 2024, 6:42 p.m. UTC | #1
On Sat, 11 May 2024 at 11:29, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Reorganize the code trivially to just have a separate success path,
> which simplifies the code (since 'd_delete_notify()' is only called in
> the success path anyway) and makes it trivial to just move the dentry
> shrinking outside the inode lock.

Bah.

I think this might need more work.

The *caller* of vfs_rmdir() also holds a lock, ie we have do_rmdir() doing

        inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
        dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
        ...
        error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
        dput(dentry);
        inode_unlock(path.dentry->d_inode);

so we have another level of locking going on, and my patch only moved
the dcache pruning outside the lock of the directory we're removing
(not outside the lock of the directory that contains the removed
directory).

And that outside lock is the much more important one, I bet.

So I still think this approach may be the right one, but that patch of
mine didn't go far enough.

Sadly, while do_rmdir() itself is trivial to fix up to do this, we
have several other users of vfs_rmdir() (ecryptfs, devtmpfs, overlayfs
in addition to nfsd and ksmbd), so the more complete patch would be
noticeably bigger.

My bad.

                  Linus
Al Viro May 11, 2024, 7:24 p.m. UTC | #2
On Sat, May 11, 2024 at 11:26:26AM -0700, Linus Torvalds wrote:
> Yafang Shao reports that he has seen loads that generate billions of
> negative dentries in a directory, which then cause excessive latencies
> for other users when the directory is removed, because the dentry
> shrinking is done under the directory inode lock.
> 
> There seems to be no actual reason to still hold the inode lock by the
> time we get rid of the now uninteresting negative dentries; it's just an
> effect of code layout (the shared error path).
> 
> Reorganize the code trivially to just have a separate success path,
> which simplifies the code (since 'd_delete_notify()' is only called in
> the success path anyway) and makes it trivial to just move the dentry
> shrinking outside the inode lock.

Wait, I thought he had confirmed that in the setup in question directories
were *not* removed, hadn't he?

I've no objections against that patch, but if the above is accurate it's
not going to help...
Al Viro May 11, 2024, 7:28 p.m. UTC | #3
On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:

> so we have another level of locking going on, and my patch only moved
> the dcache pruning outside the lock of the directory we're removing
> (not outside the lock of the directory that contains the removed
> directory).
> 
> And that outside lock is the much more important one, I bet.

... and _that_ is where taking d_delete outside of the lock might
take an unpleasant analysis of a lot of code.

We have places where we assume that holding the parent locked
will prevent ->d_inode changes of children.  It might be possible
to get rid of that, but it will take a non-trivial amount of work.

In any case, I think the original poster said that parent directories
were not removed, so I doubt that rmdir() behaviour is relevant for
their load.
Linus Torvalds May 11, 2024, 7:55 p.m. UTC | #4
On Sat, 11 May 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> >
> > And that outside lock is the much more important one, I bet.
>
> ... and _that_ is where taking d_delete outside of the lock might
> take an unpleasant analysis of a lot of code.

Hmm. It really shouldn't matter. There can only be negative children
of the now deleted directory, so there are no actual effects on
inodes.

It only affects the d_child list, which is protected by d_lock (and
can be modified outside of the inode lock anyway due to memory
pressure).

What am I missing?

> In any case, I think the original poster said that parent directories
> were not removed, so I doubt that rmdir() behaviour is relevant for
> their load.

I don't see that at all. The load was a "rm -rf" of a directory tree,
and all of that was successful as far as I can see from the report.

The issue was that an unrelated process just looking at the directory
(either one - I clearly tested the wrong one) would be held up by the
directory lock while the pruning was going on.

And yes, the pruning can take a couple of seconds with "just" a few
million negative dentries. The negative dentries obviously don't even
have to be the result of a 'delete' - the easy way to see this is to
do a lot of bogus lookups.

Attached is my excessively stupid test-case in case you want to play
around with it:

    [dirtest]$ time ./a.out dir ; time rmdir dir

    real 0m12.592s
    user 0m1.153s
    sys 0m11.412s

    real 0m1.892s
    user 0m0.001s
    sys 0m1.528s

so you can see how it takes almost two seconds to then flush those
negative dentries - even when there were no 'unlink()' calls at all,
just failed lookups.
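The attachment itself isn't reproduced here, but a minimal program along
those lines (a hypothetical sketch, not the actual attachment) only needs
to do a few million failed lookups inside the target directory:

    /* Hypothetical sketch of such a test-case: create "dir" and fill the
     * dcache with negative dentries by stat()ing names that don't exist. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        char name[4096];
        long i;

        if (argc < 2)
            return 1;

        mkdir(argv[1], 0755);
        for (i = 0; i < 10000000; i++) {
            /* each failed lookup leaves a negative dentry behind */
            snprintf(name, sizeof(name), "%s/no-such-file-%ld", argv[1], i);
            stat(name, &st);
        }
        return 0;
    }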

It's maybe instructive to do the same on tmpfs, which has

    /*
     * Retaining negative dentries for an in-memory filesystem just wastes
     * memory and lookup time: arrange for them to be deleted immediately.
     */
    int always_delete_dentry(const struct dentry *dentry)
    {
        return 1;
    }
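(As far as I recall, that helper is wired up through the generic dentry
operations in fs/libfs.c, with tmpfs pointing its superblock's ->s_d_op at
that table - roughly:)

    /* ->d_delete returning 1 tells dput() not to keep the dentry around
     * once its last reference is dropped. */
    const struct dentry_operations simple_dentry_operations = {
        .d_delete = always_delete_dentry,
    };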

and so if you do the same test on /tmp, the results are very different:

    [dirtest]$ time ./a.out /tmp/sillydir ; time rmdir /tmp/sillydir

    real 0m8.129s
    user 0m1.164s
    sys 0m6.592s

    real 0m0.001s
    user 0m0.000s
    sys 0m0.001s

so it does show very different patterns and you can test the whole
"what happens without negative dentries" case.

                 Linus
Al Viro May 11, 2024, 8:31 p.m. UTC | #5
On Sat, May 11, 2024 at 12:55:29PM -0700, Linus Torvalds wrote:
> On Sat, 11 May 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> > >
> > > And that outside lock is the much more important one, I bet.
> >
> > ... and _that_ is where taking d_delete outside of the lock might
> > take an unpleasant analysis of a lot of code.
> 
> Hmm. It really shouldn't matter. There can only be negative children
> of the now deleted directory, so there are no actual effects on
> inodes.
> 
> It only affects the d_child list, which is protected by d_lock (and
> can be modified outside of the inode lock anyway due to memory
> pressure).
> 
> What am I missing?

fsnotify and related fun, basically.  I'll need to redo the analysis,
but IIRC there had been places where correctness had been guaranteed
by the fact that this had been serialized by the lock on parent.

No idea how easy it would be to adapt to that change; I'll check,
but it'll take a few days, I'm afraid.
Al Viro May 11, 2024, 9:17 p.m. UTC | #6
On Sat, May 11, 2024 at 09:31:43PM +0100, Al Viro wrote:
> On Sat, May 11, 2024 at 12:55:29PM -0700, Linus Torvalds wrote:
> > On Sat, 11 May 2024 at 12:28, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> > > >
> > > > And that outside lock is the much more important one, I bet.
> > >
> > > ... and _that_ is where taking d_delete outside of the lock might
> > > take an unpleasant analysis of a lot of code.
> > 
> > Hmm. It really shouldn't matter. There can only be negative children
> > of the now deleted directory, so there are no actual effects on
> > inodes.
> > 
> > It only affects the d_child list, which is protected by d_lock (and
> > can be modified outside of the inode lock anyway due to memory
> > pressure).
> > 
> > What am I missing?
> 
> fsnotify and related fun, basically.  I'll need to redo the analysis,
> but IIRC there had been places where correctness had been guaranteed
> by the fact that this had been serialized by the lock on parent.

As an aside, I'd really love to see d_rehash() gone - the really old nest
of users is gone (used to be in nfs), but there's still a weird corner case
in exfat + a bunch in AFS.  Life would be simpler if those were gone -
many correctness proofs around dcache have unpleasant warts dealing with
that crap.  Not relevant in this case, but it makes analysis harder in
general...
James Bottomley May 12, 2024, 3:45 p.m. UTC | #7
On Sat, 2024-05-11 at 20:28 +0100, Al Viro wrote:
> On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> 
> > so we have another level of locking going on, and my patch only
> > moved
> > the dcache pruning outside the lock of the directory we're removing
> > (not outside the lock of the directory that contains the removed
> > directory).
> > 
> > And that outside lock is the much more important one, I bet.
> 
> ... and _that_ is where taking d_delete outside of the lock might
> take an unpleasant analysis of a lot of code.

Couldn't you obviate this by doing it from a workqueue?  Even if the
directory is recreated, the chances are most of the negative dentries
that were under it will still exist and be removable by the time the
workqueue runs.

James
Al Viro May 12, 2024, 4:16 p.m. UTC | #8
On Sun, May 12, 2024 at 11:45:44AM -0400, James Bottomley wrote:
> On Sat, 2024-05-11 at 20:28 +0100, Al Viro wrote:
> > On Sat, May 11, 2024 at 11:42:34AM -0700, Linus Torvalds wrote:
> > 
> > > so we have another level of locking going on, and my patch only
> > > moved
> > > the dcache pruning outside the lock of the directory we're removing
> > > (not outside the lock of the directory that contains the removed
> > > directory).
> > > 
> > > And that outside lock is the much more important one, I bet.
> > 
> > ... and _that_ is where taking d_delete outside of the lock might
> > take an unpleasant analysis of a lot of code.
> 
> Couldn't you obviate this by doing it from a workqueue?  Even if the
> directory is recreated, the chances are most of the negative dentries
> that were under it will still exist and be removable by the time the
> workqueue runs.

Eviction in general - sure.
shrink_dcache_parent() in particular... not really - you'd need to keep
the dentry pinned for that, and that'd cause all kinds of fun for umount.
d_delete() - even worse (you don't want dcache lookups to find that
sucker after rmdir(2) returned success to userland).
Linus Torvalds May 12, 2024, 7:59 p.m. UTC | #9
On Sun, 12 May 2024 at 09:16, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Eviction in general - sure.
> shrink_dcache_parent() in particular... not really - you'd need to keep
> the dentry pinned for that, and that'd cause all kinds of fun for umount.
> d_delete() - even worse (you don't want dcache lookups to find that
> sucker after rmdir(2) returned success to userland).

All of those dentries *should* be negative dentries, so I really don't
see the worry.

For example, any child dentries will obviously still elevate the
dentry count on the parent, so I really think we could actually skip
the dentry shrink *entirely* and it wouldn't really have any semantic
effect.

IOW, I really think the "shrink dentries on rmdir" is more a "let's get rid
of pointless children that will never be used" thing than a semantic issue -
cleaning things up and releasing memory, rather than about behavior.

We will have already marked the inode dead, and called d_delete() on
the directory:

        dentry->d_inode->i_flags |= S_DEAD;
        dont_mount(dentry);
        detach_mounts(dentry);
        ...
        if (!error)
                d_delete_notify(dir, dentry);  // does d_delete(dentry)

so the removed directory entry itself will either have turned into a
negative dentry or have been unhashed (if there are other users).

So the children are already unreachable through that name, and can
only be reached through somebody who still has the directory open. And
I do not see how "rmdir()" can *possibly* have any valid semantic
effect on any user that has that directory as its PWD, so I claim that
the dentries that exist at this point must already not be relevant
from a semantic standpoint.

So Al, this is my argument: the only dentry that *matters* is the
dentry of the removed directory itself, and that's the one that sees
the "d_delete()" (and all the noise to make sure you can't do new
lookups and can't mount on top of it).

The dentries *underneath* it cannot matter for semantics, and can
happily be pruned - or not pruned - at any later date.

Can you address *THIS* issue?

                  Linus
Linus Torvalds May 12, 2024, 8:29 p.m. UTC | #10
On Sun, 12 May 2024 at 12:59, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the children are already unreachable through that name, and can
> only be reached through somebody who still has the directory open. And
> I do not see how "rmdir()" can *possibly* have any valid semantic
> effect on any user that has that directory as its PWD, so I claim that
> the dentries that exist at this point must already not be relevant
> from a semantic standpoint.

Side note: our IS_DEADDIR() checks seem a bit inconsistent.

That's actually what should catch some of the "I'm in a directory that
has been removed" cases, but we have that check in lookup_open(), in
lookup_one_qstr_excl(), and in __lookup_slow().

I wonder if we should check it in lookup_dcache() too?

Because yes, that's a difference that rmdir makes, in how it sets
S_DEAD, and this seems to be an area where existing cached dentries
are handled differently.
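Concretely, such a check would presumably mirror the existing ones - a
rough sketch only, assuming lookup_dcache() keeps roughly its current
shape in fs/namei.c (hypothetical, not a proposed patch):

    static struct dentry *lookup_dcache(const struct qstr *name,
                                        struct dentry *dir,
                                        unsigned int flags)
    {
        struct dentry *dentry;

        /* Hypothetical addition: refuse cached children of a directory
         * that rmdir() has already marked S_DEAD, like lookup_open()
         * and __lookup_slow() already do. */
        if (unlikely(IS_DEADDIR(dir->d_inode)))
            return ERR_PTR(-ENOENT);

        dentry = d_lookup(dir, name);
        if (dentry) {
            int error = d_revalidate(dentry, flags);
            if (unlikely(error <= 0)) {
                if (!error)
                    d_invalidate(dentry);
                dput(dentry);
                return ERR_PTR(error);
            }
        }
        return dentry;
    }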

                Linus
Al Viro May 13, 2024, 5:31 a.m. UTC | #11
On Sun, May 12, 2024 at 12:59:57PM -0700, Linus Torvalds wrote:
> so the removed directory entry itself will either have turned into a
> negative dentry or have been unhashed (if there are other users).
> 
> So the children are already unreachable through that name, and can
> only be reached through somebody who still has the directory open. And
> I do not see how "rmdir()" can *possibly* have any valid semantic
> effect on any user that has that directory as its PWD, so I claim that
> the dentries that exist at this point must already not be relevant
> from a semantic standpoint.
> 
> So Al, this is my argument: the only dentry that *matters* is the
> dentry of the removed directory itself, and that's the one that sees
> the "d_delete()" (and all the noise to make sure you can't do new
> lookups and can't mount on top of it).

Recall what d_delete() will do if you have other references.
That's why we want shrink_dcache_parent() *before* d_delete().

BTW, example of the reasons why d_delete() without directory being locked
is painful: simple_positive() is currently called either under ->d_lock
on dentry in question or under ->i_rwsem on the parent.  Either is enough
to stabilize it.  Get d_delete() happening without parent locked and
all callers of simple_positive() must take ->d_lock.

This one is not hard to adjust, but we need to find all such places.
Currently positivity of hashed dentry can change only with parent
held exclusive.  It's going to take some digging...
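For reference, simple_positive() is this tiny helper (quoted from memory
of include/linux/fs.h, so treat the exact location as approximate):

    static inline bool simple_positive(const struct dentry *dentry)
    {
        /* Both halves of this can change on d_delete(): the dentry can go
         * negative and it can be unhashed.  Callers currently rely on
         * ->d_lock or the parent's ->i_rwsem to keep the answer stable. */
        return d_really_is_positive(dentry) && !d_unhashed(dentry);
    }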
Linus Torvalds May 13, 2024, 3:58 p.m. UTC | #12
On Sun, 12 May 2024 at 22:31, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Recall what d_delete() will do if you have other references.
> That's why we want shrink_dcache_parent() *before* d_delete().

Ahh. Yes. So with the shrinking done afterwards, as in my patch, the
directory that gets removed is now always unhashed.

> BTW, example of the reasons why d_delete() without directory being locked
> is painful

Oh, I would never suggest moving the d_delete() of the directory being
removed itself outside the lock.

I agree 100% on that side.

That said, maybe it's ok to just always unhash the directory when
doing the vfs_rmdir(). I certainly think it's better than unhashing
all the negative dentries like the original patch did.

Ho humm. Let me think about this some more.

We *could* strive for a hybrid approach, where we handle the common
case ("not a ton of child dentries") differently, and just get rid of
them synchronously, and handle the "millions of children" case by
unhashing the directory and dealing with shrinking the children async.

But now it's the merge window, and I have 60+ pull requests in my
inbox, so I will have to go back to work on that.

                  Linus
Al Viro May 13, 2024, 4:33 p.m. UTC | #13
On Mon, May 13, 2024 at 08:58:33AM -0700, Linus Torvalds wrote:
 
> We *could* strive for a hybrid approach, where we handle the common
> case ("not a ton of child dentries") differently, and just get rid of
> them synchronously, and handle the "millions of children" case by
> unhashing the directory and dealing with shrinking the children async.

try_to_shrink_children()?  Doable, and not even that hard to do, but
as for shrinking async...  We can easily move it out of inode_lock
on parent, but doing that really async would either need to be
tied into e.g. remount r/o logics or we'd get userland regressions.

I mean, "I have an opened unlinked file, can't remount r/o" is one
thing, but "I've done rm -rf ~luser, can't remount r/o for a while"
when all luser's processes had been killed and nothing is holding
any of that opened... ouch.
Linus Torvalds May 13, 2024, 4:44 p.m. UTC | #14
On Mon, 13 May 2024 at 09:33, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> try_to_shrink_children()?  Doable, and not even that hard to do, but
> as for shrinking async...  We can easily move it out of inode_lock
> on parent, but doing that really async would either need to be
> tied into e.g. remount r/o logics or we'd get userland regressions.

Let's aim to just get it out from under the inode lock first, and then
look at anything more exciting later, perhaps?

                   Linus
Dave Chinner May 23, 2024, 7:18 a.m. UTC | #15
On Mon, May 13, 2024 at 05:33:32PM +0100, Al Viro wrote:
> On Mon, May 13, 2024 at 08:58:33AM -0700, Linus Torvalds wrote:
>  
> > We *could* strive for a hybrid approach, where we handle the common
> > case ("not a ton of child dentries") differently, and just get rid of
> > them synchronously, and handle the "millions of children" case by
> > unhashing the directory and dealing with shrinking the children async.
> 
> try_to_shrink_children()?  Doable, and not even that hard to do, but
> as for shrinking async...  We can easily move it out of inode_lock
> on parent, but doing that really async would either need to be
> tied into e.g. remount r/o logics or we'd get userland regressions.
> 
> I mean, "I have an opened unlinked file, can't remount r/o" is one
> thing, but "I've done rm -rf ~luser, can't remount r/o for a while"
> when all luser's processes had been killed and nothing is holding
> any of that opened... ouch.

There is no ouch for the vast majority of users: XFS has been doing
background async inode unlink processing since 5.14 (i.e. for almost
3 years now). See commit ab23a7768739 ("xfs: per-cpu deferred inode
inactivation queues") for more of the background on this change - it
was implemented because we needed to allow the scrub code to drop
inode references from within transaction contexts, and evict()
processing could run a nested transaction which could then deadlock
the filesystem.

Hence XFS offloads the inode freeing half of the unlink operation
(i.e. the bit that happens in evict() context) to per-cpu workqueues
instead of doing the work directly in evict() context. We allow
evict() to completely tear down the VFS inode context, but don't
free it in ->destroy_inode() because we still have work to do on it.
XFS doesn't need an active VFS inode context to do the internal
metadata updates needed to free an inode, so it's trivial to defer
this work to a background context outside the VFS inode life cycle.

Hence over half the processing work of every unlink() operation on
XFS is now done in kworker threads rather than via the unlink()
syscall context.

Yes, that means freeze, remount r/o and unmount will block in
xfs_inodegc_stop() waiting for these async inode freeing operations
to be flushed and completed.

However, there have been very few reported issues with freeze,
remount r/o or unmount being significantly delayed - there's an
occasional report of an inode with tens of millions of extents to
free delaying an operation, but that's no change from unlink()
taking minutes to run and delaying the operation that way,
anyway....
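For anyone who wants the shape of that mechanism, the general "defer the
teardown to a per-cpu workqueue" pattern looks roughly like the following -
a generic sketch for illustration, not the actual XFS inodegc code (see the
commit cited above for the real thing):

    /* Generic sketch of per-cpu deferred teardown; assumes each queue was
     * set up with init_llist_head() and INIT_WORK() at mount time. */
    struct defer_queue {
        struct llist_head   items;  /* lock-free list of deferred inodes */
        struct work_struct  work;   /* runs defer_worker() on a workqueue */
    };

    static DEFINE_PER_CPU(struct defer_queue, defer_queues);

    static void defer_worker(struct work_struct *work)
    {
        struct defer_queue *q = container_of(work, struct defer_queue, work);
        struct llist_node *node = llist_del_all(&q->items);

        while (node) {
            struct llist_node *next = node->next;
            /* the expensive filesystem-internal freeing (transactions
             * and all) would happen here, outside of evict() context */
            node = next;
        }
    }

    static void defer_inactivation(struct llist_node *item)
    {
        struct defer_queue *q = this_cpu_ptr(&defer_queues);

        llist_add(item, &q->items);
        queue_work(system_unbound_wq, &q->work);
    }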

-Dave.

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 28e62238346e..474b1ee3266d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4217,16 +4217,19 @@  int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		goto out;
 
-	shrink_dcache_parent(dentry);
 	dentry->d_inode->i_flags |= S_DEAD;
 	dont_mount(dentry);
 	detach_mounts(dentry);
+	inode_unlock(dentry->d_inode);
+
+	shrink_dcache_parent(dentry);
+	dput(dentry);
+	d_delete_notify(dir, dentry);
+	return 0;
 
 out:
 	inode_unlock(dentry->d_inode);
 	dput(dentry);
-	if (!error)
-		d_delete_notify(dir, dentry);
 	return error;
 }
 EXPORT_SYMBOL(vfs_rmdir);
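With the patch applied, the tail of vfs_rmdir() reads roughly as follows
(reconstructed from the hunks above for readability; the earlier part of
the function - locking, permission checks, the ->rmdir() call - is elided):

        if (error)
            goto out;

        dentry->d_inode->i_flags |= S_DEAD;
        dont_mount(dentry);
        detach_mounts(dentry);
        inode_unlock(dentry->d_inode);

        /* The potentially expensive pruning of (negative) children now runs
         * without the victim directory's inode lock - though, as discussed
         * in the thread above, the caller still holds the parent's lock. */
        shrink_dcache_parent(dentry);
        dput(dentry);
        d_delete_notify(dir, dentry);   /* does d_delete(dentry) */
        return 0;

    out:
        inode_unlock(dentry->d_inode);
        dput(dentry);
        return error;
    }
    EXPORT_SYMBOL(vfs_rmdir);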