Message ID | 151019772760.30101.8513274540570798315.stgit@noble (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote: > bit-spin-locks, as used for dcache hash chains, are not fair. > This is not a problem for the dcache hash table as different CPUs are > likely to access different entries in the hash table so high contention > is not expected. > However anonymous dentryies (created by NFSD) all live on a single hash > chain "s_anon" and the bitlock on this can be highly contended, resulting > in soft-lockup warnings. This really seems idiotic. Let me rephrase this commit message so that you can see why I think it's wrong: "NFSd uses a single hash chain for all dentries, which can cause horrible lock contention, in ways that the normal hashing does not. This horrendous contention can cause the machine to have bad latency spikes, which can cause soft lockup warnings. Instead of just fixing the bad design decision that causes this horrible contention, we'll just special-case this code and use an additional lock that hides the problem by serializing the actual contending access". Honestly, looking at the code, the whole s_anon thing seems entirely broken. There doesn't even seem to be much reason for it. In pretty much all cases, we could just hash the damn dentry, The only reason for actually having s_anon seems to be that we want some per-superblock list of these unconnected dentries for shrink_dcache_for_umount(). Everything else would actually be *much* happier with just having the dentry on the regular hash table. It would entirely get rid of this stupid performance problem, and it would actually simplify all the code elsewhere, because it would remove special cases like this if (unlikely(IS_ROOT(dentry))) b = &dentry->d_sb->s_anon; else b = d_hash(dentry->d_name.hash); and just turn them into b = d_hash(dentry->d_name.hash); so I really wonder if we could just get rid of s_anon entirely. Yes, getting rid of s_anon might involve crazy things like "let's just walk all the dentries at umount time", but honestly, that sounds preferable. Especially if we can just then do something like - set a special flag in the superblock if we ever use __d_obtain_alias() - only scan all the dentries on umount if that flag is set. Hmm? The other alternative would be to just try to make the bitlocks fairer. I would find that much less distasteful than this nasty hack. I could imagine, for example, just turning the bitlocks into "spinlock on hashed array". That's basically what we did with the page lock, which _used_ to be basically a blocking bitlock, and where there was no possible way to not have contention on some pages. We could afford to have two bits for the bitlock (lock and contention) to make something like that more viable. But I really get the feeling that the problem here is not the locking primitive, but that whole s_anon decision. And my gut feeling is that it really should be fixable. Because wouldn't it be much nicer if the horrible nfsd contention just went away rather than be worked around? But maybe I'm missing some _other_ reason why s_anon has to be its own separate list? Linus
On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote: > Honestly, looking at the code, the whole s_anon thing seems entirely > broken. There doesn't even seem to be much reason for it. In pretty > much all cases, we could just hash the damn dentry, > > The only reason for actually having s_anon seems to be that we want > some per-superblock list of these unconnected dentries for > shrink_dcache_for_umount(). > > Everything else would actually be *much* happier with just having the > dentry on the regular hash table. It would entirely get rid of this > stupid performance problem, and it would actually simplify all the > code elsewhere, because it would remove special cases like this > > if (unlikely(IS_ROOT(dentry))) > b = &dentry->d_sb->s_anon; > else > b = d_hash(dentry->d_name.hash); > > and just turn them into > > b = d_hash(dentry->d_name.hash); > > so I really wonder if we could just get rid of s_anon entirely. > > Yes, getting rid of s_anon might involve crazy things like "let's just > walk all the dentries at umount time", but honestly, that sounds > preferable. Especially if we can just then do something like > > - set a special flag in the superblock if we ever use __d_obtain_alias() Automatically set for a lot of NFS mounts (whenever you mount more than one tree from the same server, IIRC)... > - only scan all the dentries on umount if that flag is set. > > Hmm? That looks like a bloody painful approach, IMO. I'm not saying I like Neil's patch, but I doubt that "let's just walk the entire dcache on umount" is a good idea. I wonder if separating the d_obtain_alias() and d_obtain_root() would be a good idea; the former outnumber the latter by many orders of magnitude. The tricky part is that we could have a disconnected directory from d_obtain_alias() with children already connected to it (and thus normally hashed by d_splice_alias()) and fail to connect the whole thing to parent. That leaves us with an orphaned tree that might stick around past the time when we drop all references to dentries in it. And we want to get those hunted down and shot on umount. Could we * make s_anon hold d_obtain_root ones + orphans from such failed reconnects * make final dput() treat hashed IS_ROOT as "don't retain it" * have d_obtain_alias() put into normal hash, leaving the "move to s_anon" part to reconnect failures. * keep umount side of things unchanged. I agree that temporary insertions into ->s_anon are bogus; hell, I'm not even sure we want to put it on _any_ list initially - we want it to look like it's hashed, so we could set ->next to NULL and have ->pprev point to itself. Then normal case for d_obtain_alias() would not bother the hash chains at all at allocation time, then have it put into the right hash chain on reconnect. And on reconnect failure the caller would've moved it to orphan list (i.e. ->s_anon).
On Thu, Nov 09 2017, Al Viro wrote: > On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote: >> Honestly, looking at the code, the whole s_anon thing seems entirely >> broken. There doesn't even seem to be much reason for it. In pretty >> much all cases, we could just hash the damn dentry, >> >> The only reason for actually having s_anon seems to be that we want >> some per-superblock list of these unconnected dentries for >> shrink_dcache_for_umount(). >> >> Everything else would actually be *much* happier with just having the >> dentry on the regular hash table. It would entirely get rid of this >> stupid performance problem, and it would actually simplify all the >> code elsewhere, because it would remove special cases like this >> >> if (unlikely(IS_ROOT(dentry))) >> b = &dentry->d_sb->s_anon; >> else >> b = d_hash(dentry->d_name.hash); >> >> and just turn them into >> >> b = d_hash(dentry->d_name.hash); >> >> so I really wonder if we could just get rid of s_anon entirely. >> >> Yes, getting rid of s_anon might involve crazy things like "let's just >> walk all the dentries at umount time", but honestly, that sounds >> preferable. Especially if we can just then do something like >> >> - set a special flag in the superblock if we ever use __d_obtain_alias() > > Automatically set for a lot of NFS mounts (whenever you mount more than one > tree from the same server, IIRC)... > >> - only scan all the dentries on umount if that flag is set. >> >> Hmm? > > That looks like a bloody painful approach, IMO. I'm not saying I like > Neil's patch, but I doubt that "let's just walk the entire dcache on > umount" is a good idea. > > I wonder if separating the d_obtain_alias() and d_obtain_root() would be > a good idea; the former outnumber the latter by many orders of magnitude. > The tricky part is that we could have a disconnected directory from > d_obtain_alias() with children already connected to it (and thus normally > hashed by d_splice_alias()) and fail to connect the whole thing to parent. > > That leaves us with an orphaned tree that might stick around past the > time when we drop all references to dentries in it. And we want to > get those hunted down and shot on umount. Could we > * make s_anon hold d_obtain_root ones + orphans from such > failed reconnects > * make final dput() treat hashed IS_ROOT as "don't retain it" > * have d_obtain_alias() put into normal hash, leaving the > "move to s_anon" part to reconnect failures. > * keep umount side of things unchanged. > > I agree that temporary insertions into ->s_anon are bogus; hell, I'm not > even sure we want to put it on _any_ list initially - we want it to look > like it's hashed, so we could set ->next to NULL and have ->pprev point > to itself. Then normal case for d_obtain_alias() would not bother > the hash chains at all at allocation time, then have it put into the > right hash chain on reconnect. And on reconnect failure the caller > would've moved it to orphan list (i.e. ->s_anon). I looked back at the original bug report, and it was d_obtain_alias() that was particularly suffering. As this holds two spinlocks while waiting for the bl_lock, a delay can affect other code that doesn't touch s_anon. So improving d_obtain_alias() would be a good goal (as long as we don't just move the problem of course). It isn't only "reconnect failure" that leaves things on s_anon. We normally don't bother trying to connect non-directories. So if an NFS server is getting lots of read/write request without opens or other pathname lookups, it could easily have lots of disconnected files being repeatedly accessed. Keeping the dentries on d_anon means we don't need to keep allocating new ones for every request. So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but it might make sense to add the dentry to the per-fs list of IS_ROOT dentries at that time. One possible approach would be to use d_child rather than d_hash to link together dentries that don't have a parent. We could assign a random number to d_name.hash so it could appear to be hashed without imposing on any one hash chain. We would still need a spinlock in the superblock to manage the d_anon list that links the d_child's together... I might try to see how the code looks. Thanks, NeilBrown
On Fri, Nov 10, 2017 at 10:09:09AM +1100, NeilBrown wrote: > So if an NFS server is getting lots of read/write request without opens > or other pathname lookups, it could easily have lots of disconnected > files being repeatedly accessed. Keeping the dentries on d_anon means > we don't need to keep allocating new ones for every request. > > So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but > it might make sense to add the dentry to the per-fs list of IS_ROOT > dentries at that time. Watch out for dput() fast path (see fast_dput()) if you go that way. > One possible approach would be to use d_child rather than d_hash to link > together dentries that don't have a parent. > We could assign a random number to d_name.hash so it could appear to be > hashed without imposing on any one hash chain. We would still need a > spinlock in the superblock to manage the d_anon list that links the > d_child's together... > I might try to see how the code looks. Keep in mind that d_hash() includes bits of ->d_parent, aka. dentry itself. So no need for fake ->d_name.hash; you'll get spread from that part. ->d_child is... delicate. There are very interesting games around d_walk vs. dput already; I'd be very careful with that one.
On Thu, Nov 9, 2017 at 12:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > That looks like a bloody painful approach, IMO. I'm not saying I like > Neil's patch, but I doubt that "let's just walk the entire dcache on > umount" is a good idea. There may be other approaches, but how often do you umount a filesystem that you have NFS-exported? Not often. So making that slower doesn't really sound horrible. > I wonder if separating the d_obtain_alias() and d_obtain_root() would be > a good idea; the former outnumber the latter by many orders of magnitude. > The tricky part is that we could have a disconnected directory from > d_obtain_alias() with children already connected to it (and thus normally > hashed by d_splice_alias()) and fail to connect the whole thing to parent. Well, the real issue (I think) is that we abuse that hash list for the s_anon thing. I'd like to get rid of that part. Now, the reason we do that is because we want to have some way to find dentries that come from that filesystem, but aren't connected. So I really would like to get rid of s_anon entirely. We could easily use _another_ list entirely for holding the "unconnected dentries". Because we do have such a list: the dentry sibling list. So what if we instead of using s_anon, we create a "disconnected root", and make all the disconnected dentries be children of that disconnected root instead? Wouldn't that be nicer? The "reconnect" would be basically a "move from disconnected parent to right place". Maybe I'm missing something, but that sounds much more logical than the current s_anon list. > That leaves us with an orphaned tree that might stick around past the > time when we drop all references to dentries in it. And we want to > get those hunted down and shot on umount. Could we > * make s_anon hold d_obtain_root ones + orphans from such > failed reconnects > * make final dput() treat hashed IS_ROOT as "don't retain it" > * have d_obtain_alias() put into normal hash, leaving the > "move to s_anon" part to reconnect failures. > * keep umount side of things unchanged. I guess that would work too, but I'm not seeing why s_anon is so wonderful. If we want to make those entries look hashed, let's just hash them, and use a special parent for it (that "disconnected root"). Why wouldn't that work? That would make the umount side _simpler_, because the disconnected root would be handled exactly like we currently handle the real root. So we'd just do that same "do_one_tree()" on the disconnected root, the same way we do on the real one. That sounds _so_ straightforward that I'm probably missing something important. And if the "move from disconnected state" is very common, then maybe the disconnected root lock ends up being a horrible choke-point instead, simply because we'd take that parent lock all the time for the add/move operations.. So maybe it's a bad idea. But it sounds clean. Linus
On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote: > Yes, getting rid of s_anon might involve crazy things like "let's just > walk all the dentries at umount time", but honestly, that sounds > preferable. Especially if we can just then do something like > > - set a special flag in the superblock if we ever use __d_obtain_alias() > > - only scan all the dentries on umount if that flag is set. > > Hmm? The scan would be extremely painful with our current global dcache. But then again the global dcache isn't exactly a good idea to start with. If we had a per-sb dcache using e.g. the resizable hash table the networking people seem to be fond off your idea would work really well.
diff --git a/fs/dcache.c b/fs/dcache.c index f90141387f01..d5952306206b 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -51,6 +51,8 @@ * - the dcache hash table * s_anon bl list spinlock protects: * - the s_anon list (see __d_drop) + * dentry->d_sb->s_anon_lock protect: + * - the s_anon bl bitlock, ensuring fairness. * dentry->d_sb->s_dentry_lru_lock protects: * - the dcache lru lists and counters * d_lock protects: @@ -484,7 +486,12 @@ void __d_drop(struct dentry *dentry) else b = d_hash(dentry->d_name.hash); - hlist_bl_lock(b); + if (b == &dentry->d_sb->s_anon) { + spin_lock(&dentry->d_sb->s_anon_lock); + hlist_bl_lock(b); + spin_unlock(&dentry->d_sb->s_anon_lock); + } else + hlist_bl_lock(b); __hlist_bl_del(&dentry->d_hash); dentry->d_hash.pprev = NULL; hlist_bl_unlock(b); @@ -1965,7 +1972,9 @@ static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected) spin_lock(&tmp->d_lock); __d_set_inode_and_type(tmp, inode, add_flags); hlist_add_head(&tmp->d_u.d_alias, &inode->i_dentry); + spin_lock(&tmp->d_sb->s_anon_lock); hlist_bl_lock(&tmp->d_sb->s_anon); + spin_unlock(&tmp->d_sb->s_anon_lock); hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_anon); hlist_bl_unlock(&tmp->d_sb->s_anon); spin_unlock(&tmp->d_lock); diff --git a/fs/super.c b/fs/super.c index 994db21f59bf..af644ae93445 100644 --- a/fs/super.c +++ b/fs/super.c @@ -209,6 +209,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, if (s->s_user_ns != &init_user_ns) s->s_iflags |= SB_I_NODEV; INIT_HLIST_NODE(&s->s_instances); + spin_lock_init(&s->s_anon_lock); INIT_HLIST_BL_HEAD(&s->s_anon); mutex_init(&s->s_sync_lock); INIT_LIST_HEAD(&s->s_inodes); diff --git a/include/linux/fs.h b/include/linux/fs.h index 885266aae2d7..9df9bace53fb 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1359,6 +1359,7 @@ struct super_block { const struct fscrypt_operations *s_cop; + spinlock_t s_anon_lock; /* needed for fairness */ struct hlist_bl_head s_anon; /* anonymous dentries for (nfs) exporting */ struct list_head s_mounts; /* list of mounts; _not_ for fs use */ struct block_device *s_bdev;
bit-spin-locks, as used for dcache hash chains, are not fair. This is not a problem for the dcache hash table as different CPUs are likely to access different entries in the hash table so high contention is not expected. However anonymous dentryies (created by NFSD) all live on a single hash chain "s_anon" and the bitlock on this can be highly contended, resulting in soft-lockup warnings. So introduce a per-sb (fair) spinlock and take it before grabing the bitlock on s_anon. This provides fairness and makes the warnings go away. Signed-off-by: NeilBrown <neilb@suse.com> --- fs/dcache.c | 11 ++++++++++- fs/super.c | 1 + include/linux/fs.h | 1 + 3 files changed, 12 insertions(+), 1 deletion(-)