[RFC,0/7] vfs: improving inode cache iteration scalability

Message ID 20241002014017.3801899-1-david@fromorbit.com

Message

Dave Chinner Oct. 2, 2024, 1:33 a.m. UTC
The management of the sb->s_inodes list is a scalability limitation;
it is protected by a single lock and every inode that is
instantiated has to be added to the list. Those inodes then need to
be removed from the list when evicted from cache. Hence every inode
that moves through the VFS inode cache must take this global scope
lock twice.

This proves to be a significant limiting factor for concurrent file
access workloads that repeatedly miss the dentry cache on lookup.
Directory search and traversal workloads are particularly prone to
these issues, though on XFS we have enough concurrency capability
in file creation and unlink for the sb->s_inodes list to be a
limitation there as well.

Previous efforts to solve this problem have
largely centered around reworking the sb->s_inodes list into
something more scalable such as this longstanding patchset does:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/

However, a recent discussion about inode cache behaviour that arose
from the bcachefs 6.12-rc1 pull request opened a new direction for
us to explore. With both XFS and bcachefs now providing their own
per-superblock inode cache implementations, we should try to make
use of these inode caches as first class citizens.

With that new direction in mind, it became obvious that XFS could
elide the sb->s_inodes list completely - "the best part is no part"
- if iteration was not reliant on open-coded sb->s_inodes list
walks.

We already use the internal inode cache for iteration, and we have
filters for selecting specific inodes to operate on with specific
callback operations. If we had an abstraction for iterating
all VFS inodes, we could easily implement that directly on the XFS
inode cache.

This is what this patchset aims to implement.

There are two superblock iterator functions provided. The first is a
generic iterator that provides safe, reference counted inodes for
the callback to operate on. This is generally what most sb->s_inodes
iterators use, and it allows the iterator to drop locks and perform
blocking operations on the inode before moving to the next inode in
the sb->s_inodes list.
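
To make the discussion concrete, the safe iterator follows the
familiar drop_pagecache_sb() pattern - treat the names and details
below as an approximation of the patch code, not the exact API (the
INO_ITER_REFERENCE flag is explained below):

typedef int (*ino_iter_fn)(struct inode *inode, void *private);

int super_iter_inodes(struct super_block *sb, ino_iter_fn fn,
		      void *private, int flags)
{
	struct inode *inode, *old_inode = NULL;
	int ret = 0;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if ((inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) ||
		    ((flags & INO_ITER_REFERENCE) &&
		     !atomic_read(&inode->i_count))) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/*
		 * The callback runs with a reference held and no locks,
		 * so it can block. Defer the iput() of the previous
		 * inode so the current one keeps our place in the list.
		 */
		iput(old_inode);
		old_inode = inode;
		ret = fn(inode, private);

		cond_resched();
		spin_lock(&sb->s_inode_list_lock);
		if (ret)
			break;
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(old_inode);
	return ret;
}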

There is one quirk to this interface - INO_ITER_REFERENCE - because
fsnotify iterates the inode cache -after- evict_inodes() has been
called during superblock shutdown to evict all non-referenced
inodes. Hence it should only find referenced inodes, and it has
a check to skip unreferenced inodes. This flag does the same.

However, I suspect this is now somewhat sub-optimal because LSMs can
hold references to inodes beyond evict_inodes(), and they don't get
torn down until after fsnotify evicts the referenced inodes it
holds. Meanwhile, the landlock LSM doesn't have checks for
unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this
guard is not consistently applied.

I'm undecided on how best to handle this, but it does not need to be
solved for this patchset to work. fsnotify and
landlock don't need to run -after- evict_inodes(), but moving them
to before evict_inodes() means we would then do three full inode cache
iterations to evict all the inodes from the cache. That doesn't seem
like a good idea when there might be hundreds of millions of cached
inodes at unmount.

Similarly, needing the iterator to be aware that there should be no
unreferenced inodes left when they run doesn't seem like a good
idea, either. So perhaps the answer is that the iterator checks for
SB_ACTIVE (or some other similar flag) that indicates the superblock
is being torn down and so will skip zero-referenced inodes
automatically in this case. Like I said - this doesn't need to be
solved right now, it's just something to be aware of.

The second iterator is the "unsafe" iterator variant that only
provides the callback with an existence guarantee. It does this by
holding the rcu_read_lock() to guarantee that the inode is not freed
from under the callback. There are no validity checks performed on
the inode - it is entirely up to the callback to validate the inode
can be operated on safely.

Hence the "unsafe" variant is intended for very specific internal
uses only. Nobody should be adding new uses of this function, as
there are very few good reasons for external access to inodes
without holding a valid reference. I have not decided whether the
unsafe callbacks should require a lockdep_assert_in_rcu_read_lock()
check in them to clearly document the context under which they are
running.
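
If we did require it, an unsafe callback would end up looking
something like this (purely illustrative):

static int count_live_inodes(struct inode *inode, void *data)
{
	atomic_long_t *count = data;

	/* Document/enforce the iteration context explicitly. */
	lockdep_assert_in_rcu_read_lock();

	/*
	 * RCU only guarantees existence; the callback has to validate
	 * the inode state itself before doing anything with it.
	 */
	spin_lock(&inode->i_lock);
	if (!(inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)))
		atomic_long_inc(count);
	spin_unlock(&inode->i_lock);
	return 0;
}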

The patchset converts all the open coded iterators to use these
new iterator functions, which means the only use of sb->s_inodes
is now confined to fs/super.c (iterator API) and fs/inode.c
(add/remove API). A new superblock operation is then added to
call out from the iterators into the filesystem to allow them to run
the iteration instead of walking the sb->s_inodes list.
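
Sketched out (names and signature approximate), the iterator entry
point simply grows a callout, with the open-coded sb->s_inodes walk
becoming the fallback:

int super_iter_inodes(struct super_block *sb, ino_iter_fn fn,
		      void *private, int flags)
{
	/*
	 * If the filesystem maintains its own inode cache, let it run
	 * the walk. The sb->s_inodes loop from the earlier sketch moves
	 * into super_iter_sb_list_inodes() and becomes the fallback.
	 */
	if (sb->s_op->iter_vfs_inodes)
		return sb->s_op->iter_vfs_inodes(sb, fn, private, flags);

	return super_iter_sb_list_inodes(sb, fn, private, flags);
}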

XFS is then converted to use this new superblock operation. I didn't
use the existing iterator function for this functionality right now
as it is currently based on radix tree tag lookups. It also uses a
batched 'lookup and lock' mechanism that complicated matters as I
developed this code. Hence I open coded a new, simpler cache walk
for testing purposes.

Now that I have stuff working and I think I have the callback API
semantics settled, batched radix tree lookups should still work to
minimise the iteration overhead. Also, we might want to tag VFS
inodes in the radix tree so that we can filter them efficiently for
traversals. This would allow us to use the existing generic inode
cache walker rather than a separate variant as this patch set
implements. This can be done as future work, though.

In terms of scalability improvements, a quick 'will it scale' test
demonstrates where the sb->s_inodes list hurts. Running a sharded,
share-nothing cold cache workload with 100,000 files per thread in
per-thread directories gives the following results on a 4-node 64p
machine with 128GB RAM.

The workloads "walk", "chmod" and "unlink" are all directory
traversal workloads that stream cold cache inodes into the cache.
There is enough memory on this test machine that these inodes are
not being reclaimed during the workload, and are being freed between
steps via drop_caches (which iterates the inode cache and so
explicitly tests the new iteration APIs!). Hence the sb->s_inodes
scalability issues aren't as bad in these tests as when memory is
tight and inodes are being reclaimed (i.e. the issues are worse in
real workloads).

The "bulkstat" workload uses the XFS bulkstat ioctl to iterate
inodes via walking the internal inode btrees. It uses
d_mark_dontcache() so it is actually tearing down each inode as soon
as it has been sampled by the bulkstat code. Hence it is doing two
sb->s_inodes list manipulations per inode and so shows scalability
issues much earlier than the other workloads.

Before (times in seconds):

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.269      3.225      4.557      7.316      1.306
       xfs     800000     8      4.844      3.227      4.702      7.905      1.908
       xfs    1600000    16      6.286      3.296      5.592      8.838      4.392
       xfs    3200000    32      8.912      5.681      8.505     11.724      7.085
       xfs    6400000    64     15.344     11.144     14.162     18.604     15.494

Bulkstat starts to show issues at 8 threads, walk and chmod between
16 and 32 threads, and unlink is limited by internal XFS stuff.
Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the
sb->s_inodes list management.

After (times in seconds):

Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
       xfs     400000     4      4.140      3.502      4.154      7.242      1.164
       xfs     800000     8      4.637      2.836      4.444      7.896      1.093
       xfs    1600000    16      5.549      3.054      5.213      8.696      1.107
       xfs    3200000    32      8.387      3.218      6.867     10.668      1.125
       xfs    6400000    64     14.112      3.953     10.365     18.620      1.270

When patched, walk shows little in the way of scalability degradation
out to 64 threads, chmod is significantly improved at 32-64 threads,
and bulkstat shows perfect scalability out to 64 threads now.

I did a couple of other longer running, higher inode count tests
with bulkstat to get an idea of inode cache streaming rates - 32
million inodes scanned in 4.4 seconds at 64 threads. That's about
7.2 million inodes/s being streamed through the inode cache, with
IO rates peaking well above 5.5GB/s (near IO bound).

Hence raw VFS inode cache throughput sees a ~17x scalability
improvement on XFS at 64 threads (and probably a -lot- more on
higher CPU count machines).  That's far better performance than I
ever got from the dlist conversion of the sb->s_inodes list in
previous patchsets, so this seems like a much better direction to be
heading for optimising the way we cache inodes.

I haven't done a lot of testing on this patchset yet - it boots and
appears to work OK for block devices, ext4 and XFS, but I haven't
yet checked that things like quota on/off still work properly on
ext4.

What do people think of moving towards per-sb inode caching and
traversal mechanisms like this?

-Dave.

Comments

Christian Brauner Oct. 2, 2024, 10 a.m. UTC | #1
On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
[...]
> What do people think of moving towards per-sb inode caching and
> traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even
independent of the rest.

I don't have big conceptual issues with the series otherwise. The only
thing that makes me a bit uneasy is that we are now providing an api
that may encourage filesystems to do their own inode caching even if
they don't really have a need for it just because it's there. So really
a way that would've solved this issue generically would have been my
preference.

But the reality is that xfs has been doing that private inode cache for
a long time and reading through 5/7 and 6/7 it clearly provides value
for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
(Though I would like to s/iter_vfs_inodes/iter_inodes/g).
Dave Chinner Oct. 2, 2024, 12:34 p.m. UTC | #2
On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > What do people think of moving towards per-sb inode caching and
> > traversal mechanisms like this?
> 
> Patches 1-4 are great cleanups that I would like us to merge even
> independent of the rest.

Yes, they make it much easier to manage the iteration code.

> I don't have big conceptual issues with the series otherwise. The only
> thing that makes me a bit uneasy is that we are now providing an api
> that may encourage filesystems to do their own inode caching even if
> they don't really have a need for it just because it's there.  So really
> a way that would've solved this issue generically would have been my
> preference.

Well, that's the problem, isn't it? :/

There really isn't a good generic solution for global list access
and management.  The dlist stuff kinda works, but it still has
significant overhead and doesn't get rid of spinlock contention
completely because of the lack of locality between list add and
remove operations.

i.e. dlist is optimised for low contention add operations (i.e.
local to the CPU). However, removal is not a local operation - it
almost always happens on a different CPU to the add operation.
Hence removal always pulls the list and lock away from the CPU that
"owns" them, and hence there is still contention when inodes are
streaming through memory. This causes enough overhead that dlist
operations are still very visible in CPU profiles during scalability
testing...

XFS (and now bcachefs) have their own per-sb inode cache
implementations, and hence for them the sb->s_inodes list is pure
overhead.  If we restructure the generic inode cache infrastructure
to also be per-sb (this suggestion from Linus was what led me to
this patch set), then they will also likely not need the
sb->s_inodes list, too.

That's the longer term "generic solution" to the sb->s_inodes list
scalability problem (i.e. get rid of it!), but it's a much larger
and longer term undertaking. Once we know what that new generic
inode cache infrastructure looks like, we'll probably only want to
be converting one filesystem at a time to the new infrastructure.

We'll need infrastructure to allow alternative per-sb iteration
mechanisms for such a conversion to take place - the converted
filesystems will likely call a generic ->iter_vfs_inodes()
implementation based on the per-sb inode cache infrastructure rather
than iterating sb->s_inodes. Eventually, that generic method will
replace the sb->s_inodes iteration, and we'll end up with only a
couple of filesystems providing their own callout implementation
again.

> But the reality is that xfs has been doing that private inode cache for
> a long time and reading through 5/7 and 6/7 it clearly provides value
> for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
> (Though I would like to s/iter_vfs_inodes/iter_inodes/g).

I named it that way because, from my filesystem centric point of
view, there is a very distinct separation between VFS and filesystem
inodes. The VFS inode (struct inode) is a subset of the filesystem
inode structure and, in XFS's case, a subset of the filesystem inode
life cycle, too.

i.e. this method should not iterate cached filesystem inodes that
exist outside the VFS inode lifecycle or VFS visibility even though
they may be present in the filesystem's internal inode cache.
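
Purely as an illustration of what that filtering means on the XFS
side - the flag usage here is indicative only, not the actual patch
code:

static bool xfs_inode_vfs_visible(struct xfs_inode *ip)
{
	bool visible;

	spin_lock(&ip->i_flags_lock);
	/* not fully set up yet, or already evicted from the VFS */
	visible = !(ip->i_flags &
		    (XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM));
	spin_unlock(&ip->i_flags_lock);

	return visible;
}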

-Dave.
Kent Overstreet Oct. 2, 2024, 7:29 p.m. UTC | #3
On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > What do people think of moving towards per-sb inode caching and
> > > traversal mechanisms like this?
> > 
> > Patches 1-4 are great cleanups that I would like us to merge even
> > independent of the rest.
> 
> Yes, they make it much easier to manage the iteration code.
> 
> > I don't have big conceptual issues with the series otherwise. The only
> > thing that makes me a bit uneasy is that we are now providing an api
> > that may encourage filesystems to do their own inode caching even if
> > they don't really have a need for it just because it's there.  So really
> > a way that would've solved this issue generically would have been my
> > preference.
> 
> Well, that's the problem, isn't it? :/
> 
> There really isn't a good generic solution for global list access
> and management.  The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

There is though; I haven't posted it yet because it still needs some
work, but the concept works and performs about the same as dlock-list.

https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list

The thing that needs to be sorted before posting is that it can't shrink
the radix tree. generic-radix-tree doesn't support shrinking, and I
could add that, but then ida doesn't provide a way to query the highest
id allocated (xarray doesn't support backwards iteration).

So I'm going to try it using idr and see how that performs (idr is not
really the right data structure for this, split ida and item radix tree
is better, so might end up doing something else).

But - this approach with more work will work for the list_lru lock
contention as well.

From 32cb8103ecfacdd5ed8e1eb390221c3f8339de6f Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Sat, 28 Sep 2024 16:22:38 -0400
Subject: [PATCH] lib/fast_list.c

A fast "list" data structure, which is actually a radix tree, with an
IDA for slot allocation and a percpu buffer on top of that.

Items cannot be added or moved to the head or tail, only added at some
(arbitrary) position and removed. The advantage is that adding, removing
and iteration is generally lockless, only hitting the lock in ida when
the percpu buffer is full or empty.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/fast_list.h b/include/linux/fast_list.h
new file mode 100644
index 000000000000..7d5d8592864d
--- /dev/null
+++ b/include/linux/fast_list.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_FAST_LIST_H
+#define _LINUX_FAST_LIST_H
+
+#include <linux/generic-radix-tree.h>
+#include <linux/idr.h>
+#include <linux/percpu.h>
+
+struct fast_list_pcpu;
+
+struct fast_list {
+	GENRADIX(void *)	items;
+	struct ida		slots_allocated;
+	struct fast_list_pcpu	*buffer;
+};
+
+int fast_list_get_idx(struct fast_list *l);
+int fast_list_add(struct fast_list *l, void *item);
+void fast_list_remove(struct fast_list *l, unsigned idx);
+void fast_list_exit(struct fast_list *l);
+int fast_list_init(struct fast_list *l);
+
+#endif /* _LINUX_FAST_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 773adf88af41..85cf5a0d36b1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -49,7 +49,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
-	 generic-radix-tree.o bitmap-str.o
+	 generic-radix-tree.o bitmap-str.o fast_list.o
 obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
 obj-y += string_helpers.o
 obj-$(CONFIG_STRING_HELPERS_KUNIT_TEST) += string_helpers_kunit.o
diff --git a/lib/fast_list.c b/lib/fast_list.c
new file mode 100644
index 000000000000..bbb69bb29687
--- /dev/null
+++ b/lib/fast_list.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fast, unordered lists
+ *
+ * Supports add, remove, and iterate
+ *
+ * Underneath, they're a radix tree and an IDA, with a percpu buffer for slot
+ * allocation and freeing.
+ *
+ * This means that adding, removing, and iterating over items is lockless,
+ * except when refilling/emptying the percpu slot buffers.
+ */
+
+#include <linux/fast_list.h>
+
+struct fast_list_pcpu {
+	size_t			nr;
+	size_t			entries[31];
+};
+
+/**
+ * fast_list_get_idx - get a slot in a fast_list
+ * @l:		list to get slot in
+ *
+ * This allocates a slot in the radix tree without storing to it, so that we can
+ * take the potential memory allocation failure early and do the list add later
+ * when we can't take an allocation failure.
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_get_idx(struct fast_list *l)
+{
+	int idx;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(!lp->nr))
+		while (lp->nr <= ARRAY_SIZE(lp->entries) / 2) {
+			idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_NOWAIT|__GFP_NOWARN);
+			if (unlikely(idx < 0)) {
+				preempt_enable();
+				idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_KERNEL);
+				if (unlikely(idx < 0))
+					return idx;
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(!genradix_ptr_alloc_inlined(&l->items, idx,
+							GFP_NOWAIT|__GFP_NOWARN))) {
+				preempt_enable();
+				if (!genradix_ptr_alloc(&l->items, idx, GFP_KERNEL)) {
+					ida_free(&l->slots_allocated, idx);
+					return -ENOMEM;
+				}
+
+				preempt_disable();
+				lp = this_cpu_ptr(l->buffer);
+			}
+
+			if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+				ida_free(&l->slots_allocated, idx);
+			else
+				lp->entries[lp->nr++] = idx;
+		}
+
+	idx = lp->entries[--lp->nr];
+	preempt_enable();
+
+	return idx;
+}
+
+/**
+ * fast_list_add - add an item to a fast_list
+ * @l:		list
+ * @item:	item to add
+ *
+ * Allocates a slot in the radix tree and stores to it and then returns the
+ * slot index, which must be passed to fast_list_remove().
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_add(struct fast_list *l, void *item)
+{
+	int idx = fast_list_get_idx(l);
+	if (idx < 0)
+		return idx;
+
+	*genradix_ptr_inlined(&l->items, idx) = item;
+	return idx;
+}
+
+/**
+ * fast_list_remove - remove an item from a fast_list
+ * @l:		list
+ * @idx:	item's slot index
+ *
+ * Zeroes out the slot in the radix tree and frees the slot for future
+ * fast_list_add() operations.
+ */
+void fast_list_remove(struct fast_list *l, unsigned idx)
+{
+	if (!idx)
+		return;
+
+	*genradix_ptr_inlined(&l->items, idx) = NULL;
+
+	preempt_disable();
+	struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+	if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+		while (lp->nr >= ARRAY_SIZE(lp->entries) / 2) {
+			ida_free(&l->slots_allocated, idx);
+			idx = lp->entries[--lp->nr];
+		}
+
+	lp->entries[lp->nr++] = idx;
+	preempt_enable();
+}
+
+void fast_list_exit(struct fast_list *l)
+{
+	/* XXX: warn if list isn't empty */
+	free_percpu(l->buffer);
+	ida_destroy(&l->slots_allocated);
+	genradix_free(&l->items);
+}
+
+int fast_list_init(struct fast_list *l)
+{
+	genradix_init(&l->items);
+	ida_init(&l->slots_allocated);
+	l->buffer = alloc_percpu(*l->buffer);
+	if (!l->buffer)
+		return -ENOMEM;
+	return 0;
+}
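
For illustration, the caller side ends up looking like this (the
struct and function names are just an example): the index returned by
fast_list_add() is stashed in the object for the later remove, and 0
doubles as "not on the list" because slot 0 is never handed out.

struct foo {
	unsigned	list_idx;	/* 0 == not on the list */
	/* ... */
};

static struct fast_list foo_list;

static int foo_track(struct foo *f)
{
	int idx = fast_list_add(&foo_list, f);
	if (idx < 0)
		return idx;

	f->list_idx = idx;
	return 0;
}

static void foo_untrack(struct foo *f)
{
	fast_list_remove(&foo_list, f->list_idx);
	f->list_idx = 0;
}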
Linus Torvalds Oct. 2, 2024, 7:49 p.m. UTC | #4
On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
>
> > I don't have big conceptual issues with the series otherwise. The only
> > thing that makes me a bit uneasy is that we are now providing an api
> > that may encourage filesystems to do their own inode caching even if
> > they don't really have a need for it just because it's there.  So really
> > a way that would've solved this issue generically would have been my
> > preference.
>
> Well, that's the problem, isn't it? :/
>
> There really isn't a good generic solution for global list access
> and management.  The dlist stuff kinda works, but it still has
> significant overhead and doesn't get rid of spinlock contention
> completely because of the lack of locality between list add and
> remove operations.

I much prefer the approach taken in your patch series, to let the
filesystem own the inode list and keeping the old model as the
"default list".

In many ways, that is how *most* of the VFS layer works - it exposes
helper functions that the filesystems can use (and most do), but
doesn't force them.

Yes, the VFS layer does force some things - you can't avoid using
dentries, for example, because that's literally how the VFS layer
deals with filenames (and things like mounting etc). And honestly, the
VFS layer does a better job of filename caching than any filesystem
really can do, and with the whole UNIX mount model, filenames
fundamentally cross filesystem boundaries anyway.

But clearly the VFS layer inode list handling isn't the best it can
be, and unless we can fix that in some fundamental way (and I don't
love the "let's use crazy lists instead of a simple one" models) I do
think that just letting filesystems do their own thing if they have
something better is a good model.

That's how we deal with all the basic IO, after all. The VFS layer has
lots of support routines, but filesystems don't *have* to use things
like generic_file_read_iter() and friends.

Yes, most filesystems do use generic_file_read_iter() in some form or
other (sometimes raw, sometimes wrapped with filesystem logic),
because it fits their model, it's convenient, and it handles all the
normal stuff well, but you don't *have* to use it if you have special
needs.

Taking that approach to the inode caching sounds sane to me, and I
generally like Dave's series. It looks like an improvement to me.

              Linus
Kent Overstreet Oct. 2, 2024, 8:28 p.m. UTC | #5
On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> >
> > > I don't have big conceptual issues with the series otherwise. The only
> > > thing that makes me a bit uneasy is that we are now providing an api
> > > that may encourage filesystems to do their own inode caching even if
> > > they don't really have a need for it just because it's there.  So really
> > > a way that would've solved this issue generically would have been my
> > > preference.
> >
> > Well, that's the problem, isn't it? :/
> >
> > There really isn't a good generic solution for global list access
> > and management.  The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
> 
> I much prefer the approach taken in your patch series, to let the
> filesystem own the inode list and keeping the old model as the
> "default list".
> 
> In many ways, that is how *most* of the VFS layer works - it exposes
> helper functions that the filesystems can use (and most do), but
> doesn't force them.
> 
> Yes, the VFS layer does force some things - you can't avoid using
> dentries, for example, because that's literally how the VFS layer
> deals with filenames (and things like mounting etc). And honestly, the
> VFS layer does a better job of filename caching than any filesystem
> really can do, and with the whole UNIX mount model, filenames
> fundamentally cross filesystem boundaries anyway.
> 
> But clearly the VFS layer inode list handling isn't the best it can
> be, and unless we can fix that in some fundamental way (and I don't
> love the "let's use crazy lists instead of a simple one" models) I do
> think that just letting filesystems do their own thing if they have
> something better is a good model.

Well, I don't love adding more indirection and callbacks.

The underlying approach in this patchset of "just use the inode hash
table if that's available" - that I _do_ like, but this seems like
the wrong way to go about it, we're significantly adding to the amount
of special purpose "things" filesystems have to do if they want to
perform well.

Converting the standard inode hash table to an rhashtable (or more
likely, creating a new standard implementation and converting
filesystems one at a time) still needs to happen, and then the "use the
hash table for iteration" approach could use that without every
filesystem having to specialize.

Failing that, or even regardless, I think we do need either dlock-list
or fast-list. "I need some sort of generic list, but fast" is something
I've seen come up way too many times.
Dave Chinner Oct. 2, 2024, 10:23 p.m. UTC | #6
On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > What do people think of moving towards per-sb inode caching and
> > > > traversal mechanisms like this?
> > > 
> > > Patches 1-4 are great cleanups that I would like us to merge even
> > > independent of the rest.
> > 
> > Yes, they make it much easier to manage the iteration code.
> > 
> > > I don't have big conceptual issues with the series otherwise. The only
> > > thing that makes me a bit uneasy is that we are now providing an api
> > > that may encourage filesystems to do their own inode caching even if
> > > they don't really have a need for it just because it's there.  So really
> > > a way that would've solved this issue generically would have been my
> > > preference.
> > 
> > Well, that's the problem, isn't it? :/
> > 
> > There really isn't a good generic solution for global list access
> > and management.  The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
> 
> There is though; I haven't posted it yet because it still needs some
> work, but the concept works and performs about the same as dlock-list.
> 
> https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> 
> The thing that needs to be sorted before posting is that it can't shrink
> the radix tree. generic-radix-tree doesn't support shrinking, and I
> could add that, but then ida doesn't provide a way to query the highest
> id allocated (xarray doesn't support backwards iteration).

That's an interesting construct, but...

> So I'm going to try it using idr and see how that performs (idr is not
> really the right data structure for this, split ida and item radix tree
> is better, so might end up doing something else).
> 
> But - this approach with more work will work for the list_lru lock
> contention as well.

....  it isn't a generic solution because it is dependent on
blocking memory allocation succeeding for list_add() operations.

Hence this cannot do list operations under external synchronisation
constructs like spinlocks or rcu_read_lock(). It also introduces
interesting interactions with memory reclaim - what happens we have
to add an object to one of these lists from memory reclaim context?

Taking the example of list_lru, this list construct will not work
for a variety of reasons. Some of them are:

- list_lru_add() being called from list_lru_add_obj() under RCU for
  memcg aware LRUs so cannot block and must not fail.
- list_lru_add_obj() is called under spinlocks from inode_lru_add(),
  the xfs buffer and dquot caches, the workingset code from under
  the address space mapping xarray lock, etc. Again, this must not
  fail.
- list_lru_add() operations can take place in large numbers in
  memory reclaim context (e.g. dentry reclaim drops inodes which
  adds them to the inode lru). Hence memory reclaim becomes even
  more dependent on PF_MEMALLOC memory allocation making forwards
  progress.
- adding long tail list latency to what are currently O(1) fast path
  operations (e.g. multiple allocations and tree splits for LRUs
  tracking millions of objects) is not desirable.
- LRU lists are -ordered- (it's right there in the name!) and this
  appears to be an unordered list construct.

So while I think this is an interesting idea that might be useful in
some cases, I don't think it is a viable generic scalable list
construct we can use in areas like list_lru or global list
management that run under external synchronisation mechanisms.

-Dave.
Dave Chinner Oct. 2, 2024, 11:17 p.m. UTC | #7
On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > >
> > > > I don't have big conceptual issues with the series otherwise. The only
> > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > that may encourage filesystems to do their own inode caching even if
> > > > they don't really have a need for it just because it's there.  So really
> > > > a way that would've solved this issue generically would have been my
> > > > preference.
> > >
> > > Well, that's the problem, isn't it? :/
> > >
> > > There really isn't a good generic solution for global list access
> > > and management.  The dlist stuff kinda works, but it still has
> > > significant overhead and doesn't get rid of spinlock contention
> > > completely because of the lack of locality between list add and
> > > remove operations.
> > 
> > I much prefer the approach taken in your patch series, to let the
> > filesystem own the inode list and keeping the old model as the
> > "default list".
> > 
> > In many ways, that is how *most* of the VFS layer works - it exposes
> > helper functions that the filesystems can use (and most do), but
> > doesn't force them.
> > 
> > Yes, the VFS layer does force some things - you can't avoid using
> > dentries, for example, because that's literally how the VFS layer
> > deals with filenames (and things like mounting etc). And honestly, the
> > VFS layer does a better job of filename caching than any filesystem
> > really can do, and with the whole UNIX mount model, filenames
> > fundamentally cross filesystem boundaries anyway.
> > 
> > But clearly the VFS layer inode list handling isn't the best it can
> > be, and unless we can fix that in some fundamental way (and I don't
> > love the "let's use crazy lists instead of a simple one" models) I do
> > think that just letting filesystems do their own thing if they have
> > something better is a good model.
> 
> Well, I don't love adding more indirection and callbacks.

It's way better than open coding inode cache traversals everywhere.

The callback model is simply "call this function on every object",
and it allows implementations the freedom to decide how they are
going to run those callbacks.

For example, this abstraction allows XFS to parallelise the
traversal. We currently run the traversal across all inodes in a
single thread, but now that XFS is walking the inode cache we can
push each shard off to a workqueue and run each shard concurrently.
IOWs, we can actually make the traversal of large caches much, much
faster without changing the semantics of the operation the traversal
is trying to achieve.

We simply cannot do things like that without a new iteration model.
Abstraction is necessary to facilitate a new iteration model, and a
model that provides independent object callbacks allows scope for
concurrent processing of individual objects.
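
As a rough sketch of what that could look like - the shard handle and
the per-shard walker below are entirely hypothetical, only the
workqueue plumbing is real:

struct shard_walk_work {
	struct work_struct	work;
	void			*shard;	/* fs-private shard handle */
	int			(*fn)(struct inode *inode, void *private);
	void			*private;
	int			ret;
};

static void shard_walk_workfn(struct work_struct *work)
{
	struct shard_walk_work *sw =
			container_of(work, struct shard_walk_work, work);

	/*
	 * fs_iter_shard_inodes() is hypothetical: walk one shard of the
	 * filesystem inode cache, calling sw->fn on every VFS inode.
	 */
	sw->ret = fs_iter_shard_inodes(sw->shard, sw->fn, sw->private);
}

static int fs_iter_vfs_inodes(void **shards, int nr_shards,
			      int (*fn)(struct inode *inode, void *private),
			      void *private)
{
	struct shard_walk_work *sw;
	int i, ret = 0;

	sw = kcalloc(nr_shards, sizeof(*sw), GFP_KERNEL);
	if (!sw)
		return -ENOMEM;

	for (i = 0; i < nr_shards; i++) {
		sw[i].shard = shards[i];
		sw[i].fn = fn;
		sw[i].private = private;
		INIT_WORK(&sw[i].work, shard_walk_workfn);
		queue_work(system_unbound_wq, &sw[i].work);
	}

	for (i = 0; i < nr_shards; i++) {
		flush_work(&sw[i].work);
		if (!ret)
			ret = sw[i].ret;
	}

	kfree(sw);
	return ret;
}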

> The underlying approach in this patchset of "just use the inode hash
> table if that's available" - that I _do_ like, but this seems like
> the wrong way to go about it, we're significantly adding to the amount
> of special purpose "things" filesystems have to do if they want to
> perform well.

I've already addressed this in my response to Christian. This is a
mechanism that allows filesystems to be moved one-by-one to a new
generic cache and iteration implementation without impacting
existing code. Once we have that, scalability of the inode cache and
traversals should not be a reason for filesystems "doing their own
thing" because the generic infrastructure will be sufficient for
most filesystem implementations.

> Converting the standard inode hash table to an rhashtable (or more
> likely, creating a new standard implementation and converting
> filesystems one at a time) still needs to happen, and then the "use the
> hash table for iteration" approach could use that without every
> filesystem having to specialize.

Yes, but this still doesn't help filesystems like XFS where the
structure of the inode cache is highly optimised for the specific
on-disk and in-memory locality of inodes. We aren't going to be
converting XFS to a rhashtable based inode cache anytime soon
because it simply doesn't provide the functionality we require.
e.g. efficient lockless sequential inode number ordered traversal in
-every- inode cluster writeback operation.

> Failing that, or even regardless, I think we do need either dlock-list
> or fast-list. "I need some sort of generic list, but fast" is something
> I've seen come up way too many times.

There's nothing stopping you from using the dlist patchset for your
own purposes. It's public code - just make sure you retain the
correct attributions. :)

-Dave.
Kent Overstreet Oct. 2, 2024, 11:20 p.m. UTC | #8
On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > > What do people think of moving towards per-sb inode caching and
> > > > > traversal mechanisms like this?
> > > > 
> > > > Patches 1-4 are great cleanups that I would like us to merge even
> > > > independent of the rest.
> > > 
> > > Yes, they make it much easier to manage the iteration code.
> > > 
> > > > I don't have big conceptual issues with the series otherwise. The only
> > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > that may encourage filesystems to do their own inode caching even if
> > > > they don't really have a need for it just because it's there.  So really
> > > > a way that would've solved this issue generically would have been my
> > > > preference.
> > > 
> > > Well, that's the problem, isn't it? :/
> > > 
> > > There really isn't a good generic solution for global list access
> > > and management.  The dlist stuff kinda works, but it still has
> > > significant overhead and doesn't get rid of spinlock contention
> > > completely because of the lack of locality between list add and
> > > remove operations.
> > 
> > There is though; I haven't posted it yet because it still needs some
> > work, but the concept works and performs about the same as dlock-list.
> > 
> > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> > 
> > The thing that needs to be sorted before posting is that it can't shrink
> > the radix tree. generic-radix-tree doesn't support shrinking, and I
> > could add that, but then ida doesn't provide a way to query the highest
> > id allocated (xarray doesn't support backwards iteration).
> 
> That's an interesting construct, but...
> 
> > So I'm going to try it using idr and see how that performs (idr is not
> > really the right data structure for this, split ida and item radix tree
> > is better, so might end up doing something else).
> > 
> > But - this approach with more work will work for the list_lru lock
> > contention as well.
> 
> ....  it isn't a generic solution because it is dependent on
> blocking memory allocation succeeding for list_add() operations.
> 
> Hence this cannot do list operations under external synchronisation
> constructs like spinlocks or rcu_read_lock(). It also introduces
> interesting interactions with memory reclaim - what happens when we have
> to add an object to one of these lists from memory reclaim context?
> 
> Taking the example of list_lru, this list construct will not work
> for a variety of reasons. Some of them are:
> 
> - list_lru_add() being called from list_lru_add_obj() under RCU for
>   memcg aware LRUs so cannot block and must not fail.
> - list_lru_add_obj() is called under spinlocks from inode_lru_add(),
>   the xfs buffer and dquot caches, the workingset code from under
>   the address space mapping xarray lock, etc. Again, this must not
>   fail.
> - list_lru_add() operations can take place in large numbers in
>   memory reclaim context (e.g. dentry reclaim drops inodes which
>   adds them to the inode lru). Hence memory reclaim becomes even
>   more dependent on PF_MEMALLOC memory allocation making forwards
>   progress.
> - adding long tail list latency to what are currently O(1) fast path
>   operations (e.g. multiple allocations and tree splits for LRUs
>   tracking millions of objects) is not desirable.
> 
> So while I think this is an interesting idea that might be useful in
> some cases, I don't think it is a viable generic scalable list
> construct we can use in areas like list_lru or global list
> management that run under external synchronisation mechanisms.

There are difficulties, but given the fundamental scalability and
locking issues with linked lists, I think this is the approach we want
if we can make it work.

A couple things that help - we've already determined that the inode LRU
can go away for most filesystems, and we can preallocate slots without
actually adding objects. Iteration will see NULLs that it skips over,
so we can't simply preallocate a slot for everything if nr_live_objects
/ nr_lru_objects is too big. But, we can certainly preallocate slots on
a given code path and then release them back to the percpu buffer if
they're not used.
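
i.e. iteration is just a walk of the underlying radix tree that
treats NULL slots as holes - something like this, assuming the
genradix iteration helpers:

	struct genradix_iter iter;
	void **i;

	genradix_for_each(&l->items, iter, i) {
		struct foo *f = *i;

		if (!f)		/* empty or preallocated-but-unused slot */
			continue;

		/*
		 * Object lifetime is the caller's problem here, same as
		 * with any other lockless iteration (RCU, refcounts...).
		 */
		process(f);	/* placeholder */
	}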

> - LRU lists are -ordered- (it's right there in the name!) and this
>   appears to be an unordered list construct.

Yes, it is. But in actual practice cache replacement policy tends not to
matter nearly as much as people think; there are many papers showing
real-world hit ratios of common algorithms are only a fudge factor from
random replacement - the main thing you want is an accessed bit (or
counter, if you want the analogous version of n-lru for n > 2), and
we'll still have
that.
Kent Overstreet Oct. 3, 2024, 1:22 a.m. UTC | #9
On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > >
> > > > > I don't have big conceptual issues with the series otherwise. The only
> > > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > > that may encourage filesystems to do their own inode caching even if
> > > > > they don't really have a need for it just because it's there.  So really
> > > > > a way that would've solved this issue generically would have been my
> > > > > preference.
> > > >
> > > > Well, that's the problem, isn't it? :/
> > > >
> > > > There really isn't a good generic solution for global list access
> > > > and management.  The dlist stuff kinda works, but it still has
> > > > significant overhead and doesn't get rid of spinlock contention
> > > > completely because of the lack of locality between list add and
> > > > remove operations.
> > > 
> > > I much prefer the approach taken in your patch series, to let the
> > > filesystem own the inode list and keeping the old model as the
> > > "default list".
> > > 
> > > In many ways, that is how *most* of the VFS layer works - it exposes
> > > helper functions that the filesystems can use (and most do), but
> > > doesn't force them.
> > > 
> > > Yes, the VFS layer does force some things - you can't avoid using
> > > dentries, for example, because that's literally how the VFS layer
> > > deals with filenames (and things like mounting etc). And honestly, the
> > > VFS layer does a better job of filename caching than any filesystem
> > > really can do, and with the whole UNIX mount model, filenames
> > > fundamentally cross filesystem boundaries anyway.
> > > 
> > > But clearly the VFS layer inode list handling isn't the best it can
> > > be, and unless we can fix that in some fundamental way (and I don't
> > > love the "let's use crazy lists instead of a simple one" models) I do
> > > think that just letting filesystems do their own thing if they have
> > > something better is a good model.
> > 
> > Well, I don't love adding more indirection and callbacks.
> 
> It's way better than open coding inode cache traversals everywhere.

Eh? You had a nice iterator for dlock-list :)

> The callback model is simply "call this function on every object",
> and it allows implementations the freedom to decide how they are
> going to run those callbacks.
> 
> For example, this abstraction allows XFS to parallelise the
> traversal. We currently run the traversal across all inodes in a
> single thread, but now that XFS is walking the inode cache we can
> push each shard off to a workqueue and run each shard concurrently.
> IOWs, we can actually make the traversal of large caches much, much
> faster without changing the semantics of the operation the traversal
> is trying to achieve.
> 
> We simply cannot do things like that without a new iteration model.
> Abstraction is necessary to facilitate a new iteration model, and a
> model that provides independent object callbacks allows scope for
> concurrent processing of individual objects.

Parallelized iteration is a slick possibility.

My concern is that we've been trying to get away from callbacks for
iteration - post-Spectre, they're quite a bit more expensive than
external iterators, and we've generally been successful with that. 

> 
> > The underlying approach in this patchset of "just use the inode hash
> > table if that's available" - that I _do_ like, but this seems like
> > the wrong way to go about it, we're significantly adding to the amount
> > of special purpose "things" filesystems have to do if they want to
> > perform well.
> 
> I've already addressed this in my response to Christian. This is a
> mechanism that allows filesystems to be moved one-by-one to a new
> generic cache and iteration implementation without impacting
> existing code. Once we have that, scalability of the inode cache and
> traversals should not be a reason for filesystems "doing their own
> thing" because the generic infrastructure will be sufficient for
> most filesystem implementations.

Well, I'm not really seeing the need; based on my performance testing
both dlock-list and fast-list completely shift the bottleneck to the
lru_list locking - and in my testing both patchsets were about equal, to
within the margin of error.

Which is a touch surprising, given that dlock-list works similarly to
lru_list - possibly it's because you only have siblings sharing lists
vs. numa nodes for lru lists, or lru scanning is doing more cross
cpu/node accesses.

> > Converting the standard inode hash table to an rhashtable (or more
> > likely, creating a new standard implementation and converting
> > filesystems one at a time) still needs to happen, and then the "use the
> > hash table for iteration" approach could use that without every
> > filesystem having to specialize.
> 
> Yes, but this still doesn't help filesystems like XFS where the
> structure of the inode cache is highly optimised for the specific
> on-disk and in-memory locality of inodes. We aren't going to be
> converting XFS to a rhashtable based inode cache anytime soon
> because it simply doesn't provide the functionality we require.
> e.g. efficient lockless sequential inode number ordered traversal in
> -every- inode cluster writeback operation.

I was going to ask what your requirements are - I may take on the
general purpose inode rhashtable code, although since I'm still pretty
buried we'll see.

Coincidentally, just today I'm working on an issue in bcachefs where
we'd also prefer an ordered data structure to a hash table for the inode
cache - in online fsck, we need to be able to check if an inode is still
open, but checking for an inode in an interior snapshot node means we
have to do a scan and check if any of the open inodes are in a
descendant subvolume.

Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
bits - but it has me considering looking at maple trees (or something
like the lockless RCU btree you were working on a while back) - those
modern approaches should be approaching hash table performance, if
enough needs for ordered access come up.

> > Failing that, or even regardless, I think we do need either dlock-list
> > or fast-list. "I need some sort of generic list, but fast" is something
> > I've seen come up way too many times.
> 
> There's nothing stopping you from using the dlist patchset for your
> own purposes. It's public code - just make sure you retain the
> correct attributions. :)

If this patchset goes in that might be just what I do, if I don't get
around to finishing fast-list :)
Dave Chinner Oct. 3, 2024, 1:41 a.m. UTC | #10
On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote:
> On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote:
> > > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote:
> > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> > > > > > What do people think of moving towards per-sb inode caching and
> > > > > > traversal mechanisms like this?
> > > > > 
> > > > > Patches 1-4 are great cleanups that I would like us to merge even
> > > > > independent of the rest.
> > > > 
> > > > Yes, they make it much easier to manage the iteration code.
> > > > 
> > > > > I don't have big conceptual issues with the series otherwise. The only
> > > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > > that may encourage filesystems to do their own inode caching even if
> > > > > they don't really have a need for it just because it's there.  So really
> > > > > a way that would've solved this issue generically would have been my
> > > > > preference.
> > > > 
> > > > Well, that's the problem, isn't it? :/
> > > > 
> > > > There really isn't a good generic solution for global list access
> > > > and management.  The dlist stuff kinda works, but it still has
> > > > significant overhead and doesn't get rid of spinlock contention
> > > > completely because of the lack of locality between list add and
> > > > remove operations.
> > > 
> > > There is though; I haven't posted it yet because it still needs some
> > > work, but the concept works and performs about the same as dlock-list.
> > > 
> > > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list
> > > 
> > > The thing that needs to be sorted before posting is that it can't shrink
> > > the radix tree. generic-radix-tree doesn't support shrinking, and I
> > > could add that, but then ida doesn't provide a way to query the highest
> > > id allocated (xarray doesn't support backwards iteration).
> > 
> > That's an interesting construct, but...
> > 
> > > So I'm going to try it using idr and see how that performs (idr is not
> > > really the right data structure for this, split ida and item radix tree
> > > is better, so might end up doing something else).
> > > 
> > > But - this approach with more work will work for the list_lru lock
> > > contention as well.
> > 
> > ....  it isn't a generic solution because it is dependent on
> > blocking memory allocation succeeding for list_add() operations.
> > 
> > Hence this cannot do list operations under external synchronisation
> > constructs like spinlocks or rcu_read_lock(). It also introduces
> > interesting interactions with memory reclaim - what happens when we have
> > to add an object to one of these lists from memory reclaim context?
> > 
> > Taking the example of list_lru, this list construct will not work
> > for a variety of reasons. Some of them are:
> > 
> > - list_lru_add() being called from list_lru_add_obj() under RCU for
> >   memcg aware LRUs so cannot block and must not fail.
> > - list_lru_add_obj() is called under spinlocks from inode_lru_add(),
> >   the xfs buffer and dquot caches, the workingset code from under
> >   the address space mapping xarray lock, etc. Again, this must not
> >   fail.
> > - list_lru_add() operations can take place in large numbers in
> >   memory reclaim context (e.g. dentry reclaim drops inodes which
> >   adds them to the inode lru). Hence memory reclaim becomes even
> >   more dependent on PF_MEMALLOC memory allocation making forwards
> >   progress.
> > - adding long tail list latency to what are currently O(1) fast path
> >   operations (e.g. multiple allocations/tree splits for LRUs
> >   tracking millions of objects) is not desirable.
> > 
> > So while I think this is an interesting idea that might be useful in
> > some cases, I don't think it is a viable generic scalable list
> > construct we can use in areas like list_lru or global list
> > management that run under external synchronisation mechanisms.
> 
> There are difficulties, but given the fundamental scalability and
> locking issues with linked lists, I think this is the approach we want
> if we can make it work.

Sure, but this is a completely different problem to what I'm trying
to address here. I want infrastructure that does not need global
lists or list_lru for inode cache maintenance at all. So talking
about how to make the lists I am trying to remove scale better is
kinda missing the point....

> A couple things that help - we've already determined that the inode LRU
> can go away for most filesystems,

We haven't determined that yet. I *think* it is possible, but there
is a really nasty inode LRU dependency that has been driven deep
down into the mm page cache writeback code.  We have to fix that
awful layering violation before we can get rid of the inode LRU.

I *think* we can do it by requiring dirty inodes to hold an explicit
inode reference, thereby keeping the inode pinned in memory whilst
it is being tracked for writeback. That would also get rid of the
nasty hacks needed in evict() to wait on writeback to complete on
unreferenced inodes.

However, this isn't simple to do, and so getting rid of the inode
LRU is not going to happen in the near term.

> and we can preallocate slots without
> actually adding objects. Iteration will see NULLs that they skip over,
> so we can't simply preallocate a slot for everything if nr_live_objects
> / nr_lru_objects is too big. But, we can certainly preallocate slots on
> a given code path and then release them back to the percpu buffer if
> they're not used.

I'm not really that interested in spending time trying to optimise
away list_lru contention at this point in time.

It's not a performance limiting factor because inode and
dentry LRU scalability is controllable by NUMA configuration. i.e.
if you have severe list_lru lock contention on inode and dentry
caches, then either turn on Sub-NUMA Clustering in your bios,
configure your VM with more discrete nodes, or use the fake-numa=N
boot parameter to increase the number of nodes the kernel sets up.
This will increase the number of list_lru instances for NUMA aware
shrinkers and the contention will go away.

This is trivial to do and I use the "configure your VM with more
discrete nodes" method for benchmarking purposes. I've run my perf
testing VMs with 4 nodes for the past decade and the list_lru
contention has never got above the threshold of concern. There's
always been something else causing worse problems, and even with
the sb->s_inodes list out of the way, it still isn't a problem on
64-way cache-hammering workloads...

> > - LRU lists are -ordered- (it's right there in the name!) and this
> >   appears to be an unordered list construct.
> 
> Yes, it is. But in actual practice cache replacement policy tends not to
> matter nearly as much as people think; there's many papers showing real
> world hit ratio of common algorithms is only a fudge factor from random
> replacement - the main thing you want is an accessed bit (or counter, if
> you want the analogous version of n-lru for n > 2), and we'll still have
> that.

Sure.  But I can cherry-pick many papers showing exactly the opposite.
i.e. that LRU and LFU algorithms are far superior at maintaining a
working set compared to random cache shootdown, especially when
there is significant latency for cache replacement.

What matters is whether there are any behavioural regressions as a
result of changing the current algorithm. We've used quasi-LRU
working set management for so long that this is the behaviour that
people have tuned their systems and applications to work well with.
Fundamental changes to working set maintenance behaviour is not
something I'm considering doing, nor something I *want* to do.

And, really, this is way outside the scope of this patch set. It's
even outside of the scope of "remove the inode cache LRU" proposal
because that proposal is based on the existing dentry cache LRU
working set management making the inode cache LRU completely
redundant. i.e.  it's not a change of working set management
algorithms at all....

-Dave.
Dave Chinner Oct. 3, 2024, 2:20 a.m. UTC | #11
On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> > > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > >
> > > > > > I don't have big conceptual issues with the series otherwise. The only
> > > > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > > > that may encourage filesystems to do their own inode caching even if
> > > > > > they don't really have a need for it just because it's there.  So really
> > > > > > a way that would've solved this issue generically would have been my
> > > > > > preference.
> > > > >
> > > > > Well, that's the problem, isn't it? :/
> > > > >
> > > > > There really isn't a good generic solution for global list access
> > > > > and management.  The dlist stuff kinda works, but it still has
> > > > > significant overhead and doesn't get rid of spinlock contention
> > > > > completely because of the lack of locality between list add and
> > > > > remove operations.
> > > > 
> > > > I much prefer the approach taken in your patch series, to let the
> > > > filesystem own the inode list and keeping the old model as the
> > > > "default list".
> > > > 
> > > > In many ways, that is how *most* of the VFS layer works - it exposes
> > > > helper functions that the filesystems can use (and most do), but
> > > > doesn't force them.
> > > > 
> > > > Yes, the VFS layer does force some things - you can't avoid using
> > > > dentries, for example, because that's literally how the VFS layer
> > > > deals with filenames (and things like mounting etc). And honestly, the
> > > > VFS layer does a better job of filename caching than any filesystem
> > > > really can do, and with the whole UNIX mount model, filenames
> > > > fundamentally cross filesystem boundaries anyway.
> > > > 
> > > > But clearly the VFS layer inode list handling isn't the best it can
> > > > be, and unless we can fix that in some fundamental way (and I don't
> > > > love the "let's use crazy lists instead of a simple one" models) I do
> > > > think that just letting filesystems do their own thing if they have
> > > > something better is a good model.
> > > 
> > > Well, I don't love adding more indirection and callbacks.
> > 
> > It's way better than open coding inode cache traversals everywhere.
> 
> Eh? You had a nice iterator for dlock-list :)

Which was painful to work with because
it maintains the existing spin lock based traversal pattern. This
was necessary because the iterator held a spinlock internally. I
really didn't like that aspect of it because it perpetuated the need
to open code the iget/iput game to allow the spinlock to be dropped
across the inode operation that needed to be performed.

i.e. adding a dlist iterator didn't clean up any of the other mess
that sb->s_inodes iteration required...

> > We simply cannot do things like that without a new iteration model.
> > Abstraction is necessary to facilitate a new iteration model, and a
> > model that provides independent object callbacks allows scope for
> > concurrent processing of individual objects.
> 
> Parallelized iteration is a slick possibility.
> 
> My concern is that we've been trying to get away from callbacks for
> iteration - post spectre they're quite a bit more expensive than
> external iterators, and we've generally been successful with that. 

So everyone keeps saying, but the old adage applies here: Penny
wise, pound foolish.

Optimising away the callbacks might bring us a few percent
performance improvement for each operation (e.g. via the dlist
iterator mechanisms) in a traversal, but that iteration is still
only single threaded. Hence the maximum processing rate is
determined by the performance of a single CPU core.

However, if we change the API to allow for parallelism at the cost
of a few percent per object operation, then a single CPU core will
not process quite as many objects as before. However, the moment we
allow multiple CPU cores to process in parallel, we achieve
processing rate improvements measured in integer multiples.

Modern CPUs have concurrency to burn.  Optimising APIs for minimum
per-operation overhead rather than for concurrent processing
implementations is the wrong direction to be taking....
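
To be concrete about the sort of thing I mean: with a per-object
callback API, a filesystem can fan the traversal out across its own
cache shards internally, something like the sketch below. This is
purely illustrative - none of it is code from this patch set, and all
of the fs_*/shard_* names (fs_info, fs_shard, shard_grab_next_inode(),
the iter_wq workqueue) are made up:

typedef int (*ino_iter_fn)(struct inode *inode, void *priv);

struct shard_walk {
	struct work_struct	work;
	ino_iter_fn		iter_fn;
	void			*priv;
};

struct fs_shard {
	/* ...the filesystem's per-shard inode cache lives here... */
	struct shard_walk	walk;	/* per-shard walk context */
};

struct fs_info {
	int			nr_shards;
	struct fs_shard		*shards;
	struct workqueue_struct	*iter_wq;
};

static void shard_walk_workfn(struct work_struct *work)
{
	struct shard_walk *sw = container_of(work, struct shard_walk, work);
	struct fs_shard *shard = container_of(sw, struct fs_shard, walk);
	unsigned long index = 0;
	struct inode *inode;

	/*
	 * shard_grab_next_inode() is a stand-in for the filesystem's own
	 * cache walker: return the next cached inode at or after @index
	 * with a reference held, advancing @index past it.
	 */
	while ((inode = shard_grab_next_inode(shard, &index))) {
		int error = sw->iter_fn(inode, sw->priv);

		iput(inode);
		if (error)
			break;		/* error propagation elided */
	}
}

static void fs_iter_inodes_concurrent(struct fs_info *fs,
				      ino_iter_fn iter_fn, void *priv)
{
	int i;

	for (i = 0; i < fs->nr_shards; i++) {
		struct shard_walk *sw = &fs->shards[i].walk;

		sw->iter_fn = iter_fn;
		sw->priv = priv;
		INIT_WORK(&sw->work, shard_walk_workfn);
		queue_work(fs->iter_wq, &sw->work);
	}

	/* wait for all shards to finish before returning to the VFS */
	flush_workqueue(fs->iter_wq);
}

The VFS caller just sees its callback run on every inode; how much
concurrency sits behind that is entirely the filesystem's choice.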

> > > Converting the standard inode hash table to an rhashtable (or more
> > > likely, creating a new standard implementation and converting
> > > filesystems one at a time) still needs to happen, and then the "use the
> > > hash table for iteration" approach could use that without every
> > > filesystem having to specialize.
> > 
> > Yes, but this still doesn't help filesystems like XFS where the
> > structure of the inode cache is highly optimised for the specific
> > on-disk and in-memory locality of inodes. We aren't going to be
> > converting XFS to a rhashtable based inode cache anytime soon
> > because it simply doesn't provide the functionality we require.
> > e.g. efficient lockless sequential inode number ordered traversal in
> > -every- inode cluster writeback operation.
> 
> I was going to ask what your requirements are - I may take on the
> general purpose inode rhashtable code, although since I'm still pretty
> buried we'll see.
> 
> Coincidentally, just today I'm working on an issue in bcachefs where
> we'd also prefer an ordered data structure to a hash table for the inode
> cache - in online fsck, we need to be able to check if an inode is still
> open, but checking for an inode in an interior snapshot node means we
> have to do a scan and check if any of the open inodes are in a
> descendent subvolume.
> 
> Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
> bits -

Sure it does - you just need two layers of radix trees. i.e. have a
radix tree per subvol to index inodes by inum, and a per-sb radix
tree to index the subvols. With some code to propagate radix tree
bits from the inode radix tree to the subvol radix tree they then
largely work in conjunction for filtered searches.

This is -exactly- the internal inode cache structure that XFS has.
We have a per-sb radix tree indexing the allocation groups, and a
radix tree per allocation group indexing inodes by inode number.
Hence an inode lookup involves splitting the inum into agno/agino
pairs, then doing a perag lookup with the agno, and doing a perag
inode cache lookup with the agino. All of these radix tree
lookups are lockless...
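
The lookup side of that structure boils down to something like the
sketch below. This is not the actual XFS code - the fs_cache/per_ag
struct names, the ag_tree/inode_tree fields and the AGINO_BITS split
are all made up, and taking a stable reference on the returned inode
and all error handling are elided - it just shows the two radix tree
lookups under RCU:

struct per_ag {
	struct radix_tree_root	inode_tree;	/* agino -> struct inode */
};

struct fs_cache {
	struct radix_tree_root	ag_tree;	/* agno -> struct per_ag */
};

#define AGINO_BITS	32	/* made up split of the 64 bit inum */

static struct inode *cache_ilookup(struct fs_cache *fc, u64 inum)
{
	unsigned long agno = inum >> AGINO_BITS;
	unsigned long agino = inum & ((1UL << AGINO_BITS) - 1);
	struct per_ag *pag;
	struct inode *inode = NULL;

	rcu_read_lock();
	pag = radix_tree_lookup(&fc->ag_tree, agno);
	if (pag)
		inode = radix_tree_lookup(&pag->inode_tree, agino);
	rcu_read_unlock();

	/* grabbing a stable reference to the inode is elided */
	return inode;
}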

-Dave.
Kent Overstreet Oct. 3, 2024, 2:24 a.m. UTC | #12
On Thu, Oct 03, 2024 at 11:41:42AM GMT, Dave Chinner wrote:
> > A couple things that help - we've already determined that the inode LRU
> > can go away for most filesystems,
> 
> We haven't determined that yet. I *think* it is possible, but there
> is a really nasty inode LRU dependency that has been driven deep
> down into the mm page cache writeback code.  We have to fix that
> awful layering violation before we can get rid of the inode LRU.
> 
> I *think* we can do it by requiring dirty inodes to hold an explicit
> inode reference, thereby keeping the inode pinned in memory whilst
> it is being tracked for writeback. That would also get rid of the
> nasty hacks needed in evict() to wait on writeback to complete on
> unreferenced inodes.
> 
> However, this isn't simple to do, and so getting rid of the inode
> LRU is not going to happen in the near term.

Ok.

> > and we can preallocate slots without
> > actually adding objects. Iteration will see NULLs that they skip over,
> > so we can't simply preallocate a slot for everything if nr_live_objects
> > / nr_lru_objects is too big. But, we can certainly preallocate slots on
> > a given code path and then release them back to the percpu buffer if
> > they're not used.
> 
> I'm not really that interested in spending time trying to optimise
> away list_lru contention at this point in time.

Fair, and I'm not trying to derail this one - whether dlock-list, or
fast-list, or super_iter_inodes(), we should do one of them. On current
mainline, I see lock contention bad enough to trigger bcachefs's 10
second "srcu lock held too long" warning, any of these solves the
biggest problem.

But...

> It's not a performance limiting factor because inode and
> dentry LRU scalability is controllable by NUMA configuration. i.e.
> if you have severe list_lru lock contention on inode and dentry
> caches, then either turn on Sub-NUMA Clustering in your bios,
> configure your VM with more discrete nodes, or use the fake-numa=N
> boot parameter to increase the number of nodes the kernel sets up.
> This will increase the number of list_lru instances for NUMA aware
> shrinkers and the contention will go away.

I don't buy this, asking users to change their bios (and even to know
that's a thing they should consider) is a big ask. Linked lists _suck_,
both w.r.t. locking and cache behaviour, and we need to be exploring
better options.

> > > - LRU lists are -ordered- (it's right there in the name!) and this
> > >   appears to be an unordered list construct.
> > 
> > Yes, it is. But in actual practice cache replacement policy tends not to
> > matter nearly as much as people think; there's many papers showing real
> > world hit ratio of common algorithms is only a fudge factor from random
> > replacement - the main thing you want is an accessed bit (or counter, if
> > you want the analogous version of n-lru for n > 2), and we'll still have
> > that.
> 
> Sure.  But I can cherry-pick many papers showing exactly the opposite.
> i.e. that LRU and LFU algorithms are far superior at maintaining a
> working set compared to random cache shootdown, especially when
> there is significant latency for cache replacement.

But as mentioned we won't be comparing against pure random, it'll be vs.
pure random with at least an accessed bit, preserving the multiple
generations which are the most important feature of LRU/LFU as we use
them.

> What matters is whether there are any behavioural regressions as a
> result of changing the current algorithm. We've used quasi-LRU
> working set management for so long that this is the behaviour that
> people have tuned their systems and applications to work well with.
> Fundamental changes to working set maintenance behaviour is not
> something I'm considering doing, nor something I *want* to do.

Yup, it's not going to be the easiest thing to tackle.
Kent Overstreet Oct. 3, 2024, 2:42 a.m. UTC | #13
On Thu, Oct 03, 2024 at 12:20:53PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> > On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> Which was painful to work with because
> it maintains the existing spin lock based traversal pattern. This
> was necessary because the iterator held a spinlock internally. I
> really didn't like that aspect of it because it perpetuated the need
> to open code the iget/iput game to allow the spinlock to be dropped
> across the inode operation that needed to be performed.
> 
> i.e. adding a dlist iterator didn't clean up any of the other mess
> that sb->s_inodes iteration required...

yeah, true.

that's actually something that does get cleaner with fast-list; because
we're iterating over a radix tree and our iterator is a radix tree
index, the crab-walk thing naturally goes away.

> > My concern is that we've been trying to get away from callbacks for
> > iteration - post spectre they're quite a bit more expensive than
> > external iterators, and we've generally been successful with that. 
> 
> So everyone keeps saying, but the old adage applies here: Penny
> wise, pound foolish.
> 
> Optimising away the callbacks might bring us a few percent
> performance improvement for each operation (e.g. via the dlist
> iterator mechanisms) in a traversal, but that iteration is still
> only single threaded. Hence the maximum processing rate is
> determined by the performance of a single CPU core.
> 
> However, if we change the API to allow for parallelism at the cost
> of a few percent per object operation, then a single CPU core will
> not process quite as many objects as before. However, the moment we
> allow multiple CPU cores to process in parallel, we achieve
> processing rate improvements measured in integer multiples.
> 
> Modern CPUs have concurrency to burn.  Optimising APIs for minimum
> per-operation overhead rather than for concurrent processing
> implementations is the wrong direction to be taking....

OTOH - this is all academic because none of the uses of s_inodes are
_remotely_ fastpaths. Aside from nr_blockdev_pages() it's more or less
all filesystem teardown, or similar frequency.

> > Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
> > bits -
> 
> Sure it does - you just need two layers of radix trees. i.e. have a
> radix tree per subvol to index inodes by inum, and a per-sb radix
> tree to index the subvols. With some code to propagate radix tree
> bits from the inode radix tree to the subvol radix tree they then
> largely work in conjunction for filtered searches.

It'd have to be the reverse - index by inum, then subvol, and then we'd
need to do bit stuffing so that a radix tree with a single element is
just a pointer to the element. But - yeah, if the current approach (not
considering the subvol when calculating the hash) becomes an issue, that
might be the way to go.

> This is -exactly- the internal inode cache structure that XFS has.
> We have a per-sb radix tree indexing the allocation groups, and a
> radix tree per allocation group indexing inodes by inode number.
> Hence an inode lookup involves splitting the inum into agno/agino
> pairs, then doing a perag lookup with the agno, and doing a perag
> inode cache lookup with the agino. All of these radix tree
> lookups are lockless...

Speaking of, I'd like to pick your brain on AGIs at some point. We've
been sketching out future scalability work in bcachefs, and I think
that's going to be one of the things we'll end up needing.

Right now the scalability limit is backpointers fsck, but that looks
fairly trivial to solve: there's no reason to run the backpointers ->
extents pass except for debug testing, we can check and repair those
references at runtime, and we can sum up backpointers in a bucket and
check them against the bucket sector counts and skip extents ->
backpointers if they match.

After that, the next scalability limitation should be the main
check_alloc_info pass, and we'll need something analogous to AGIs to
shard that and run it efficiently when the main allocation info doesn't
fit in memory - and it sounds like you have other optimizations that
leverage AGIs as well.
Jan Kara Oct. 3, 2024, 9:17 a.m. UTC | #14
On Thu 03-10-24 11:41:42, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote:
> > A couple things that help - we've already determined that the inode LRU
> > can go away for most filesystems,
> 
> We haven't determined that yet. I *think* it is possible, but there
> is a really nasty inode LRU dependency that has been driven deep
> down into the mm page cache writeback code.  We have to fix that
> awful layering violation before we can get rid of the inode LRU.
> 
> I *think* we can do it by requiring dirty inodes to hold an explicit
> inode reference, thereby keeping the inode pinned in memory whilst
> it is being tracked for writeback. That would also get rid of the
> nasty hacks needed in evict() to wait on writeback to complete on
> unreferenced inodes.
> 
> However, this isn't simple to do, and so getting rid of the inode
> LRU is not going to happen in the near term.

Yeah. I agree the way writeback protects inodes from eviction is not the
prettiest one, but the problem with writeback holding a normal inode
reference is that the flush worker for the device can then end up deleting
unlinked inodes, which was causing writeback stalls and generally unexpected
lock ordering issues for some filesystems (I've already forgotten the
details). Now this was more than 12 years ago, so maybe we could find a
better solution to those problems these days (e.g. interactions between page
writeback and page reclaim are very different these days), but I just wanted
to warn that there may be nasty surprises there.

								Honza
Dave Chinner Oct. 3, 2024, 9:59 a.m. UTC | #15
On Thu, Oct 03, 2024 at 11:17:41AM +0200, Jan Kara wrote:
> On Thu 03-10-24 11:41:42, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote:
> > > A couple things that help - we've already determined that the inode LRU
> > > can go away for most filesystems,
> > 
> > We haven't determined that yet. I *think* it is possible, but there
> > is a really nasty inode LRU dependency that has been driven deep
> > down into the mm page cache writeback code.  We have to fix that
> > awful layering violation before we can get rid of the inode LRU.
> > 
> > I *think* we can do it by requiring dirty inodes to hold an explicit
> > inode reference, thereby keeping the inode pinned in memory whilst
> > it is being tracked for writeback. That would also get rid of the
> > nasty hacks needed in evict() to wait on writeback to complete on
> > unreferenced inodes.
> > 
> > However, this isn't simple to do, and so getting rid of the inode
> > LRU is not going to happen in the near term.
> 
> Yeah. I agree the way writeback protects inodes from eviction is not the
> prettiest one, but the problem with writeback holding a normal inode
> reference is that the flush worker for the device can then end up deleting
> unlinked inodes, which was causing writeback stalls and generally unexpected
> lock ordering issues for some filesystems (I've already forgotten the
> details).

Yeah, if we end up in evict() on ext4 it can then do all sorts
of whacky stuff that involves blocking, running transactions and
doing other IO. XFS, OTOH, has been changed to defer all that crap
to background threads (the xfs_inodegc infrastructure) that runs
after the VFS thinks the inode is dead and destroyed. There are some
benefits to having the filesystem inode exist outside the VFS inode
life cycle....

> Now this was more than 12 years ago, so maybe we could find a better
> solution to those problems these days (e.g. interactions between page
> writeback and page reclaim are very different these days), but I just
> wanted to warn that there may be nasty surprises there.

I don't think the situation has improved with filesystems like ext4.
I think they've actually gotten worse - I recently learnt that ext4
inode eviction can recurse back into the inode cache to instantiate
extended attribute inodes so they can be truncated to allow inode
eviction to make progress.

I suspect the ext4 eviction behaviour is unfixable in any reasonable
time frame, so the only solution I can come up with is to run the
iput() call from a background thread context (e.g. defer it to a
workqueue). That way iput_final() and eviction processing will not
interfere with other writeback operations....
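
Something as simple as the sketch below would be the rough shape of it.
This is purely illustrative and not from this patch set - i_deferred_put
would have to be a new llist_node member in struct inode, and none of
this exists today:

struct deferred_iput {
	struct llist_head	list;
	struct work_struct	work;
};

static void deferred_iput_workfn(struct work_struct *work)
{
	struct deferred_iput *di = container_of(work, struct deferred_iput,
						work);
	struct llist_node *list = llist_del_all(&di->list);
	struct inode *inode, *next;

	/* run the final iput()s well away from writeback context */
	llist_for_each_entry_safe(inode, next, list, i_deferred_put)
		iput(inode);
}

static void iput_deferred(struct deferred_iput *di, struct inode *inode)
{
	/* first item added to an empty list schedules the work */
	if (llist_add(&inode->i_deferred_put, &di->list))
		schedule_work(&di->work);
}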

-Dave.
Jan Kara Oct. 3, 2024, 11:45 a.m. UTC | #16
Hi Dave!

On Wed 02-10-24 11:33:17, Dave Chinner wrote:
> There are two superblock iterator functions provided. The first is a
> generic iterator that provides safe, reference counted inodes for
> the callback to operate on. This is generally what most sb->s_inodes
> iterators use, and it allows the iterator to drop locks and perform
> blocking operations on the inode before moving to the next inode in
> the sb->s_inodes list.
> 
> There is one quirk to this interface - INO_ITER_REFERENCE - because
> fsnotify iterates the inode cache -after- evict_inodes() has been
> called during superblock shutdown to evict all non-referenced
> inodes. Hence it should only find referenced inodes, and it has
> a check to skip unreferenced inodes. This flag does the same.

Overall I really like the series. A lot of duplicated code removed and
scalability improved, we don't get such deals frequently :) Regarding
INO_ITER_REFERENCE I think that after commit 1edc8eb2e9313 ("fs: call
fsnotify_sb_delete after evict_inodes") the check for 0 i_count in
fsnotify_unmount_inodes() isn't that useful anymore so I'd be actually fine
dropping it (as a separate patch please).

That being said I'd like to discuss one thing: As you have surely noticed,
some of the places iterating inodes perform additional checks on the inode
to determine whether the inode is interesting or not (e.g. the Landlock
iterator or iterators in quota code) to avoid the unnecessary iget / iput
and locking dance. The inode refcount check you've worked around with
INO_ITER_REFERENCE is a special case of that. Have you considered the option
of providing a callback for the check inside the iterator?

Also maybe I went a *bit* overboard here with macro magic but the code
below should provide an iterator that you can use like:

	for_each_sb_inode(sb, inode, inode_eligible_check(inode)) {
		do my stuff here
	}

that will avoid any indirect calls and will magically handle all the
cleanup that needs to be done if you break / jump out of the loop or
similar. I actually find such constructs more convenient to use than your
version of the iterator because there's no need to create & pass around the
additional data structure for the iterator body, no need for special return
values to abort iteration etc.

								Honza

/* Find next inode on the inode list eligible for processing */
#define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
({									\
	struct inode *ret = NULL;					\
									\
	cond_resched();							\
	spin_lock(&(sb)->s_inode_list_lock);				\
	if (!(inode))							\
		inode = list_first_entry(&(sb)->s_inodes, struct inode,	\
					 i_sb_list);			\
	else								\
		inode = list_next_entry(inode, i_sb_list);		\
	while (1) {							\
		if (list_entry_is_head(inode, &(sb)->s_inodes, i_sb_list)) { \
			spin_unlock(&(sb)->s_inode_list_lock);		\
			break;						\
		}							\
		spin_lock(&inode->i_lock);				\
		if ((inode)->i_state & (I_NEW | I_FREEING | I_WILL_FREE) || \
		    !(inode_eligible)) {				\
			spin_unlock(&(inode)->i_lock);			\
			inode = list_next_entry(inode, i_sb_list);	\
			continue;					\
		}							\
		__iget(inode);						\
		spin_unlock(&(inode)->i_lock);				\
		spin_unlock(&(sb)->s_inode_list_lock);			\
		iput(*old_inode);					\
		*old_inode = inode;					\
		ret = inode;						\
		break;							\
	}								\
	ret;								\
})

#define for_each_sb_inode(sb, inode, inode_eligible)			\
	for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)),	\
	     inode = NULL;						\
	     inode = sb_inode_iter_next((sb), inode, &old_inode,	\
					 inode_eligible);		\
	    )
Christoph Hellwig Oct. 3, 2024, 12:18 p.m. UTC | #17
On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> /* Find next inode on the inode list eligible for processing */
> #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
> ({									\
> 	struct inode *ret = NULL;					\

<snip>

> 	ret;								\
> })

How is this going to interact with calling into the file system
to do the iteration, which is kinda the point of this series?

Sure, you could have a get_next_inode-style method, but it would need
a fair amount of entirely file system specific state that needs
to be stashed away somewhere, and all options for that look pretty
ugly.

Also even with your pre-iget callback we'd at best halve the number
of indirect calls for something that isn't exactly performance
critical.  So while it could be done that way, it feels like a
more complex and much harder to reason about version for no real
benefit.

> #define for_each_sb_inode(sb, inode, inode_eligible)			\
> 	for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)),	\
> 	     inode = NULL;						\
> 	     inode = sb_inode_iter_next((sb), inode, &old_inode,	\
> 					 inode_eligible);		\
> 	    )

And while I liked:

	obj = NULL;

	while ((obj = get_next_object(foo, obj))) {
	}

style iterators, magic for_each macros that do magic cleanup are just
a nightmare to read.  Keep it simple and optimize for someone actually
having to read and understand the code, and not for saving a few lines
of code.
Jan Kara Oct. 3, 2024, 12:46 p.m. UTC | #18
On Thu 03-10-24 05:18:30, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> > /* Find next inode on the inode list eligible for processing */
> > #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
> > ({									\
> > 	struct inode *ret = NULL;					\
> 
> <snip>
> 
> > 	ret;								\
> > })
> 
> How is this going to interact with calling into the file system
> to do the iteration, which is kinda the point of this series?

Yeah, I was concentrating on the VFS bits and forgot why Dave wrote this
series in the first place. So this style of iterator isn't useful for what
Dave wants to achieve. Sorry for the noise. Still the possibility to have a
callback under inode->i_lock being able to do stuff and decide whether we
should grab a reference or continue would be useful (and would allow us to
combine the three iterations on unmount into one without too much hassle).

> > #define for_each_sb_inode(sb, inode, inode_eligible)			\
> > 	for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)),	\
> > 	     inode = NULL;						\
> > 	     inode = sb_inode_iter_next((sb), inode, &old_inode,	\
> > 					 inode_eligible);		\
> > 	    )
> 
> And while I liked:
> 
> 	obj = NULL;
> 
> 	while ((obj = get_next_object(foo, obj))) {
> 	}
> 
> style iterators, magic for_each macros that do magic cleanup are just
> a nightmare to read.  Keep it simple and optimize for someone actually
> having to read and understand the code, and not for saving a few lines
> of code.

Well, I agree the above is hard to read but I don't know how to write it in
a more readable way while keeping the properties of the iterator (like
auto-cleanup when you break out of the loop - which is IMO a must for a
sane iterator). Anyway, this is now mostly academic since I agree this
iterator isn't really useful for the situation here.

								Honza
Dave Chinner Oct. 3, 2024, 1:03 p.m. UTC | #19
On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> Hi Dave!
> 
> On Wed 02-10-24 11:33:17, Dave Chinner wrote:
> > There are two superblock iterator functions provided. The first is a
> > generic iterator that provides safe, reference counted inodes for
> > the callback to operate on. This is generally what most sb->s_inodes
> > iterators use, and it allows the iterator to drop locks and perform
> > blocking operations on the inode before moving to the next inode in
> > the sb->s_inodes list.
> > 
> > There is one quirk to this interface - INO_ITER_REFERENCE - because
> > fsnotify iterates the inode cache -after- evict_inodes() has been
> > called during superblock shutdown to evict all non-referenced
> > inodes. Hence it should only find referenced inodes, and it has
> > a check to skip unreferenced inodes. This flag does the same.
> 
> Overall I really like the series. A lot of duplicated code removed and
> scalability improved, we don't get such deals frequently :) Regarding
> INO_ITER_REFERENCE I think that after commit 1edc8eb2e9313 ("fs: call
> fsnotify_sb_delete after evict_inodes") the check for 0 i_count in
> fsnotify_unmount_inodes() isn't that useful anymore so I'd be actually fine
> dropping it (as a separate patch please).
> 
> That being said I'd like to discuss one thing: As you have surely noticed,
> some of the places iterating inodes perform additional checks on the inode
> to determine whether the inode is interesting or not (e.g. the Landlock
> iterator or iterators in quota code) to avoid the unnecessary iget / iput
> and locking dance.

Yes, but we really don't care. None of these cases are performance
critical, and I'd much prefer that we have a consistent behaviour.

> The inode refcount check you've worked around with
> INO_ITER_REFERENCE is a special case of that. Have you considered the option
> of providing a callback for the check inside the iterator?

I did. I decided that it wasn't necessary just to avoid the
occasional iget/iput. It's certainly not necessary for the
fsnotify/landlock cases where INO_ITER_REFERENCE was used because
at that point there are only landlock and fsnotify inodes left in
the cache. We're going to be doing iget/iput on all of them
anyway.

Really, subsystems should be tracking inodes they have references to
themselves, not running 'needle in haystack' searches for inodes
they hold references to. That would get rid of both the fsnotify and
landlock iterators completely...

> Also maybe I went a *bit* overboard here with macro magic but the code
> below should provide an iterator that you can use like:
> 
> 	for_each_sb_inode(sb, inode, inode_eligible_check(inode)) {
> 		do my stuff here
> 	}

As I explained to Kent: wrapping the existing code in a different
iterator defeats the entire purpose of the change to the iteration
code.

> that will avoid any indirect calls and will magically handle all the
> cleanup that needs to be done if you break / jump out of the loop or
> similar. I actually find such constructs more convenient to use than your
> version of the iterator because there's no need to create & pass around the
> additional data structure for the iterator body, no need for special return
> values to abort iteration etc.

I'm not introducing the callback-based iterator function to clean
the code up - I'm introducing it as infrastructure that allows the
*iteration mechanism to be completely replaced* by filesystems that
have more efficient, more scalable  inode iterators already built
in.

This change of iterator model also allows seamless transition of
individual filesystems to new iterator mechanisms. Macro based
iterators do not allow for different iterator implementations to
co-exist, but that's exactly what I'm trying to achieve here.
I'm not trying to clean the code up - I'm trying to lay the
ground-work for new functionality....


-Dave.
Dave Chinner Oct. 3, 2024, 1:35 p.m. UTC | #20
On Thu, Oct 03, 2024 at 02:46:19PM +0200, Jan Kara wrote:
> On Thu 03-10-24 05:18:30, Christoph Hellwig wrote:
> > On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote:
> > > /* Find next inode on the inode list eligible for processing */
> > > #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) 	\
> > > ({									\
> > > 	struct inode *ret = NULL;					\
> > 
> > <snip>
> > 
> > > 	ret;								\
> > > })
> > 
> > How is this going to interact with calling into the file system
> > to do the iteration, which is kinda the point of this series?
> 
> Yeah, I was concentrating on the VFS bits and forgot why Dave wrote this
> series in the first place. So this style of iterator isn't useful for what
> Dave wants to achieve. Sorry for the noise. Still the possibility to have a
> callback under inode->i_lock being able to do stuff and decide whether we

I did that first, and it turned into an utter mess the moment we stepped
outside the existing iterator mechanism.

I implemented a separate XFS icwalk function because having to hold
the inode->i_lock between inode lookup and the callback function
means we cannot do batched inode lookups from the radix tree.

The existing icwalk code grabs 32 inodes at a time from the radix
tree and validates them all, then runs the callback on them one at a
time, then it drops them all.
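
As a rough sketch (not the actual xfs_icwalk code - it assumes a radix
tree indexed by inode number holding VFS inodes, and uses igrab() for
the validation step where XFS really uses its own inode state flags),
that batched pattern looks something like:

#define ICWALK_BATCH	32

static int icwalk_batched(struct radix_tree_root *tree,
			  int (*iter_fn)(struct inode *inode, void *priv),
			  void *priv)
{
	struct inode *batch[ICWALK_BATCH];
	unsigned long first = 0;
	unsigned int nr, i;
	int error = 0;

	do {
		/* grab a batch of inodes under RCU, no spinlocks held */
		rcu_read_lock();
		nr = radix_tree_gang_lookup(tree, (void **)batch, first,
					    ICWALK_BATCH);
		for (i = 0; i < nr; i++) {
			first = batch[i]->i_ino + 1;	/* next batch start */
			if (!igrab(batch[i]))
				batch[i] = NULL;	/* being evicted, skip */
		}
		rcu_read_unlock();

		/* run the callback with no locks held at all */
		for (i = 0; i < nr; i++) {
			if (!batch[i])
				continue;
			if (!error)
				error = iter_fn(batch[i], priv);
			iput(batch[i]);
		}
		cond_resched();
	} while (!error && nr == ICWALK_BATCH);

	return error;
}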

If the VFS inode callback requires the inode i_lock to be held and
be atomic with the initial state checks, then we have to nest 32
spinlocks in what is effectively a random lock order.

So I implemented a non-batched icwalk method, and it didn't get that
much cleaner. It wasn't until I dropped the inode->i_lock from the
callback API that everything cleaned up and the offload mechanism
started to make sense. And it was this change that also makes it
possible for XFS to use its existing batched lookup mechanisms
instead of the special case implementation that I wrote for this
patch set.

IOWs, we can't hold the inode->i_lock across lookup validation to
callback if we want to provide freedom of implementation to the
filesystem specific code. We aren't really concerned about the
performance of these traversals, so I went with freedom of implementation
over clunky locking semantics whose only benefit is optimising away a
couple of atomic ops per inode for iget/iput calls.

-Dave.