Message ID: 20241002014017.3801899-1-david@fromorbit.com
Series: vfs: improving inode cache iteration scalability
On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote: > The management of the sb->s_inodes list is a scalability limitation; > it is protected by a single lock and every inode that is > instantiated has to be added to the list. Those inodes then need to > be removed from the list when evicted from cache. Hence every inode > that moves through the VFS inode cache must take this global scope > lock twice. > > This proves to be a significant limiting factor for concurrent file > access workloads that repeatedly miss the dentry cache on lookup. > Directory search and traversal workloads are particularly prone to > these issues, though on XFS we have enough concurrency capability > in file creation and unlink for the sb->s_inodes list to be a > limitation there as well. > > Previous efforts to solve this problem have > largely centered around reworking the sb->s_inodes list into > something more scalable such as this longstanding patchset does: > > https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@fromorbit.com/ > > However, a recent discussion about inode cache behaviour that arose > from the bcachefs 6.12-rc1 pull request opened a new direction for > us to explore. With both XFS and bcachefs now providing their own > per-superblock inode cache implementations, we should try to make > use of these inode caches as first class citizens. > > With that new direction in mind, it became obvious that XFS could > elide the sb->s_inodes list completely - "the best part is no part" > - if iteration was not reliant on open-coded sb->s_inodes list > walks. > > We already use the internal inode cache for iteration, and we have > filters for selecting specific inodes to operate on with specific > callback operations. If we had an abstraction for iterating > all VFS inodes, we could easily implement that directly on the XFS > inode cache. > > This is what this patchset aims to implement. > > There are two superblock iterator functions provided. The first is a > generic iterator that provides safe, reference counted inodes for > the callback to operate on. This is generally what most sb->s_inodes > iterators use, and it allows the iterator to drop locks and perform > blocking operations on the inode before moving to the next inode in > the sb->s_inodes list. > > There is one quirk to this interface - INO_ITER_REFERENCE - because > fsnotify iterates the inode cache -after- evict_inodes() has been > called during superblock shutdown to evict all non-referenced > inodes. Hence it should only find referenced inodes, and it has > a check to skip unreferenced inodes. This flag does the same. > > However, I suspect this is now somewhat sub-optimal because LSMs can > hold references to inodes beyond evict_inodes(), and they don't get > torn down until after fsnotify evicts the referenced inodes it > holds. However, the landlock LSM doesn't have checks for > unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this > guard is not consistently applied. > > I'm undecided on how best to handle this, but it does not need to be > solved for this patchset to work. fsnotify and > landlock don't need to run -after- evict_inodes(), but moving them > to before evict_inodes() means we now do three full inode cache > iterations to evict all the inodes from the cache. That doesn't seem > like a good idea when there might be hundreds of millions of cached > inodes at unmount.
> > Similarly, needing the iterator to be aware that there should be no > unreferenced inodes left when they run doesn't seem like a good > idea, either. So perhaps the answer is that the iterator checks for > SB_ACTIVE (or some other similar flag) that indicates the superblock > is being torn down and so will skip zero-referenced inodes > automatically in this case. Like I said - this doesn't need to be > solved right now, it's just something to be aware of. > > The second iterator is the "unsafe" iterator variant that only > provides the callback with an existence guarantee. It does this by > holding the rcu_read_lock() to guarantee that the inode is not freed > from under the callback. There are no validity checks performed on > the inode - it is entirely up to the callback to validate the inode > can be operated on safely. > > Hence the "unsafe" variant is for very specific internal uses > only. Nobody should be adding new uses of this function, > as there are very few good reasons for external access to inodes > without holding a valid reference. I have not decided whether the > unsafe callbacks should require a lockdep_assert_in_rcu_read_lock() > check in them to clearly document the context under which they are > running. > > The patchset converts all the open coded iterators to use these > new iterator functions, which means the only use of sb->s_inodes > is now confined to fs/super.c (iterator API) and fs/inode.c > (add/remove API). A new superblock operation is then added to > call out from the iterators into the filesystem to allow them to run > the iteration instead of walking the sb->s_inodes list. > > XFS is then converted to use this new superblock operation. I didn't > use the existing iterator function for this functionality right now > as it is currently based on radix tree tag lookups. It also uses a > batched 'lookup and lock' mechanism that complicated matters as I > developed this code. Hence I open coded a new, simpler cache walk > for testing purposes. > > Now that I have stuff working and I think I have the callback API > semantics settled, batched radix tree lookups should still work to > minimise the iteration overhead. Also, we might want to tag VFS > inodes in the radix tree so that we can filter them efficiently for > traversals. This would allow us to use the existing generic inode > cache walker rather than a separate variant as this patch set > implements. This can be done as future work, though. > > In terms of scalability improvements, a quick 'will it scale' test > demonstrates where the sb->s_inodes list hurts. Running a sharded, > share-nothing cold cache workload with 100,000 files per thread in > per-thread directories gives the following results on a 4-node 64p > machine with 128GB RAM. > > The workloads "walk", "chmod" and "unlink" are all directory > traversal workloads that stream cold cache inodes into the cache. > There is enough memory on this test machine that these inodes are > not being reclaimed during the workload, and are being freed between > steps via drop_caches (which iterates the inode cache and so > explicitly tests the new iteration APIs!). Hence the sb->s_inodes > scalability issues aren't as bad in these tests as when memory is > tight and inodes are being reclaimed (i.e. the issues are worse in > real workloads). > > The "bulkstat" workload uses the XFS bulkstat ioctl to iterate > inodes via walking the internal inode btrees.
It uses > d_mark_dontcache() so it is actually tearing down each inode as soon > as it has been sampled by the bulkstat code. Hence it is doing two > sb->s_inodes list manipulations per inode and so shows scalability > issues much earlier than the other workloads.
>
> Before:
>
> Filesystem  Files     Threads  Create   Walk     Chmod    Unlink   Bulkstat
> xfs         400000    4         4.269    3.225    4.557    7.316    1.306
> xfs         800000    8         4.844    3.227    4.702    7.905    1.908
> xfs         1600000   16        6.286    3.296    5.592    8.838    4.392
> xfs         3200000   32        8.912    5.681    8.505   11.724    7.085
> xfs         6400000   64       15.344   11.144   14.162   18.604   15.494
>
> Bulkstat starts to show issues at 8 threads, walk and chmod between > 16 and 32 threads, and unlink is limited by internal XFS stuff. > Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the > sb->s_inodes list management.
>
> After:
>
> Filesystem  Files     Threads  Create   Walk     Chmod    Unlink   Bulkstat
> xfs         400000    4         4.140    3.502    4.154    7.242    1.164
> xfs         800000    8         4.637    2.836    4.444    7.896    1.093
> xfs         1600000   16        5.549    3.054    5.213    8.696    1.107
> xfs         3200000   32        8.387    3.218    6.867   10.668    1.125
> xfs         6400000   64       14.112    3.953   10.365   18.620    1.270
>
> When patched, walk shows little in the way of scalability degradation > out to 64 threads, chmod is significantly improved at 32-64 threads, > and bulkstat shows perfect scalability out to 64 threads now. > > I did a couple of other longer running, higher inode count tests > with bulkstat to get an idea of inode cache streaming rates - 32 > million inodes scanned in 4.4 seconds at 64 threads. That's about > 7.2 million inodes/s being streamed through the inode cache with > IO rates peaking well above 5.5GB/s (near IO bound). > > Hence raw VFS inode cache throughput sees a ~17x scalability > improvement on XFS at 64 threads (and probably a -lot- more on > higher CPU count machines). That's far better performance than I > ever got from the dlist conversion of the sb->s_inodes list in > previous patchsets, so this seems like a much better direction to be > heading for optimising the way we cache inodes. > > I haven't done a lot of testing on this patchset yet - it boots and > appears to work OK for block devices, ext4 and XFS, but checking > that stuff like quota on/off still works properly on ext4 hasn't > been done yet. > > What do people think of moving towards per-sb inode caching and > traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even independent of the rest. I don't have big conceptual issues with the series otherwise. The only thing that makes me a bit uneasy is that we are now providing an api that may encourage filesystems to do their own inode caching even if they don't really have a need for it just because it's there. So really a way that would've solved this issue generically would have been my preference. But the reality is that xfs has been doing that private inode cache for a long time and reading through 5/7 and 6/7 it clearly provides value for xfs. So I find it hard to object to adding ->iter_vfs_inodes() (Though I would like to s/iter_vfs_inodes/iter_inodes/g).
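For illustration, a rough sketch of the iteration API being discussed - the names super_iter_inodes(), INO_ITER_REFERENCE and ->iter_vfs_inodes() come from the cover letter and replies, but the exact signatures below are assumptions, not the posted patches:

/* Hypothetical shape of the proposed API; signatures are assumptions. */
typedef int (*ino_iter_fn)(struct inode *inode, void *private);

#define INO_ITER_REFERENCE	(1U << 0)	/* skip unreferenced inodes */

/*
 * Safe variant: grabs a reference to each inode before running the
 * callback, so the callback can block and drop locks.
 */
int super_iter_inodes(struct super_block *sb, ino_iter_fn fn,
		      void *private, unsigned int flags);

/*
 * "Unsafe" variant: existence guarantee only, provided by holding
 * rcu_read_lock() across the callback. Internal VFS use only.
 */
int super_iter_inodes_unsafe(struct super_block *sb, ino_iter_fn fn,
			     void *private);

struct super_operations {
	/* ... */
	/*
	 * Filesystems with their own inode cache (XFS, bcachefs) can
	 * implement this to run the iteration themselves instead of
	 * having the VFS walk sb->s_inodes.
	 */
	int (*iter_vfs_inodes)(struct super_block *sb, ino_iter_fn fn,
			       void *private, unsigned int flags);
};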
On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote: > > What do people think of moving towards per-sb inode caching and > > traversal mechanisms like this? > > Patches 1-4 are great cleanups that I would like us to merge even > independent of the rest. Yes, they make it much easier to manage the iteration code. > I don't have big conceptual issues with the series otherwise. The only > thing that makes me a bit uneasy is that we are now providing an api > that may encourage filesystems to do their own inode caching even if > they don't really have a need for it just because it's there. So really > a way that would've solved this issue generically would have been my > preference. Well, that's the problem, isn't it? :/ There really isn't a good generic solution for global list access and management. The dlist stuff kinda works, but it still has significant overhead and doesn't get rid of spinlock contention completely because of the lack of locality between list add and remove operations. i.e. dlist is optimised for low contention add operations (i.e. local to the CPU). However, removal is not a local operation - it almost always happens on a different CPU to the add operation. Hence removal always pulls the list and lock away from the CPU that "owns" them, and hence there is still contention when inodes are streaming through memory. This causes enough overhead that dlist operations are still very visible in CPU profiles during scalability testing... XFS (and now bcachefs) have their own per-sb inode cache implementations, and hence for them the sb->s_inodes list is pure overhead. If we restructure the generic inode cache infrastructure to also be per-sb (this suggestion from Linus was what led me to this patch set), then they will also likely not need the sb->s_inodes list, too. That's the longer term "generic solution" to the sb->s_inodes list scalability problem (i.e. get rid of it!), but it's a much larger and longer term undertaking. Once we know what that new generic inode cache infrastructure looks like, we'll probably only want to be converting one filesystem at a time to the new infrastructure. We'll need infrastructure to allow alternative per-sb iteration mechanisms for such a conversion to take place - the converted filesystems will likely call a generic ->iter_vfs_inodes() implementation based on the per-sb inode cache infrastructure rather than iterating sb->s_inodes. Eventually, we'll end up with that generic method replacing the sb->s_inodes iteration, and only a couple of filesystems using the callout again. > But the reality is that xfs has been doing that private inode cache for > a long time and reading through 5/7 and 6/7 it clearly provides value > for xfs. So I find it hard to object to adding ->iter_vfs_inodes() > (Though I would like to s/iter_vfs_inodes/iter_inodes/g). I named it that way because, from my filesystem-centric point of view, there is a very distinct separation between VFS and filesystem inodes. The VFS inode (struct inode) is a subset of the filesystem inode structure and, in XFS's case, a subset of the filesystem inode life cycle, too. i.e. this method should not iterate cached filesystem inodes that exist outside the VFS inode lifecycle or VFS visibility even though they may be present in the filesystem's internal inode cache. -Dave.
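The VFS/filesystem inode split Dave describes is visible in how XFS embeds the VFS inode inside its own inode structure - roughly (simplified from the XFS source, only the relevant field shown):

struct xfs_inode {
	/* ... XFS-specific state: extent maps, locks, log items ... */
	struct inode	i_vnode;	/* the embedded VFS inode */
};

static inline struct inode *VFS_I(struct xfs_inode *ip)
{
	return &ip->i_vnode;
}

static inline struct xfs_inode *XFS_I(struct inode *inode)
{
	return container_of(inode, struct xfs_inode, i_vnode);
}

Cached XFS inodes that have not been (or are no longer) initialised as VFS inodes are exactly what an ->iter_vfs_inodes() implementation has to filter out.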
On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote: > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote: > > > What do people think of moving towards per-sb inode caching and > > > traversal mechanisms like this? > > > > Patches 1-4 are great cleanups that I would like us to merge even > > independent of the rest. > > Yes, they make it much easier to manage the iteration code. > > > I don't have big conceptual issues with the series otherwise. The only > > thing that makes me a bit uneasy is that we are now providing an api > > that may encourage filesystems to do their own inode caching even if > > they don't really have a need for it just because it's there. So really > > a way that would've solved this issue generically would have been my > > preference. > > Well, that's the problem, isn't it? :/ > > There really isn't a good generic solution for global list access > and management. The dlist stuff kinda works, but it still has > significant overhead and doesn't get rid of spinlock contention > completely because of the lack of locality between list add and > remove operations.

There is though; I haven't posted it yet because it still needs some work, but the concept works and performs about the same as dlock-list.

https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list

The thing that needs to be sorted before posting is that it can't shrink the radix tree. generic-radix-tree doesn't support shrinking, and I could add that, but then ida doesn't provide a way to query the highest id allocated (xarray doesn't support backwards iteration). So I'm going to try it using idr and see how that performs (idr is not really the right data structure for this, split ida and item radix tree is better, so might end up doing something else).

But - this approach with more work will work for the list_lru lock contention as well.

From 32cb8103ecfacdd5ed8e1eb390221c3f8339de6f Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Sat, 28 Sep 2024 16:22:38 -0400
Subject: [PATCH] lib/fast_list.c

A fast "list" data structure, which is actually a radix tree, with an IDA for slot allocation and a percpu buffer on top of that.

Items cannot be added or moved to the head or tail, only added at some (arbitrary) position and removed. The advantage is that adding, removing and iteration is generally lockless, only hitting the lock in ida when the percpu buffer is full or empty.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/fast_list.h b/include/linux/fast_list.h
new file mode 100644
index 000000000000..7d5d8592864d
--- /dev/null
+++ b/include/linux/fast_list.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_FAST_LIST_H
+#define _LINUX_FAST_LIST_H
+
+#include <linux/generic-radix-tree.h>
+#include <linux/idr.h>
+#include <linux/percpu.h>
+
+struct fast_list_pcpu;
+
+struct fast_list {
+        GENRADIX(void *)        items;
+        struct ida              slots_allocated;
+        struct fast_list_pcpu   *buffer;
+};
+
+int fast_list_get_idx(struct fast_list *l);
+int fast_list_add(struct fast_list *l, void *item);
+void fast_list_remove(struct fast_list *l, unsigned idx);
+void fast_list_exit(struct fast_list *l);
+int fast_list_init(struct fast_list *l);
+
+#endif /* _LINUX_FAST_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 773adf88af41..85cf5a0d36b1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -49,7 +49,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
         bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
         percpu-refcount.o rhashtable.o base64.o \
         once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
-        generic-radix-tree.o bitmap-str.o
+        generic-radix-tree.o bitmap-str.o fast_list.o
 obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
 obj-y += string_helpers.o
 obj-$(CONFIG_STRING_HELPERS_KUNIT_TEST) += string_helpers_kunit.o
diff --git a/lib/fast_list.c b/lib/fast_list.c
new file mode 100644
index 000000000000..bbb69bb29687
--- /dev/null
+++ b/lib/fast_list.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fast, unordered lists
+ *
+ * Supports add, remove, and iterate
+ *
+ * Underneath, they're a radix tree and an IDA, with a percpu buffer for slot
+ * allocation and freeing.
+ *
+ * This means that adding, removing, and iterating over items is lockless,
+ * except when refilling/emptying the percpu slot buffers.
+ */
+
+#include <linux/fast_list.h>
+
+struct fast_list_pcpu {
+        size_t                  nr;
+        size_t                  entries[31];
+};
+
+/**
+ * fast_list_get_idx - get a slot in a fast_list
+ * @l:      list to get slot in
+ *
+ * This allocates a slot in the radix tree without storing to it, so that we can
+ * take the potential memory allocation failure early and do the list add later
+ * when we can't take an allocation failure.
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_get_idx(struct fast_list *l)
+{
+        int idx;
+
+        preempt_disable();
+        struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+        if (unlikely(!lp->nr))
+                while (lp->nr <= ARRAY_SIZE(lp->entries) / 2) {
+                        idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_NOWAIT|__GFP_NOWARN);
+                        if (unlikely(idx < 0)) {
+                                preempt_enable();
+                                idx = ida_alloc_range(&l->slots_allocated, 1, ~0, GFP_KERNEL);
+                                if (unlikely(idx < 0))
+                                        return idx;
+
+                                preempt_disable();
+                                lp = this_cpu_ptr(l->buffer);
+                        }
+
+                        if (unlikely(!genradix_ptr_alloc_inlined(&l->items, idx,
+                                                GFP_NOWAIT|__GFP_NOWARN))) {
+                                preempt_enable();
+                                if (!genradix_ptr_alloc(&l->items, idx, GFP_KERNEL)) {
+                                        ida_free(&l->slots_allocated, idx);
+                                        return -ENOMEM;
+                                }
+
+                                preempt_disable();
+                                lp = this_cpu_ptr(l->buffer);
+                        }
+
+                        if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+                                ida_free(&l->slots_allocated, idx);
+                        else
+                                lp->entries[lp->nr++] = idx;
+                }
+
+        idx = lp->entries[--lp->nr];
+        preempt_enable();
+
+        return idx;
+}
+
+/**
+ * fast_list_add - add an item to a fast_list
+ * @l:      list
+ * @item:   item to add
+ *
+ * Allocates a slot in the radix tree and stores to it and then returns the
+ * slot index, which must be passed to fast_list_remove().
+ *
+ * Returns: positive integer on success, -ENOMEM on failure
+ */
+int fast_list_add(struct fast_list *l, void *item)
+{
+        int idx = fast_list_get_idx(l);
+        if (idx < 0)
+                return idx;
+
+        *genradix_ptr_inlined(&l->items, idx) = item;
+        return idx;
+}
+
+/**
+ * fast_list_remove - remove an item from a fast_list
+ * @l:      list
+ * @idx:    item's slot index
+ *
+ * Zeroes out the slot in the radix tree and frees the slot for future
+ * fast_list_add() operations.
+ */
+void fast_list_remove(struct fast_list *l, unsigned idx)
+{
+        if (!idx)
+                return;
+
+        *genradix_ptr_inlined(&l->items, idx) = NULL;
+
+        preempt_disable();
+        struct fast_list_pcpu *lp = this_cpu_ptr(l->buffer);
+
+        if (unlikely(lp->nr == ARRAY_SIZE(lp->entries)))
+                while (lp->nr >= ARRAY_SIZE(lp->entries) / 2) {
+                        ida_free(&l->slots_allocated, idx);
+                        idx = lp->entries[--lp->nr];
+                }
+
+        lp->entries[lp->nr++] = idx;
+        preempt_enable();
+}
+
+void fast_list_exit(struct fast_list *l)
+{
+        /* XXX: warn if list isn't empty */
+        free_percpu(l->buffer);
+        ida_destroy(&l->slots_allocated);
+        genradix_free(&l->items);
+}
+
+int fast_list_init(struct fast_list *l)
+{
+        genradix_init(&l->items);
+        ida_init(&l->slots_allocated);
+        l->buffer = alloc_percpu(*l->buffer);
+        if (!l->buffer)
+                return -ENOMEM;
+        return 0;
+}
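A hypothetical caller of the API above might look like this - fast_list_add()/fast_list_remove() come from the patch, while the tracked object type and the genradix_for_each() walk shown here are illustrative assumptions:

struct my_obj {                         /* hypothetical tracked object */
        int             list_idx;      /* slot returned by fast_list_add() */
        /* ... */
};

static struct fast_list tracked_objs;

static int track_obj(struct my_obj *obj)
{
        int idx = fast_list_add(&tracked_objs, obj);    /* may sleep, may fail */

        if (idx < 0)
                return idx;
        obj->list_idx = idx;
        return 0;
}

static void untrack_obj(struct my_obj *obj)
{
        fast_list_remove(&tracked_objs, obj->list_idx);
}

static void for_each_tracked_obj(void (*fn)(struct my_obj *))
{
        struct genradix_iter iter;
        void **slot;

        /* Lockless walk: removed or never-filled slots read back as NULL. */
        genradix_for_each(&tracked_objs.items, iter, slot)
                if (*slot)
                        fn(*slot);
}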
On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > I don't have big conceptual issues with the series otherwise. The only > > thing that makes me a bit uneasy is that we are now providing an api > > that may encourage filesystems to do their own inode caching even if > > they don't really have a need for it just because it's there. So really > > a way that would've solved this issue generically would have been my > > preference. > > Well, that's the problem, isn't it? :/ > > There really isn't a good generic solution for global list access > and management. The dlist stuff kinda works, but it still has > significant overhead and doesn't get rid of spinlock contention > completely because of the lack of locality between list add and > remove operations. I much prefer the approach taken in your patch series, to let the filesystem own the inode list and keeping the old model as the "default list". In many ways, that is how *most* of the VFS layer works - it exposes helper functions that the filesystems can use (and most do), but doesn't force them. Yes, the VFS layer does force some things - you can't avoid using dentries, for example, because that's literally how the VFS layer deals with filenames (and things like mounting etc). And honestly, the VFS layer does a better job of filename caching than any filesystem really can do, and with the whole UNIX mount model, filenames fundamentally cross filesystem boundaries anyway. But clearly the VFS layer inode list handling isn't the best it can be, and unless we can fix that in some fundamental way (and I don't love the "let's use crazy lists instead of a simple one" models) I do think that just letting filesystems do their own thing if they have something better is a good model. That's how we deal with all the basic IO, after all. The VFS layer has lots of support routines, but filesystems don't *have* to use things like generic_file_read_iter() and friends. Yes, most filesystems do use generic_file_read_iter() in some form or other (sometimes raw, sometimes wrapped with filesystem logic), because it fits their model, it's convenient, and it handles all the normal stuff well, but you don't *have* to use it if you have special needs. Taking that approach to the inode caching sounds sane to me, and I generally like Dave's series. It looks like an improvement to me. Linus
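The "helper, not mandate" model Linus describes is the usual ->read_iter pattern: a filesystem wraps the generic helper with its own logic rather than being forced into a single implementation. A sketch for a hypothetical filesystem "foo" (the foo_* helpers are assumptions):

static ssize_t foo_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        if (foo_fs_shutdown(inode->i_sb))       /* fs-specific guard (assumed) */
                return -EIO;

        /* The generic page-cache based implementation does the real work. */
        ret = generic_file_read_iter(iocb, to);

        foo_update_read_stats(inode, ret);      /* fs-specific accounting (assumed) */
        return ret;
}

const struct file_operations foo_file_operations = {
        .read_iter      = foo_file_read_iter,
        /* ... */
};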
On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote: > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote: > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > > > I don't have big conceptual issues with the series otherwise. The only > > > thing that makes me a bit uneasy is that we are now providing an api > > > that may encourage filesystems to do their own inode caching even if > > > they don't really have a need for it just because it's there. So really > > > a way that would've solved this issue generically would have been my > > > preference. > > > > Well, that's the problem, isn't it? :/ > > > > There really isn't a good generic solution for global list access > > and management. The dlist stuff kinda works, but it still has > > significant overhead and doesn't get rid of spinlock contention > > completely because of the lack of locality between list add and > > remove operations. > > I much prefer the approach taken in your patch series, to let the > filesystem own the inode list and keeping the old model as the > "default list". > > In many ways, that is how *most* of the VFS layer works - it exposes > helper functions that the filesystems can use (and most do), but > doesn't force them. > > Yes, the VFS layer does force some things - you can't avoid using > dentries, for example, because that's literally how the VFS layer > deals with filenames (and things like mounting etc). And honestly, the > VFS layer does a better job of filename caching than any filesystem > really can do, and with the whole UNIX mount model, filenames > fundamentally cross filesystem boundaries anyway. > > But clearly the VFS layer inode list handling isn't the best it can > be, and unless we can fix that in some fundamental way (and I don't > love the "let's use crazy lists instead of a simple one" models) I do > think that just letting filesystems do their own thing if they have > something better is a good model. Well, I don't love adding more indirection and callbacks. The underlying approach in this patchset of "just use the inode hash table if that's available" - that I _do_ like, but this seems like the wrong way to go about it, we're significantly adding to the amount of special purpose "things" filesystems have to do if they want to perform well. Converting the standard inode hash table to an rhashtable (or more likely, creating a new standard implementation and converting filesystems one at a time) still needs to happen, and then the "use the hash table for iteration" approach could use that without every filesystem having to specialize. Failing that, or even regardless, I think we do need either dlock-list or fast-list. "I need some sort of generic list, but fast" is something I've seen come up way too many times.
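For reference, the kind of conversion being suggested might look something like the sketch below - an rhashtable per superblock keyed on inode number. The i_rhash and s_inode_ht members are assumptions for illustration (struct inode currently uses an hlist_node in the global hash table), and a real conversion would also have to deal with unhashed inodes and filesystems that key the hash on more than i_ino:

/* Illustrative only - field names are assumptions, not existing members. */
static const struct rhashtable_params inode_ht_params = {
        .key_len                = sizeof_field(struct inode, i_ino),
        .key_offset             = offsetof(struct inode, i_ino),
        .head_offset            = offsetof(struct inode, i_rhash),     /* assumed */
        .automatic_shrinking    = true,
};

static int sb_inode_ht_init(struct super_block *sb)
{
        return rhashtable_init(&sb->s_inode_ht, &inode_ht_params);     /* assumed */
}

static struct inode *sb_ilookup(struct super_block *sb, unsigned long ino)
{
        return rhashtable_lookup_fast(&sb->s_inode_ht, &ino, inode_ht_params);
}

static int sb_insert_inode(struct super_block *sb, struct inode *inode)
{
        return rhashtable_lookup_insert_fast(&sb->s_inode_ht, &inode->i_rhash,
                                             inode_ht_params);
}

/* Iteration for the generic ->iter_vfs_inodes() implementation would then
 * use the rhashtable_walk_enter() family rather than walking sb->s_inodes. */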
On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote: > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote: > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote: > > > > What do people think of moving towards per-sb inode caching and > > > > traversal mechanisms like this? > > > > > > Patches 1-4 are great cleanups that I would like us to merge even > > > independent of the rest. > > > > Yes, they make it much easier to manage the iteration code. > > > > > I don't have big conceptual issues with the series otherwise. The only > > > thing that makes me a bit uneasy is that we are now providing an api > > > that may encourage filesystems to do their own inode caching even if > > > they don't really have a need for it just because it's there. So really > > > a way that would've solved this issue generically would have been my > > > preference. > > > > Well, that's the problem, isn't it? :/ > > > > There really isn't a good generic solution for global list access > > and management. The dlist stuff kinda works, but it still has > > significant overhead and doesn't get rid of spinlock contention > > completely because of the lack of locality between list add and > > remove operations. > > There is though; I haven't posted it yet because it still needs some > work, but the concept works and performs about the same as dlock-list. > > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list > > The thing that needs to be sorted before posting is that it can't shrink > the radix tree. generic-radix-tree doesn't support shrinking, and I > could add that, but then ida doesn't provide a way to query the highest > id allocated (xarray doesn't support backwards iteration). That's an interesting construct, but... > So I'm going to try it using idr and see how that performs (idr is not > really the right data structure for this, split ida and item radix tree > is better, so might end up doing something else). > > But - this approach with more work will work for the list_lru lock > contention as well. .... it isn't a generic solution because it is dependent on blocking memory allocation succeeding for list_add() operations. Hence this cannot do list operations under external synchronisation constructs like spinlocks or rcu_read_lock(). It also introduces interesting interactions with memory reclaim - what happens when we have to add an object to one of these lists from memory reclaim context?

Taking the example of list_lru, this list construct will not work for a variety of reasons. Some of them are:

- list_lru_add() being called from list_lru_add_obj() under RCU for memcg aware LRUs so cannot block and must not fail.
- list_lru_add_obj() is called under spinlocks from inode_lru_add(), the xfs buffer and dquot caches, the workingset code from under the address space mapping xarray lock, etc. Again, this must not fail.
- list_lru_add() operations can take place in large numbers in memory reclaim context (e.g. dentry reclaim drops inodes which adds them to the inode lru). Hence memory reclaim becomes even more dependent on PF_MEMALLOC memory allocation making forwards progress.
- adding long tail list latency to what are currently O(1) fast path operations (e.g. multiple allocations/tree splits for LRUs tracking millions of objects) is not desirable.
- LRU lists are -ordered- (it's right there in the name!) and this appears to be an unordered list construct.
So while I think this is an interesting idea that might be useful in some cases, I don't think it is a viable generic scalable list construct we can use in areas like list_lru or global list management that run under external synchronisation mechanisms. -Dave.
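The constraint Dave is pointing at is visible in the existing inode LRU add path: it runs with inode->i_lock held and the list_lru add has no way to fail or sleep. Roughly (a simplified sketch of the fs/inode.c path; the real function has more state checks):

/*
 * Simplified sketch: called from iput_final() and friends with
 * inode->i_lock held, so the add must be non-blocking and must not fail.
 */
static void inode_lru_add(struct inode *inode)
{
        /* real code also checks i_state, i_count and SB_ACTIVE first */
        if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
                this_cpu_inc(nr_unused);
}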
On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote: > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote: > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > > > > > I don't have big conceptual issues with the series otherwise. The only > > > > thing that makes me a bit uneasy is that we are now providing an api > > > > that may encourage filesystems to do their own inode caching even if > > > > they don't really have a need for it just because it's there. So really > > > > a way that would've solved this issue generically would have been my > > > > preference. > > > > > > Well, that's the problem, isn't it? :/ > > > > > > There really isn't a good generic solution for global list access > > > and management. The dlist stuff kinda works, but it still has > > > significant overhead and doesn't get rid of spinlock contention > > > completely because of the lack of locality between list add and > > > remove operations. > > > > I much prefer the approach taken in your patch series, to let the > > filesystem own the inode list and keeping the old model as the > > "default list". > > > > In many ways, that is how *most* of the VFS layer works - it exposes > > helper functions that the filesystems can use (and most do), but > > doesn't force them. > > > > Yes, the VFS layer does force some things - you can't avoid using > > dentries, for example, because that's literally how the VFS layer > > deals with filenames (and things like mounting etc). And honestly, the > > VFS layer does a better job of filename caching than any filesystem > > really can do, and with the whole UNIX mount model, filenames > > fundamentally cross filesystem boundaries anyway. > > > > But clearly the VFS layer inode list handling isn't the best it can > > be, and unless we can fix that in some fundamental way (and I don't > > love the "let's use crazy lists instead of a simple one" models) I do > > think that just letting filesystems do their own thing if they have > > something better is a good model. > > Well, I don't love adding more indirection and callbacks. It's way better than open coding inode cache traversals everywhere. The callback model is simply "call this function on every object", and it allows implementations the freedom to decide how they are going to run those callbacks. For example, this abstraction allows XFS to parallelise the traversal. We currently run the traversal across all inodes in a single thread, but now that XFS is walking the inode cache we can push each shard off to a workqueue and run each shard concurrently. IOWs, we can actually make the traversal of large caches much, much faster without changing the semantics of the operation the traversal is trying to achieve. We simply cannot do things like that without a new iteration model. Abstraction is necessary to facilitate a new iteration model, and a model that provides independent object callbacks allows scope for concurrent processing of individual objects. > The underlying approach in this patchset of "just use the inode hash > table if that's available" - that I _do_ like, but this seems like > the wrong way to go about it, we're significantly adding to the amount > of special purpose "things" filesystems have to do if they want to > perform well. I've already addressed this in my response to Christian.
This is a mechanism that allows filesystems to be moved one-by-one to a new generic cache and iteration implementation without impacting existing code. Once we have that, scalability of the inode cache and traversals should not be a reason for filesystems "doing their own thing" because the generic infrastructure will be sufficient for most filesystem implementations. > Converting the standard inode hash table to an rhashtable (or more > likely, creating a new standard implementation and converting > filesystems one at a time) still needs to happen, and then the "use the > hash table for iteration" approach could use that without every > filesystem having to specialize. Yes, but this still doesn't help filesystems like XFS where the structure of the inode cache is highly optimised for the specific on-disk and in-memory locality of inodes. We aren't going to be converting XFS to a rhashtable based inode cache anytime soon because it simply doesn't provide the functionality we require. e.g. efficient lockless sequential inode number ordered traversal in -every- inode cluster writeback operation. > Failing that, or even regardless, I think we do need either dlock-list > or fast-list. "I need some sort of generic list, but fast" is something > I've seen come up way too many times. There's nothing stopping you from using the dlist patchset for your own purposes. It's public code - just make sure you retain the correct attributions. :) -Dave.
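As a sketch of what that concurrency could look like once the filesystem owns the iteration - hypothetical code, not part of the posted series; the per-AG walk helper is an assumption and the callback type is the one sketched earlier:

struct ag_iter_work {
        struct work_struct      work;
        struct xfs_perag        *pag;
        ino_iter_fn             fn;
        void                    *private;
};

static void xfs_iter_ag_workfn(struct work_struct *work)
{
        struct ag_iter_work *w = container_of(work, struct ag_iter_work, work);

        /* walk all cached VFS inodes in this AG, running the callback */
        xfs_icwalk_ag_vfs_inodes(w->pag, w->fn, w->private);    /* assumed helper */
}

static int xfs_iter_vfs_inodes_parallel(struct super_block *sb,
                                        ino_iter_fn fn, void *private)
{
        struct xfs_mount *mp = XFS_M(sb);
        struct ag_iter_work *works;
        xfs_agnumber_t agno;
        int i;

        works = kcalloc(mp->m_sb.sb_agcount, sizeof(*works), GFP_KERNEL);
        if (!works)
                return -ENOMEM;

        /* one work item per shard (allocation group), all run concurrently */
        for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
                works[agno].pag = xfs_perag_get(mp, agno);
                works[agno].fn = fn;
                works[agno].private = private;
                INIT_WORK(&works[agno].work, xfs_iter_ag_workfn);
                queue_work(system_unbound_wq, &works[agno].work);
        }

        /* wait for all shards to finish */
        for (i = 0; i < mp->m_sb.sb_agcount; i++) {
                flush_work(&works[i].work);
                xfs_perag_put(works[i].pag);
        }

        kfree(works);
        return 0;
}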
On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote: > On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote: > > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote: > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote: > > > > > What do people think of moving towards per-sb inode caching and > > > > > traversal mechanisms like this? > > > > > > > > Patches 1-4 are great cleanups that I would like us to merge even > > > > independent of the rest. > > > > > > Yes, they make it much easier to manage the iteration code. > > > > > > > I don't have big conceptual issues with the series otherwise. The only > > > > thing that makes me a bit uneasy is that we are now providing an api > > > > that may encourage filesystems to do their own inode caching even if > > > > they don't really have a need for it just because it's there. So really > > > > a way that would've solved this issue generically would have been my > > > > preference. > > > > > > Well, that's the problem, isn't it? :/ > > > > > > There really isn't a good generic solution for global list access > > > and management. The dlist stuff kinda works, but it still has > > > significant overhead and doesn't get rid of spinlock contention > > > completely because of the lack of locality between list add and > > > remove operations. > > > > There is though; I haven't posted it yet because it still needs some > > work, but the concept works and performs about the same as dlock-list. > > > > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list > > > > The thing that needs to be sorted before posting is that it can't shrink > > the radix tree. generic-radix-tree doesn't support shrinking, and I > > could add that, but then ida doesn't provide a way to query the highest > > id allocated (xarray doesn't support backwards iteration). > > That's an interesting construct, but... > > > So I'm going to try it using idr and see how that performs (idr is not > > really the right data structure for this, split ida and item radix tree > > is better, so might end up doing something else). > > > > But - this approach with more work will work for the list_lru lock > > contention as well. > > .... it isn't a generic solution because it is dependent on > blocking memory allocation succeeding for list_add() operations. > > Hence this cannot do list operations under external synchronisation > constructs like spinlocks or rcu_read_lock(). It also introduces > interesting interactions with memory reclaim - what happens we have > to add an object to one of these lists from memory reclaim context? > > Taking the example of list_lru, this list construct will not work > for a variety of reasons. Some of them are: > > - list_lru_add() being called from list_lru_add_obj() under RCU for > memcg aware LRUs so cannot block and must not fail. > - list_lru_add_obj() is called under spinlocks from inode_lru_add(), > the xfs buffer and dquot caches, the workingset code from under > the address space mapping xarray lock, etc. Again, this must not > fail. > - list_lru_add() operations take can place in large numbers in > memory reclaim context (e.g. dentry reclaim drops inodes which > adds them to the inode lru). Hence memory reclaim becomes even > more dependent on PF_MEMALLOC memory allocation making forwards > progress. > - adding long tail list latency to what are currently O(1) fast path > operations (e.g. 
multiple allocations/tree splits for LRUs > tracking millions of objects) is not desirable. > > So while I think this is an interesting idea that might be useful in > some cases, I don't think it is a viable generic scalable list > construct we can use in areas like list_lru or global list > management that run under external synchronisation mechanisms. There are difficulties, but given the fundamental scalability and locking issues with linked lists, I think this is the approach we want if we can make it work. A couple things that help - we've already determined that the inode LRU can go away for most filesystems, and we can preallocate slots without actually adding objects. Iteration will see NULLs that they skip over, so we can't simply preallocate a slot for everything if nr_live_objects / nr_lru_objects is too big. But, we can certainly preallocate slots on a given code path and then release them back to the percpu buffer if they're not used. > - LRU lists are -ordered- (it's right there in the name!) and this > appears to be an unordered list construct. Yes, it is. But in actual practice cache replacement policy tends not to matter nearly as much as people think; there are many papers showing real world hit ratio of common algorithms is only a fudge factor from random replacement - the main thing you want is an accessed bit (or counter, if you want the analogous version of n-lru for n > 2), and we'll still have that.
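With the fast_list interface posted earlier in the thread, that preallocation pattern would look roughly like this (hypothetical caller; the context structure is an assumption):

/* Reserve a slot while blocking allocation is still allowed... */
static int reserve_slot(struct fast_list *l, struct my_ctx *ctx)
{
        int idx = fast_list_get_idx(l);         /* may sleep, may fail */

        if (idx < 0)
                return idx;
        ctx->reserved_idx = idx;
        return 0;
}

/* ...fill it later from a context that cannot fail or sleep... */
static void commit_slot(struct fast_list *l, struct my_ctx *ctx, void *obj)
{
        *genradix_ptr_inlined(&l->items, ctx->reserved_idx) = obj;
}

/* ...or hand an unused slot back. The slot was never filled, so this
 * just returns the index to the percpu buffer / IDA. */
static void release_slot(struct fast_list *l, struct my_ctx *ctx)
{
        fast_list_remove(l, ctx->reserved_idx);
}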
On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote: > On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote: > > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote: > > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > > > > > > > I don't have big conceptual issues with the series otherwise. The only > > > > > thing that makes me a bit uneasy is that we are now providing an api > > > > > that may encourage filesystems to do their own inode caching even if > > > > > they don't really have a need for it just because it's there. So really > > > > > a way that would've solved this issue generically would have been my > > > > > preference. > > > > > > > > Well, that's the problem, isn't it? :/ > > > > > > > > There really isn't a good generic solution for global list access > > > > and management. The dlist stuff kinda works, but it still has > > > > significant overhead and doesn't get rid of spinlock contention > > > > completely because of the lack of locality between list add and > > > > remove operations. > > > > > > I much prefer the approach taken in your patch series, to let the > > > filesystem own the inode list and keeping the old model as the > > > "default list". > > > > > > In many ways, that is how *most* of the VFS layer works - it exposes > > > helper functions that the filesystems can use (and most do), but > > > doesn't force them. > > > > > > Yes, the VFS layer does force some things - you can't avoid using > > > dentries, for example, because that's literally how the VFS layer > > > deals with filenames (and things like mounting etc). And honestly, the > > > VFS layer does a better job of filename caching than any filesystem > > > really can do, and with the whole UNIX mount model, filenames > > > fundamentally cross filesystem boundaries anyway. > > > > > > But clearly the VFS layer inode list handling isn't the best it can > > > be, and unless we can fix that in some fundamental way (and I don't > > > love the "let's use crazy lists instead of a simple one" models) I do > > > think that just letting filesystems do their own thing if they have > > > something better is a good model. > > > > Well, I don't love adding more indirection and callbacks. > > It's way better than open coding inode cache traversals everywhere. Eh? You had a nice iterator for dlock-list :) > The callback model is simply "call this function on every object", > and it allows implementations the freedom to decide how they are > going to run those callbacks. > > For example, this abstraction allows XFS to parallelise the > traversal. We currently run the traversal across all inodes in a > single thread, but now that XFS is walking the inode cache we can > push each shard off to a workqueue and run each shard concurrently. > IOWs, we can actually make the traversal of large caches much, much > faster without changing the semantics of the operation the traversal > is trying to acheive. > > We simply cannot do things like that without a new iteration model. > Abstraction is necessary to facilitate a new iteration model, and a > model that provides independent object callbacks allows scope for > concurrent processing of individual objects. Parallelized iteration is a slick possibility. 
My concern is that we've been trying to get away from callbacks for iteration - post spectre they're quite a bit more expensive than external iterators, and we've generally been successful with that. > > > The underlying approach in this patchset of "just use the inode hash > > table if that's available" - that I _do_ like, but this seems like > > the wrong way to go about it, we're significantly adding to the amount > > of special purpose "things" filesystems have to do if they want to > > perform well. > > I've already addressed this in my response to Christian. This is a > mechanism that allows filesystems to be moved one-by-one to a new > generic cache and iteration implementation without impacting > existing code. Once we have that, scalability of the inode cache and > traversals should not be a reason for filesystems "doing their own > thing" because the generic infrastructure will be sufficient for > most filesystem implementations. Well, I'm not really seeing the need; based on my performance testing both dlock-list and fast-list completely shift the bottleneck to the lru_list locking - and in my testing both patchsets were about equal, to within the margin of error. Which is a touch surprising, given that dlock-list works similarly to lru_list - possibly it's because you only have siblings sharing lists vs. numa nodes for lru lists, or lru scanning is doing more cross cpu/node accesses. > > Converting the standard inode hash table to an rhashtable (or more > > likely, creating a new standard implementation and converting > > filesystems one at a time) still needs to happen, and then the "use the > > hash table for iteration" approach could use that without every > > filesystem having to specialize. > > Yes, but this still doesn't help filesystems like XFS where the > structure of the inode cache is highly optimised for the specific > on-disk and in-memory locality of inodes. We aren't going to be > converting XFS to a rhashtable based inode cache anytime soon > because it simply doesn't provide the functionality we require. > e.g. efficient lockless sequential inode number ordered traversal in > -every- inode cluster writeback operation. I was going to ask what your requirements are - I may take on the general purpose inode rhashtable code, although since I'm still pretty buried we'll see. Coincidentally, just today I'm working on an issue in bcachefs where we'd also prefer an ordered data structure to a hash table for the inode cache - in online fsck, we need to be able to check if an inode is still open, but checking for an inode in an interior snapshot node means we have to do a scan and check if any of the open inodes are in a descendent subvolume. Radix tree doesn't work for us, since our keys are { inum, subvol } - 96 bits - but it has me considering looking at maple trees (or something like the lockless RCU btree you were working on awhile back) - those modern approaches should be approaching hash table performance, if enough needs for ordered access come up. > > Failing that, or even regardless, I think we do need either dlock-list > > or fast-list. "I need some sort of generic list, but fast" is something > > I've seen come up way too many times. > > There's nothing stopping you from using the dlist patchset for your > own purposes. It's public code - just make sure you retain the > correct attributions. :) If this patchset goes in that might be just what I do, if I don't get around to finishing fast-list :)
On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote: > On Thu, Oct 03, 2024 at 08:23:44AM GMT, Dave Chinner wrote: > > On Wed, Oct 02, 2024 at 03:29:10PM -0400, Kent Overstreet wrote: > > > On Wed, Oct 02, 2024 at 10:34:58PM GMT, Dave Chinner wrote: > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > > > On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote: > > > > > > What do people think of moving towards per-sb inode caching and > > > > > > traversal mechanisms like this? > > > > > > > > > > Patches 1-4 are great cleanups that I would like us to merge even > > > > > independent of the rest. > > > > > > > > Yes, they make it much easier to manage the iteration code. > > > > > > > > > I don't have big conceptual issues with the series otherwise. The only > > > > > thing that makes me a bit uneasy is that we are now providing an api > > > > > that may encourage filesystems to do their own inode caching even if > > > > > they don't really have a need for it just because it's there. So really > > > > > a way that would've solved this issue generically would have been my > > > > > preference. > > > > > > > > Well, that's the problem, isn't it? :/ > > > > > > > > There really isn't a good generic solution for global list access > > > > and management. The dlist stuff kinda works, but it still has > > > > significant overhead and doesn't get rid of spinlock contention > > > > completely because of the lack of locality between list add and > > > > remove operations. > > > > > > There is though; I haven't posted it yet because it still needs some > > > work, but the concept works and performs about the same as dlock-list. > > > > > > https://evilpiepirate.org/git/bcachefs.git/log/?h=fast_list > > > > > > The thing that needs to be sorted before posting is that it can't shrink > > > the radix tree. generic-radix-tree doesn't support shrinking, and I > > > could add that, but then ida doesn't provide a way to query the highest > > > id allocated (xarray doesn't support backwards iteration). > > > > That's an interesting construct, but... > > > > > So I'm going to try it using idr and see how that performs (idr is not > > > really the right data structure for this, split ida and item radix tree > > > is better, so might end up doing something else). > > > > > > But - this approach with more work will work for the list_lru lock > > > contention as well. > > > > .... it isn't a generic solution because it is dependent on > > blocking memory allocation succeeding for list_add() operations. > > > > Hence this cannot do list operations under external synchronisation > > constructs like spinlocks or rcu_read_lock(). It also introduces > > interesting interactions with memory reclaim - what happens we have > > to add an object to one of these lists from memory reclaim context? > > > > Taking the example of list_lru, this list construct will not work > > for a variety of reasons. Some of them are: > > > > - list_lru_add() being called from list_lru_add_obj() under RCU for > > memcg aware LRUs so cannot block and must not fail. > > - list_lru_add_obj() is called under spinlocks from inode_lru_add(), > > the xfs buffer and dquot caches, the workingset code from under > > the address space mapping xarray lock, etc. Again, this must not > > fail. > > - list_lru_add() operations take can place in large numbers in > > memory reclaim context (e.g. dentry reclaim drops inodes which > > adds them to the inode lru). 
Hence memory reclaim becomes even > > more dependent on PF_MEMALLOC memory allocation making forwards > > progress. > > - adding long tail list latency to what are currently O(1) fast path > > operations (e.g. mulitple allocations tree splits for LRUs > > tracking millions of objects) is not desirable. > > > > So while I think this is an interesting idea that might be useful in > > some cases, I don't think it is a viable generic scalable list > > construct we can use in areas like list_lru or global list > > management that run under external synchronisation mechanisms. > > There are difficulties, but given the fundamental scalability and > locking issues with linked lists, I think this is the approach we want > if we can make it work. Sure, but this is a completely different problem to what I'm trying to address here. I want infrastructure that does not need global lists or list_lru for inode cache maintenance at all. So talking about how to make the lists I am trying to remove scale better is kinda missing the point.... > A couple things that help - we've already determined that the inode LRU > can go away for most filesystems, We haven't determined that yet. I *think* it is possible, but there is a really nasty inode LRU dependency that has been driven deep down into the mm page cache writeback code. We have to fix that awful layering violation before we can get rid of the inode LRU. I *think* we can do it by requiring dirty inodes to hold an explicit inode reference, thereby keeping the inode pinned in memory whilst it is being tracked for writeback. That would also get rid of the nasty hacks needed in evict() to wait on writeback to complete on unreferenced inodes. However, this isn't simple to do, and so getting rid of the inode LRU is not going to happen in the near term. > and we can preallocate slots without > actually adding objects. Iteration will see NULLs that they skip over, > so we can't simply preallocate a slot for everything if nr_live_objects > / nr_lru_objects is too big. But, we can certainly preallocate slots on > a given code path and then release them back to the percpu buffer if > they're not used. I'm not really that interested in spending time trying to optimise away list_lru contention at this point in time. It's not a performance limiting factor because inode and dentry LRU scalability is controllable by NUMA configuration. i.e. if you have severe list_lru lock contention on inode and dentry caches, then either turn on Sub-NUMA Clustering in your bios, configure your VM with more discrete nodes, or use the fake-numa=N boot parameter to increase the number of nodes the kernel sets up. This will increase the number of list_lru instances for NUMA aware shrinkers and the contention will go away. This is trivial to do and I use the "configure your VM with more discrete nodes" method for benchmarking purposes. I've run my perf testing VMs with 4 nodes for the past decade and the list_lru contention has never got above the threshold of concern. There's always been something else causing worse problems, and even with the sb->s_inodes list out of the way, it still isn't a problem on 64-way cache-hammering workloads... > > - LRU lists are -ordered- (it's right there in the name!) and this > > appears to be an unordered list construct. > > Yes, it is.
But in actual practice cache replacement policy tends not to > matter nearly as much as people think; there's many papers showing real > world hit ratio of common algorithms is only a fudge factor from random > replacement - the main thing you want is an accessed bit (or counter, if > you want the analagous version of n-lru for n > 2), and we'll still have > that. Sure. But I can cherry-pick many papers showing exactly the opposite. i.e. that LRU and LFU algorithms are far superior at maintaining a working set compared to random cache shootdown, especially when there is significant latency for cache replacement. What matters is whether there are any behavioural regressions as a result of changing the current algorithm. We've used quasi-LRU working set management for so long that this is the behaviour that people have tuned their systems and applications to work well with. Fundamental changes to working set maintenance behaviour is not something I'm considering doing, nor something I *want* to do. And, really, this is way outside the scope of this patch set. It's even outside of the scope of "remove the inode cache LRU" proposal because that proposal is based on the existing dentry cache LRU working set management making the inode cache LRU completely redundant. i.e. it's not a change of working set management algorithms at all.... -Dave.
On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote: > On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote: > > On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote: > > > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote: > > > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote: > > > > > > > > > > > I don't have big conceptual issues with the series otherwise. The only > > > > > > thing that makes me a bit uneasy is that we are now providing an api > > > > > > that may encourage filesystems to do their own inode caching even if > > > > > > they don't really have a need for it just because it's there. So really > > > > > > a way that would've solved this issue generically would have been my > > > > > > preference. > > > > > > > > > > Well, that's the problem, isn't it? :/ > > > > > > > > > > There really isn't a good generic solution for global list access > > > > > and management. The dlist stuff kinda works, but it still has > > > > > significant overhead and doesn't get rid of spinlock contention > > > > > completely because of the lack of locality between list add and > > > > > remove operations. > > > > > > > > I much prefer the approach taken in your patch series, to let the > > > > filesystem own the inode list and keeping the old model as the > > > > "default list". > > > > > > > > In many ways, that is how *most* of the VFS layer works - it exposes > > > > helper functions that the filesystems can use (and most do), but > > > > doesn't force them. > > > > > > > > Yes, the VFS layer does force some things - you can't avoid using > > > > dentries, for example, because that's literally how the VFS layer > > > > deals with filenames (and things like mounting etc). And honestly, the > > > > VFS layer does a better job of filename caching than any filesystem > > > > really can do, and with the whole UNIX mount model, filenames > > > > fundamentally cross filesystem boundaries anyway. > > > > > > > > But clearly the VFS layer inode list handling isn't the best it can > > > > be, and unless we can fix that in some fundamental way (and I don't > > > > love the "let's use crazy lists instead of a simple one" models) I do > > > > think that just letting filesystems do their own thing if they have > > > > something better is a good model. > > > > > > Well, I don't love adding more indirection and callbacks. > > > > It's way better than open coding inode cache traversals everywhere. > > Eh? You had a nice iterator for dlock-list :) Which was painful to work with because it maintains the existing spin lock based traversal pattern. This was necessary because the iterator held a spinlock internally. I really didn't like that aspect of it because it perpetuated the need to open code the iget/iput game to allow the spinlock to be dropped across the inode operation that needed to be performed. i.e. adding a dlist iterator didn't clean up any of the other mess that sb->s_inodes iteration required... > > We simply cannot do things like that without a new iteration model. > > Abstraction is necessary to facilitate a new iteration model, and a > > model that provides independent object callbacks allows scope for > > concurrent processing of individual objects. > > Parallelized iteration is a slick possibility.
> > My concern is that we've been trying to get away from callbacks for > iteration - post spectre they're quite a bit more expensive than > external iterators, and we've generally been successful with that. So everyone keeps saying, but the old adage applies here: Penny wise, pound foolish. Optimising away the callbacks might bring us a few percent performance improvement for each operation (e.g. via the dlist iterator mechanisms) in a traversal, but that iteration is still only single threaded. Hence the maximum processing rate is determined by the performance of a single CPU core. However, if we change the API to allow for parallelism at the cost of a few percent per object operation, then a single CPU core will not process quite as many objects as before. However, the moment we allow multiple CPU cores to process in parallel, we achieve processing rate improvements measured in integer multiples. Modern CPUs have concurrency to burn. Optimising APIs for minimum per-operation overhead rather than for concurrent processing implementations is the wrong direction to be taking.... > > > Converting the standard inode hash table to an rhashtable (or more > > > likely, creating a new standard implementation and converting > > > filesystems one at a time) still needs to happen, and then the "use the > > > hash table for iteration" approach could use that without every > > > filesystem having to specialize. > > > > Yes, but this still doesn't help filesystems like XFS where the > > structure of the inode cache is highly optimised for the specific > > on-disk and in-memory locality of inodes. We aren't going to be > > converting XFS to a rhashtable based inode cache anytime soon > > because it simply doesn't provide the functionality we require. > > e.g. efficient lockless sequential inode number ordered traversal in > > -every- inode cluster writeback operation. > > I was going to ask what your requirements are - I may take on the > general purpose inode rhashtable code, although since I'm still pretty > buried we'll see. > > Coincidentally, just today I'm working on an issue in bcachefs where > we'd also prefer an ordered data structure to a hash table for the inode > cache - in online fsck, we need to be able to check if an inode is still > open, but checking for an inode in an interior snapshot node means we > have to do a scan and check if any of the open inodes are in a > descendent subvolume. > > Radix tree doesn't work for us, since our keys are { inum, subvol } - 96 > bits - Sure it does - you just need two layers of radix trees. i.e. have a radix tree per subvol to index inodes by inum, and a per-sb radix tree to index the subvols. With some code to propagate radix tree bits from the inode radix tree to the subvol radix tree they then largely work in conjunction for filtered searches. This is -exactly- the internal inode cache structure that XFS has. We have a per-sb radix tree indexing the allocation groups, and a radix tree per allocation group indexing inodes by inode number. Hence an inode lookup involves splitting the inum into agno/agino pairs, then doing a perag lookup with the agno, and doing a perag inode cache lookup with the agino. All of these radix tree lookups are lockless... -Dave.
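As a rough illustration of the two-level structure described above, the lockless lookup is essentially the following. This is a simplified sketch, not a verbatim excerpt from fs/xfs; the structure and macro names follow how they have historically appeared in XFS (m_perag_tree, pag_ici_root, XFS_INO_TO_AGNO/AGINO) and may differ in current kernels:

/*
 * Sketch of an XFS-style two-level inode cache lookup: a per-sb radix
 * tree of allocation groups, and a per-AG radix tree of cached inodes
 * indexed by AG-relative inode number.  Illustrative only.
 */
static struct xfs_inode *lookup_cached_inode(struct xfs_mount *mp, xfs_ino_t ino)
{
	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ino);	/* which AG */
	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ino);	/* inode within that AG */
	struct xfs_perag	*pag;
	struct xfs_inode	*ip = NULL;

	rcu_read_lock();
	/* per-sb tree: allocation group number -> per-AG structure */
	pag = radix_tree_lookup(&mp->m_perag_tree, agno);
	if (pag)
		/* per-AG tree: AG-relative inode number -> cached inode */
		ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	rcu_read_unlock();

	return ip;
}

The bcachefs equivalent Dave suggests would simply swap the roles: a per-sb tree indexed by subvol, each pointing at a tree of inodes indexed by inum (or the reverse, as Kent notes in his reply).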
On Thu, Oct 03, 2024 at 11:41:42AM GMT, Dave Chinner wrote: > > A couple things that help - we've already determined that the inode LRU > > can go away for most filesystems, > > We haven't determined that yet. I *think* it is possible, but there > is a really nasty inode LRU dependencies that has been driven deep > down into the mm page cache writeback code. We have to fix that > awful layering violation before we can get rid of the inode LRU. > > I *think* we can do it by requiring dirty inodes to hold an explicit > inode reference, thereby keeping the inode pinned in memory whilst > it is being tracked for writeback. That would also get rid of the > nasty hacks needed in evict() to wait on writeback to complete on > unreferenced inodes. > > However, this isn't simple to do, and so getting rid of the inode > LRU is not going to happen in the near term. Ok. > > and we can preallocate slots without > > actually adding objects. Iteration will see NULLs that they skip over, > > so we can't simply preallocate a slot for everything if nr_live_objects > > / nr_lru_objects is too big. But, we can certainly preallocate slots on > > a given code path and then release them back to the percpu buffer if > > they're not used. > > I'm not really that interested in spending time trying to optimise > away list_lru contention at this point in time. Fair, and I'm not trying to derail this one - whether dlock-list, or fast-list, or super_iter_inodes(), we should do one of them. On current mainline, I see lock contention bad enough to trigger bcachefs's 10 second "srcu lock held too long" warning, any of these solves the biggest problem. But... > It's not a performance limiting factor because inode and > dentry LRU scalability is controllable by NUMA configuration. i.e. > if you have severe list_lru lock contention on inode and dentry > caches, then either turn on Sub-NUMA Clustering in your bios, > configure your VM with more discrete nodes, or use the fake-numa=N > boot parameter to increase the number of nodes the kernel sets up. > This will increase the number of list_lru instances for NUMA aware > shrinkers and the contention will go away. I don't buy this, asking users to change their bios (and even to know that's a thing they should consider) is a big ask. Linked lists _suck_, both w.r.t. locking and cache behaviour, and we need to be exploring better options. > > > - LRU lists are -ordered- (it's right there in the name!) and this > > > appears to be an unordered list construct. > > > > Yes, it is. But in actual practice cache replacement policy tends not to > > matter nearly as much as people think; there's many papers showing real > > world hit ratio of common algorithms is only a fudge factor from random > > replacement - the main thing you want is an accessed bit (or counter, if > > you want the analagous version of n-lru for n > 2), and we'll still have > > that. > > Sure. But I can cherry-pick many papers showing exactly the opposite. > i.e. that LRU and LFU algorithms are far superior at maintaining a > working set compared to random cache shootdown, especially when > there is significant latency for cache replacement. But as mentioned we won't be comparing against pure random, it'll be vs. pure random with at least an accessed bit, preserving the multiple generations which are the most important feature of LRU/LFU as we use them. > What matters is whether there are any behavioural regressions as a > result of changing the current algorithm. 
We've used quasi-LRU > working set management for so long that this is the behaviour that > people have tuned their systems and applications to work well with. > Fundamental changes to working set maintenance behaviour is not > something I'm considering doing, nor something I *want* to do. Yup, it's not going to be the easiest thing to tackle.
On Thu, Oct 03, 2024 at 12:20:53PM GMT, Dave Chinner wrote: > On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote: > > On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote: > Which was painful to work with because > it maintains the existing spin lock based traversal pattern. This > was necessary because the iterator held a spinlock internally. I > really didn't like that aspect of it because it perpeutated the need > to open code the iget/iput game to allow the spinlock to be dropped > across the inode operation that needed to be performed. > > i.e. adding an dlist iterator didn't clean up any of the other mess > that sb->s_inodes iteration required... yeah, true. that's actually something that does get cleaner with fast-list; because we're iterating over a radix tree and our iterator is a radix tree index, the crab-walk thing naturally goes away. > > My concern is that we've been trying to get away from callbacks for > > iteration - post spectre they're quite a bit more expensive than > > external iterators, and we've generally been successful with that. > > So everyone keeps saying, but the old adage applies here: Penny > wise, pound foolish. > > Optimising away the callbacks might bring us a few percent > performance improvement for each operation (e.g. via the dlist > iterator mechanisms) in a traversal, but that iteration is still > only single threaded. Hence the maximum processing rate is > determined by the performance of a single CPU core. > > However, if we change the API to allow for parallelism at the cost > of a few percent per object operation, then a single CPU core will > not process quite as many objects as before. However, the moment we > allow multiple CPU cores to process in parallel, we acheive > processing rate improvements measured in integer multiples. > > Modern CPUs have concurrency to burn. Optimising APIs for minimum > per-operation overhead rather than for concurrent processing > implementations is the wrong direction to be taking.... OTOH - this is all academic because none of the uses of s_inodes are _remotely_ fastpaths. Aside from nr_blockdev_pages() it's more or less all filesystem teardown, or similar frequency. > > Radix tree doesn't work for us, since our keys are { inum, subvol } - 96 > > bits - > > Sure it does - you just need two layers of radix trees. i.e have a > radix tree per subvol to index inodes by inum, and a per-sb radix > tree to index the subvols. With some code to propagate radix tree > bits from the inode radix tree to the subvol radix tree they then > largely work in conjunction for filtered searches. It'd have to be the reverse - index by inum, then subvol, and then we'd need to do bit stuffing so that a radix tree with a single element is just a pointer to the element. But - yeah, if the current approach (not considering the subvol when calculating the hash) becomes an issue, that might be the way to go. > This is -exactly- the internal inode cache structure that XFS has. > We have a per-sb radix tree indexing the allocation groups, and a > radix tree per allocation group indexing inodes by inode number. > Hence an inode lookup involves splitting the inum into agno/agino > pairs, then doing a perag lookup with the agno, and doing a perag > inode cache lookup with the agino. All of these radix tree > lookups are lockless... Speaking of, I'd like to pick your brain on AGIs at some point. We've been sketching out future scalability work in bcachefs, and I think that's going to be one of the things we'll end up needing. 
Right now the scalability limit is backpointers fsck, but that looks fairly trivial to solve: there's no reason to run the backpointers -> extents pass except for debug testing; we can check and repair those references at runtime, and we can sum up backpointers in a bucket and check them against the bucket sector counts and skip extents -> backpointers if they match. After that, the next scalability limitation should be the main check_alloc_info pass, and we'll need something analogous to AGIs to shard that and run it efficiently when the main allocation info doesn't fit in memory - and it sounds like you have other optimizations that leverage AGIs as well.
On Thu 03-10-24 11:41:42, Dave Chinner wrote: > On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote: > > A couple things that help - we've already determined that the inode LRU > > can go away for most filesystems, > > We haven't determined that yet. I *think* it is possible, but there > is a really nasty inode LRU dependencies that has been driven deep > down into the mm page cache writeback code. We have to fix that > awful layering violation before we can get rid of the inode LRU. > > I *think* we can do it by requiring dirty inodes to hold an explicit > inode reference, thereby keeping the inode pinned in memory whilst > it is being tracked for writeback. That would also get rid of the > nasty hacks needed in evict() to wait on writeback to complete on > unreferenced inodes. > > However, this isn't simple to do, and so getting rid of the inode > LRU is not going to happen in the near term. Yeah. I agree the way writeback protects from inode eviction is not the prettiest one but the problem with writeback holding a normal inode reference is that then the flush worker for the device can end up deleting unlinked inodes, which was causing writeback stalls and generally unexpected lock ordering issues for some filesystems (already forgot the details). Now this was more than 12 years ago so maybe we could find a better solution to those problems these days (e.g. interactions between page writeback and page reclaim are very different these days) but I just wanted to warn that there may be nasty surprises there. Honza
On Thu, Oct 03, 2024 at 11:17:41AM +0200, Jan Kara wrote: > On Thu 03-10-24 11:41:42, Dave Chinner wrote: > > On Wed, Oct 02, 2024 at 07:20:16PM -0400, Kent Overstreet wrote: > > > A couple things that help - we've already determined that the inode LRU > > > can go away for most filesystems, > > > > We haven't determined that yet. I *think* it is possible, but there > > is a really nasty inode LRU dependencies that has been driven deep > > down into the mm page cache writeback code. We have to fix that > > awful layering violation before we can get rid of the inode LRU. > > > > I *think* we can do it by requiring dirty inodes to hold an explicit > > inode reference, thereby keeping the inode pinned in memory whilst > > it is being tracked for writeback. That would also get rid of the > > nasty hacks needed in evict() to wait on writeback to complete on > > unreferenced inodes. > > > > However, this isn't simple to do, and so getting rid of the inode > > LRU is not going to happen in the near term. > > Yeah. I agree the way writeback protects from inode eviction is not the > prettiest one but the problem with writeback holding a normal inode reference > is that then the flush worker for the device can end up deleting unlinked > inodes, which was causing writeback stalls and generally unexpected lock > ordering issues for some filesystems (already forgot the details). Yeah, if we end up in evict() on ext4 it can then do all sorts of whacky stuff that involves blocking, running transactions and doing other IO. XFS, OTOH, has been changed to defer all that crap to background threads (the xfs_inodegc infrastructure) that run after the VFS thinks the inode is dead and destroyed. There are some benefits to having the filesystem inode exist outside the VFS inode life cycle.... > Now this > was more than 12 years ago so maybe we could find a better solution to > those problems these days (e.g. interactions between page writeback and > page reclaim are very different these days) but I just wanted to warn that there > may be nasty surprises there. I don't think the situation has improved with filesystems like ext4. I think they've actually gotten worse - I recently learnt that ext4 inode eviction can recurse back into the inode cache to instantiate extended attribute inodes so they can be truncated to allow inode eviction to make progress. I suspect the ext4 eviction behaviour is unfixable in any reasonable time frame, so the only solution I can come up with is to run the iput() call from a background thread context (e.g. defer it to a workqueue). That way iput_final() and eviction processing will not interfere with other writeback operations.... -Dave.
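A minimal sketch of the kind of deferral being suggested here - all of the names below are hypothetical, assumed for illustration only, and this is not code from any posted patch:

/*
 * Sketch: defer the final iput() to a workqueue so that eviction work
 * (truncates, transactions, xattr inode lookups, IO) runs outside the
 * writeback context.  Hypothetical names throughout.
 */
struct deferred_iput {
	struct work_struct	work;
	struct inode		*inode;
};

static void deferred_iput_workfn(struct work_struct *work)
{
	struct deferred_iput *di = container_of(work, struct deferred_iput, work);

	iput(di->inode);		/* may enter evict() and block */
	kfree(di);
}

static void writeback_iput(struct inode *inode)
{
	struct deferred_iput *di = kmalloc(sizeof(*di), GFP_NOFS);

	if (!di) {
		iput(inode);		/* fall back to a synchronous put */
		return;
	}
	INIT_WORK(&di->work, deferred_iput_workfn);
	di->inode = inode;
	queue_work(system_unbound_wq, &di->work);
}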
Hi Dave! On Wed 02-10-24 11:33:17, Dave Chinner wrote: > There are two superblock iterator functions provided. The first is a > generic iterator that provides safe, reference counted inodes for > the callback to operate on. This is generally what most sb->s_inodes > iterators use, and it allows the iterator to drop locks and perform > blocking operations on the inode before moving to the next inode in > the sb->s_inodes list. > > There is one quirk to this interface - INO_ITER_REFERENCE - because > fsnotify iterates the inode cache -after- evict_inodes() has been > called during superblock shutdown to evict all non-referenced > inodes. Hence it should only find referenced inodes, and it has > a check to skip unreferenced inodes. This flag does the same. Overall I really like the series. A lot of duplicated code removed and scalability improved, we don't get such deals frequently :) Regarding INO_ITER_REFERENCE I think that after commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes") the check for 0 i_count in fsnotify_unmount_inodes() isn't that useful anymore so I'd be actually fine dropping it (as a separate patch please). That being said I'd like to discuss one thing: As you have surely noticed, some of the places iterating inodes perform additional checks on the inode to determine whether the inode is interesting or not (e.g. the Landlock iterator or iterators in quota code) to avoid the unnecessary iget / iput and locking dance. The inode refcount check you've worked-around with INO_ITER_REFERENCE is a special case of that. Have you considered option to provide callback for the check inside the iterator? Also maybe I'm went a *bit* overboard here with macro magic but the code below should provide an iterator that you can use like: for_each_sb_inode(sb, inode, inode_eligible_check(inode)) { do my stuff here } that will avoid any indirect calls and will magically handle all the cleanup that needs to be done if you break / jump out of the loop or similar. I actually find such constructs more convenient to use than your version of the iterator because there's no need to create & pass around the additional data structure for the iterator body, no need for special return values to abort iteration etc. Honza

/* Find next inode on the inode list eligible for processing */
#define sb_inode_iter_next(sb, inode, old_inode, inode_eligible)	\
({									\
	struct inode *ret = NULL;					\
									\
	cond_resched();							\
	spin_lock(&(sb)->s_inode_list_lock);				\
	if (!(inode))							\
		inode = list_first_entry(&(sb)->s_inodes,		\
					 struct inode, i_sb_list);	\
	else								\
		inode = list_next_entry(inode, i_sb_list);		\
	while (1) {							\
		if (list_entry_is_head(inode, &(sb)->s_inodes,		\
				       i_sb_list)) {			\
			spin_unlock(&(sb)->s_inode_list_lock);		\
			break;						\
		}							\
		spin_lock(&inode->i_lock);				\
		if ((inode)->i_state & (I_NEW | I_FREEING | I_WILL_FREE) || \
		    !inode_eligible) {					\
			spin_unlock(&(inode)->i_lock);			\
			inode = list_next_entry(inode, i_sb_list);	\
			continue;					\
		}							\
		__iget(inode);						\
		spin_unlock(&(inode)->i_lock);				\
		spin_unlock(&(sb)->s_inode_list_lock);			\
		iput(*old_inode);					\
		*old_inode = inode;					\
		ret = inode;						\
		break;							\
	}								\
	ret;								\
})

#define for_each_sb_inode(sb, inode, inode_eligible)			\
	for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)),	\
	     inode = NULL;						\
	     inode = sb_inode_iter_next((sb), inode, &old_inode,	\
					inode_eligible);		\
	    )
On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote: > /* Find next inode on the inode list eligible for processing */ > #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) \ > ({ \ > struct inode *ret = NULL; \ <snip> > ret; \ > }) How is this going to interact with calling into the file system to do the interaction, which is kinda the point of this series? Sure, you could have a get_next_inode-style method, but it would need a fair amount of entirely file system specific state that needs to be stashed away somewhere, and all options for that looks pretty ugly. Also even with your pre-iget callback we'd at best halve the number of indirect calls for something that isn't exactly performance critical. So while it could be done that way, it feels like a more complex and much harder to reason about version for no real benefit. > #define for_each_sb_inode(sb, inode, inode_eligible) \ > for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)), \ > inode = NULL; \ > inode = sb_inode_iter_next((sb), inode, &old_inode, \ > inode_eligible); \ > ) And while I liked: obj = NULL; while ((obj = get_next_object(foo, obj))) { } style iterators, magic for_each macros that do magic cleanup are just a nightmare to read. Keep it simple and optimize for someone actually having to read and understand the code, and not for saving a few lines of code.
On Thu 03-10-24 05:18:30, Christoph Hellwig wrote: > On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote: > > /* Find next inode on the inode list eligible for processing */ > > #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) \ > > ({ \ > > struct inode *ret = NULL; \ > > <snip> > > > ret; \ > > }) > > How is this going to interact with calling into the file system > to do the interaction, which is kinda the point of this series? Yeah, I was concentrated on the VFS bits and forgot why Dave wrote this series in the first place. So this style of iterator isn't useful for what Dave wants to achieve. Sorry for the noise. Still the possibility to have a callback under inode->i_lock being able to do stuff and decide whether we should grab a reference or continue would be useful (and would allow us to combine the three iterations on unmount into one without too much hassle). > > #define for_each_sb_inode(sb, inode, inode_eligible) \ > > for (DEFINE_FREE(old_inode, struct inode *, if (_T) iput(_T)), \ > > inode = NULL; \ > > inode = sb_inode_iter_next((sb), inode, &old_inode, \ > > inode_eligible); \ > > ) > > And while I liked: > > obj = NULL; > > while ((obj = get_next_object(foo, obj))) { > } > > style iterators, magic for_each macros that do magic cleanup are just > a nightmare to read. Keep it simple and optimize for someone actually > having to read and understand the code, and not for saving a few lines > of code. Well, I agree the above is hard to read but I don't know how to write it in a more readable way while keeping the properties of the iterator (like auto-cleanup when you break out of the loop - which is IMO a must for a sane iterator). Anyway, this is now mostly academic since I agree this iterator isn't really useful for the situation here. Honza
On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote: > Hi Dave! > > On Wed 02-10-24 11:33:17, Dave Chinner wrote: > > There are two superblock iterator functions provided. The first is a > > generic iterator that provides safe, reference counted inodes for > > the callback to operate on. This is generally what most sb->s_inodes > > iterators use, and it allows the iterator to drop locks and perform > > blocking operations on the inode before moving to the next inode in > > the sb->s_inodes list. > > > > There is one quirk to this interface - INO_ITER_REFERENCE - because > > fsnotify iterates the inode cache -after- evict_inodes() has been > > called during superblock shutdown to evict all non-referenced > > inodes. Hence it should only find referenced inodes, and it has > > a check to skip unreferenced inodes. This flag does the same. > > Overall I really like the series. A lot of duplicated code removed and > scalability improved, we don't get such deals frequently :) Regarding > INO_ITER_REFERENCE I think that after commit 1edc8eb2e9313 ("fs: call > fsnotify_sb_delete after evict_inodes") the check for 0 i_count in > fsnotify_unmount_inodes() isn't that useful anymore so I'd be actually fine > dropping it (as a separate patch please). > > That being said I'd like to discuss one thing: As you have surely noticed, > some of the places iterating inodes perform additional checks on the inode > to determine whether the inode is interesting or not (e.g. the Landlock > iterator or iterators in quota code) to avoid the unnecessary iget / iput > and locking dance. Yes, but we really don't care. None of these cases are performance critical, and I'd much prefer that we have a consistent behaviour. > The inode refcount check you've worked-around with > INO_ITER_REFERENCE is a special case of that. Have you considered option to > provide callback for the check inside the iterator? I did. I decided that it wasn't necessary just to avoid the occasional iget/iput. It's certainly not necessary for the fsnotify/landlock cases where INO_ITER_REFERENCE was used because at that point there are only landlock and fsnotify inodes left in the cache. We're going to be doing iget/iput on all of them anyway. Really, subsystems should be tracking inodes they have references to themselves, not running 'needle in haystack' searches for inodes they hold references to. That would get rid of both the fsnotify and landlock iterators completely... > Also maybe I'm went a *bit* overboard here with macro magic but the code > below should provide an iterator that you can use like: > > for_each_sb_inode(sb, inode, inode_eligible_check(inode)) { > do my stuff here > } As I explained to Kent: wrapping the existing code in a different iterator defeats the entire purpose of the change to the iteration code. > that will avoid any indirect calls and will magically handle all the > cleanup that needs to be done if you break / jump out of the loop or > similar. I actually find such constructs more convenient to use than your > version of the iterator because there's no need to create & pass around the > additional data structure for the iterator body, no need for special return > values to abort iteration etc. I'm not introducing the callback-based iterator function to clean the code up - I'm introducing it as infrastructure that allows the *iteration mechanism to be completely replaced* by filesystems that have more efficient, more scalable inode iterators already built in. 
This change of iterator model also allows seamless transition of individual filesystems to new iterator mechanisms. Macro-based iterators do not allow for different iterator implementations to co-exist, but that's exactly what I'm trying to achieve here. I'm not trying to clean the code up - I'm trying to lay the groundwork for new functionality.... -Dave.
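To make the shape of that infrastructure concrete, a hedged sketch of a pluggable iteration model follows. The function name super_iter_inodes() comes from the thread, but the signature, flags parameter, and the iter_vfs_inodes / generic_iter_inodes helpers are illustrative guesses, not the actual interface from the patch set:

/*
 * Hedged sketch of a pluggable inode iteration model.  The super_operations
 * method and the generic fallback are assumed names for illustration.
 */
typedef int (*ino_iter_fn)(struct inode *inode, void *private);

int super_iter_inodes(struct super_block *sb, ino_iter_fn fn,
		      void *private, unsigned int flags)
{
	/*
	 * Filesystems with their own inode cache (e.g. XFS, bcachefs) walk
	 * that cache directly and invoke fn on each VFS inode they find.
	 */
	if (sb->s_op->iter_vfs_inodes)
		return sb->s_op->iter_vfs_inodes(sb, fn, private, flags);

	/*
	 * Everyone else falls back to a reference-counted walk of
	 * sb->s_inodes that drops all locks before each callback.
	 */
	return generic_iter_inodes(sb, fn, private, flags);
}

Because the callback only sees a referenced inode and no locks are held across it, a filesystem implementation is free to batch lookups, run callbacks concurrently, or offload them entirely.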
On Thu, Oct 03, 2024 at 02:46:19PM +0200, Jan Kara wrote: > On Thu 03-10-24 05:18:30, Christoph Hellwig wrote: > > On Thu, Oct 03, 2024 at 01:45:55PM +0200, Jan Kara wrote: > > > /* Find next inode on the inode list eligible for processing */ > > > #define sb_inode_iter_next(sb, inode, old_inode, inode_eligible) \ > > > ({ \ > > > struct inode *ret = NULL; \ > > > > <snip> > > > > > ret; \ > > > }) > > > > How is this going to interact with calling into the file system > > to do the interaction, which is kinda the point of this series? > > Yeah, I was concentrated on the VFS bits and forgot why Dave wrote this > series in the first place. So this style of iterator isn't useful for what > Dave wants to achieve. Sorry for the noise. Still the possibility to have a > callback under inode->i_lock being able to do stuff and decide whether we I did that first, and it turned into an utter mess the moment we step outside the existing iterator mechanism. I implemented a separate XFS icwalk function because having to hold the inode->i_lock between inode lookup and the callback function means we cannot do batched inode lookups from the radix tree. The existing icwalk code grabs 32 inodes at a time from the radix tree and validates them all, then runs the callback on them one at a time, then it drops them all. If the VFS inode callback requires the inode i_lock to be held and be atomic with the initial state checks, then we have to nest 32 spinlocks in what is effectively a random lock order. So I implemented a non-batched icwalk method, and it didn't get that much cleaner. It wasn't until I dropped the inode->i_lock from the callback API that everything cleaned up and the offload mechanism started to make sense. And it was this change that also makes it possible for XFS to use its existing batched lookup mechanisms instead of the special case implementation that I wrote for this patch set. IOWs, we can't hold the inode->i_lock across lookup validation to the callback if we want to provide freedom of implementation to the filesystem specific code. We aren't really concerned about the performance of traversals, so I went with freedom of implementation over clunky locking semantics that would only optimise away a couple of atomic ops per inode for iget/iput calls. -Dave.
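For context, the batched lookup pattern being described looks roughly like the following. This is a simplified sketch of a gang-lookup style walk, not the actual xfs_icwalk() implementation; the validation/reference step is elided and the names are approximate:

/*
 * Sketch of a batched, RCU-protected radix tree walk: grab up to 32
 * inodes per tree lookup, validate them, then run the callback on each
 * one with no tree lock or i_lock held.  Illustrative only.
 */
#define ICWALK_BATCH	32

static int icwalk_sketch(struct xfs_perag *pag,
			 int (*execute)(struct xfs_inode *ip, void *priv),
			 void *priv)
{
	xfs_agino_t	first_index = 0;
	int		error = 0;

	do {
		struct xfs_inode *batch[ICWALK_BATCH];
		int nr, i;

		rcu_read_lock();
		nr = radix_tree_gang_lookup(&pag->pag_ici_root,
					    (void **)batch, first_index,
					    ICWALK_BATCH);
		if (!nr) {
			rcu_read_unlock();
			break;
		}
		for (i = 0; i < nr; i++) {
			/*
			 * Validate and take a reference on each inode here,
			 * skipping inodes being reclaimed (omitted for
			 * brevity), and remember where to resume the walk.
			 */
			first_index = XFS_INO_TO_AGINO(pag->pag_mount,
						       batch[i]->i_ino) + 1;
		}
		rcu_read_unlock();

		/* Callbacks run without any spinlocks held and may block. */
		for (i = 0; i < nr; i++) {
			error = execute(batch[i], priv);
			if (error)
				break;
		}
	} while (!error);

	return error;
}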