| Message ID | 164180589176.86426.501271559065590169.stgit@mickey.themaw.net (mailing list archive) |
|---|---|
| State | New, archived |
| Series | vfs: check dentry is still valid in get_link() |
On Mon, Jan 10, 2022 at 05:11:31PM +0800, Ian Kent wrote:
> When following a trailing symlink in rcu-walk mode it's possible for
> the dentry to become invalid between the last dentry seq lock check
> and getting the link (eg. an unlink) leading to a backtrace similar
> to this:
>
> crash> bt
> PID: 10964  TASK: ffff951c8aa92f80  CPU: 3  COMMAND: "TaniumCX"
> …
> #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
>    [exception RIP: unknown or invalid address]
>    RIP: 0000000000000000  RSP: ffffae44d0a6fc90  RFLAGS: 00010246
>    RAX: ffffffff8da3cc80  RBX: ffffae44d0a6fd30  RCX: 0000000000000000
>    RDX: ffffae44d0a6fd98  RSI: ffff951aa9af3008  RDI: 0000000000000000
>    RBP: 0000000000000000   R8: ffffae44d0a6fb94   R9: 0000000000000000
>    R10: ffff951c95d8c318  R11: 0000000000080000  R12: ffffae44d0a6fd98
>    R13: ffff951aa9af3008  R14: ffff951c8c9eb840  R15: 0000000000000000
>    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
> #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
> #10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
> #11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
> #12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
> #13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b
>
> Most of the time this is not a problem because the inode is unchanged
> while the rcu read lock is held.
>
> But xfs can re-use inodes which can result in the inode ->get_link()
> method becoming invalid (or NULL).

Without an RCU delay? Then we have much worse problems...

Details, please.
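For context, the shape of the check this series proposes is roughly the following sketch, modelled on fs/namei.c:pick_link(); `nd`, `link` and `last` are pick_link()'s nameidata, symlink path and stack entry. This is an illustration of the idea, not the literal patch:

```c
	const char *(*get)(struct dentry *, struct inode *,
			   struct delayed_call *);
	const char *res;

	get = inode->i_op->get_link;
	if (nd->flags & LOOKUP_RCU) {
		/*
		 * The dentry may have been unlinked (and the inode reused)
		 * since the last ->d_seq check; revalidate before calling
		 * ->get_link() so we bail to ref-walk instead of calling
		 * through a stale (possibly NULL) method pointer.
		 */
		if (read_seqcount_retry(&link->dentry->d_seq, last->seq))
			return ERR_PTR(-ECHILD);
		res = get(NULL, inode, &last->done);
	} else {
		res = get(link->dentry, inode, &last->done);
	}
```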
On Sat, 2022-01-15 at 06:38 +0000, Al Viro wrote: > On Mon, Jan 10, 2022 at 05:11:31PM +0800, Ian Kent wrote: > > When following a trailing symlink in rcu-walk mode it's possible > > for > > the dentry to become invalid between the last dentry seq lock check > > and getting the link (eg. an unlink) leading to a backtrace similar > > to this: > > > > crash> bt > > PID: 10964 TASK: ffff951c8aa92f80 CPU: 3 COMMAND: "TaniumCX" > > … > > #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe > > [exception RIP: unknown or invalid address] > > RIP: 0000000000000000 RSP: ffffae44d0a6fc90 RFLAGS: 00010246 > > RAX: ffffffff8da3cc80 RBX: ffffae44d0a6fd30 RCX: > > 0000000000000000 > > RDX: ffffae44d0a6fd98 RSI: ffff951aa9af3008 RDI: > > 0000000000000000 > > RBP: 0000000000000000 R8: ffffae44d0a6fb94 R9: > > 0000000000000000 > > R10: ffff951c95d8c318 R11: 0000000000080000 R12: > > ffffae44d0a6fd98 > > R13: ffff951aa9af3008 R14: ffff951c8c9eb840 R15: > > 0000000000000000 > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > > #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61 > > #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1 > > #10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700 > > #11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4 > > #12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9 > > #13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b > > > > Most of the time this is not a problem because the inode is > > unchanged > > while the rcu read lock is held. > > > > But xfs can re-use inodes which can result in the inode - > > >get_link() > > method becoming invalid (or NULL). > > Without an RCU delay? Then we have much worse problems... Sorry for the delay. That was a problem that was discussed at length with the original post of this patch that included a patch for this too (misguided though it was). That discussion resulted in Darrick merging the problem xfs inline symlink handling with the xfs normal symlink handling. Another problem with these inline syslinks was they would hand a pointer to internal xfs storage to the VFS. Darrick's change allocates and copies the link then hands it to the VFS to free after use. And since there's an allocation in the symlink handler the rcu-walk case returns -ECHILD (on passed NULL dentry) so the VFS will call unlazy before that next call which I think is itself enough to resolve this problem. The only thing I think might be questionable is the VFS copy of the inode pointer but I think the inode is rcu freed so it will be around and the seq count will have changed so I think it should be ok. If I'm missing something please say so, ;) Darrick's patch is (was last I looked) in his xfs-next tree. Ian
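Ian's description of Darrick's change maps onto the stock VFS pattern sketched below. The `foo_*` names and the `foo_inline_link()` helper are hypothetical, not the actual XFS code; the point is that rcu-walk is refused (forcing unlazy) and the VFS is handed a private copy of the link that it frees after use via a delayed call:

```c
#include <linux/fs.h>
#include <linux/slab.h>

/* hypothetical accessor for the fs-private inline symlink body */
static const char *foo_inline_link(struct inode *inode);

static const char *foo_get_link(struct dentry *dentry, struct inode *inode,
				struct delayed_call *done)
{
	char *link;

	if (!dentry)			/* rcu-walk: ask the VFS to unlazy */
		return ERR_PTR(-ECHILD);

	/* copy the body out of fs-internal storage rather than exposing it */
	link = kmemdup_nul(foo_inline_link(inode), inode->i_size, GFP_KERNEL);
	if (!link)
		return ERR_PTR(-ENOMEM);

	set_delayed_call(done, kfree_link, link);  /* VFS frees it after the walk */
	return link;
}
```

set_delayed_call()/kfree_link() is the standard VFS mechanism for "free this buffer once the walk is done with it", so the copy outlives the symlink traversal regardless of what happens to the inode.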
On Mon, Jan 17, 2022 at 10:55:32AM +0800, Ian Kent wrote: > On Sat, 2022-01-15 at 06:38 +0000, Al Viro wrote: > > On Mon, Jan 10, 2022 at 05:11:31PM +0800, Ian Kent wrote: > > > When following a trailing symlink in rcu-walk mode it's possible > > > for > > > the dentry to become invalid between the last dentry seq lock check > > > and getting the link (eg. an unlink) leading to a backtrace similar > > > to this: > > > > > > crash> bt > > > PID: 10964 TASK: ffff951c8aa92f80 CPU: 3 COMMAND: "TaniumCX" > > > … > > > #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe > > > [exception RIP: unknown or invalid address] > > > RIP: 0000000000000000 RSP: ffffae44d0a6fc90 RFLAGS: 00010246 > > > RAX: ffffffff8da3cc80 RBX: ffffae44d0a6fd30 RCX: > > > 0000000000000000 > > > RDX: ffffae44d0a6fd98 RSI: ffff951aa9af3008 RDI: > > > 0000000000000000 > > > RBP: 0000000000000000 R8: ffffae44d0a6fb94 R9: > > > 0000000000000000 > > > R10: ffff951c95d8c318 R11: 0000000000080000 R12: > > > ffffae44d0a6fd98 > > > R13: ffff951aa9af3008 R14: ffff951c8c9eb840 R15: > > > 0000000000000000 > > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > > > #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61 > > > #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1 > > > #10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700 > > > #11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4 > > > #12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9 > > > #13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b > > > > > > Most of the time this is not a problem because the inode is > > > unchanged > > > while the rcu read lock is held. > > > > > > But xfs can re-use inodes which can result in the inode - > > > >get_link() > > > method becoming invalid (or NULL). > > > > Without an RCU delay? Then we have much worse problems... > > Sorry for the delay. > > That was a problem that was discussed at length with the original post > of this patch that included a patch for this too (misguided though it > was). > To Al's question, at the end of the day there is no rcu delay involved with inode reuse in XFS. We do use call_rcu() for eventual freeing of inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that have been put into a "reclaim" state before getting to the point of freeing the struct inode memory. This lead to the long discussion [1] Ian references around ways to potentially deal with that. I think the TLDR of that thread is there are various potential options for improvement, such as to rcu wait on inode creation/reuse (either explicitly or via more open coded grace period cookie tracking), to rcu wait somewhere in the destroy sequence before inodes become reuse candidates, etc., but none of them seemingly agreeable for varying reasons (IIRC mostly stemming from either performance or compexity) [2]. The change that has been made so far in XFS is to turn rcuwalk for symlinks off once again, which looks like landed in Linus' tree as commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to the vfs"). The hope is that between that patch and this prospective vfs tweak, we can have a couple incremental fixes that at least address the practical problem users have been running into (which is a crash due to a NULL ->get_link() callback pointer due to inode reuse). The inode reuse vs. rcu thing might still be a broader problem, but AFAIA that mechanism has been in place in XFS on Linux pretty much forever. 
Brian

[1] https://lore.kernel.org/linux-fsdevel/163660197073.22525.11235124150551283676.stgit@mickey.themaw.net/

[2] Yet another idea could be a mix of two of the previously discussed
approaches: stamp the current rcu gp marker in the xfs_inode somewhere on
destroy and check it on reuse to conditionally rcu wait when necessary.
Perhaps that might provide enough batching to mitigate performance impact
when compared to an unconditional create side wait.

> That discussion resulted in Darrick merging the problem xfs inline
> symlink handling with the xfs normal symlink handling.
>
> Another problem with these inline symlinks was they would hand a
> pointer to internal xfs storage to the VFS. Darrick's change
> allocates and copies the link then hands it to the VFS to free
> after use. And since there's an allocation in the symlink handler
> the rcu-walk case returns -ECHILD (on passed NULL dentry) so the
> VFS will call unlazy before that next call which I think is itself
> enough to resolve this problem.
>
> The only thing I think might be questionable is the VFS copy of the
> inode pointer but I think the inode is rcu freed so it will be
> around and the seq count will have changed so I think it should be
> ok.
>
> If I'm missing something please say so, ;)
>
> Darrick's patch is (was last I looked) in his xfs-next tree.
>
> Ian
On Mon, Jan 17, 2022 at 09:35:58AM -0500, Brian Foster wrote:
> To Al's question, at the end of the day there is no rcu delay involved
> with inode reuse in XFS. We do use call_rcu() for eventual freeing of
> inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that
> have been put into a "reclaim" state before getting to the point of
> freeing the struct inode memory. This lead to the long discussion [1]
> Ian references around ways to potentially deal with that. I think the
> TLDR of that thread is there are various potential options for
> improvement, such as to rcu wait on inode creation/reuse (either
> explicitly or via more open coded grace period cookie tracking), to rcu
> wait somewhere in the destroy sequence before inodes become reuse
> candidates, etc., but none of them seemingly agreeable for varying
> reasons (IIRC mostly stemming from either performance or compexity) [2].
>
> The change that has been made so far in XFS is to turn rcuwalk for
> symlinks off once again, which looks like landed in Linus' tree as
> commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata
> buffers to the vfs"). The hope is that between that patch and this
> prospective vfs tweak, we can have a couple incremental fixes that at
> least address the practical problem users have been running into (which
> is a crash due to a NULL ->get_link() callback pointer due to inode
> reuse). The inode reuse vs. rcu thing might still be a broader problem,
> but AFAIA that mechanism has been in place in XFS on Linux pretty much
> forever.

My problem with that is that pathname resolution very much relies upon
the assumption that any inode it observes will *not* change its nature
until the final rcu_read_unlock(). Papering over ->i_op->get_link reads
in symlink case might be sufficient at the moment (I'm still not certain
about that, though), but that's rather brittle. E.g. if some XFS change
down the road adds ->permission() on some inodes, you'll get the same
problem in do_inode_permission(). We also have places where we rely upon

	sample ->d_seq
	fetch ->d_flags
	fetch ->d_inode
	validate ->d_seq
	...
	assume that inode type matches the information in flags

How painful would it be to make xfs_destroy_inode() a ->free_inode() instance?
IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
callback context? Note that ->destroy_inode() is called via

static void destroy_inode(struct inode *inode)
{
	const struct super_operations *ops = inode->i_sb->s_op;

	BUG_ON(!list_empty(&inode->i_lru));
	__destroy_inode(inode);
	if (ops->destroy_inode) {
		ops->destroy_inode(inode);
		if (!ops->free_inode)
			return;
	}
	inode->free_inode = ops->free_inode;
	call_rcu(&inode->i_rcu, i_callback);
}

with

static void i_callback(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);
	if (inode->free_inode)
		inode->free_inode(inode);
	else
		free_inode_nonrcu(inode);
}

IOW, ->free_inode() is RCU-delayed part of ->destroy_inode(). If both
are present, ->destroy_inode() will be called synchronously, followed
by ->free_inode() from RCU callback, so you can have both - moving just
the "finally mark for reuse" part into ->free_inode() would be OK.
Any blocking stuff (if any) can be left in ->destroy_inode()...
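As a concrete illustration of the split Al describes, a minimal sketch for a hypothetical filesystem "foo" (not actual XFS code): all blocking teardown stays synchronous in ->destroy_inode(), and only the final "safe to hand back out" step moves into ->free_inode(), which the VFS runs from the RCU callback shown above:

```c
#include <linux/fs.h>

/* hypothetical helpers standing in for the fs's real teardown stages */
static void foo_inactivate(struct inode *inode)      { /* transactions, IO, locks */ }
static void foo_mark_reclaimable(struct inode *inode) { /* tag inode reusable */ }

static void foo_destroy_inode(struct inode *inode)
{
	/* called synchronously from evict(): blocking work is fine here */
	foo_inactivate(inode);
}

static void foo_free_inode(struct inode *inode)
{
	/*
	 * Called from i_callback() after an RCU grace period: only now is
	 * the inode marked reusable, so no rcu-walk that sampled the old
	 * identity can still be running.
	 */
	foo_mark_reclaimable(inode);
}

static const struct super_operations foo_sops = {
	.destroy_inode	= foo_destroy_inode,
	.free_inode	= foo_free_inode,
	/* ... */
};
```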
On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> IOW, ->free_inode() is RCU-delayed part of ->destroy_inode(). If both
> are present, ->destroy_inode() will be called synchronously, followed
> by ->free_inode() from RCU callback, so you can have both - moving just
> the "finally mark for reuse" part into ->free_inode() would be OK.
> Any blocking stuff (if any) can be left in ->destroy_inode()...

BTW, we *do* have a problem with ext4 fast symlinks. Pathwalk assumes that
strings it parses are not changing under it. There are rather delicate
dances in dcache lookups re possibility of ->d_name contents changing under
it, but the search key is assumed to be stable.

What's more, there's a correctness issue even if we do not oops. Currently
we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from
the stack. After all, we'd just finished traversing what used to be the
contents of a symlink that used to be in the right place. It might have been
unlinked while we'd been traversing it, but that's not a correctness issue.

But that critically depends upon the contents not getting mangled. If it
*can* be screwed by such unlink, we risk successful lookup leading to the
wrong place, with nothing to tell us that it's happening. We could handle
that by adding a check to fs/namei.c:put_link(), and propagating the error
to callers. It's not impossible, but it won't be pretty.

And that assumes we avoid oopsen on string changing under us in the first
place. Which might or might not be true - I hadn't finished the audit yet.
Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods -
we need to make sure that e.g. everything called by ->d_hash() instances
is OK with strings changing right under them. Including utf8_to_utf32(),
crc32_le(), utf8_casefold_hash(), etc.
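To make the put_link() idea concrete, a rough sketch of what such a check might look like. This is illustrative only: field names follow fs/namei.c's struct saved, and the "won't be pretty" part Al alludes to is propagating the new error to every caller of put_link():

```c
static int put_link_checked(struct nameidata *nd)
{
	struct saved *last = nd->stack + --nd->depth;
	int err = 0;

	/*
	 * In RCU mode, detect a symlink whose body may have been mangled
	 * (e.g. inode torn down and reused) while we were walking it, and
	 * force a retry in ref-walk mode rather than trusting the result.
	 */
	if ((nd->flags & LOOKUP_RCU) &&
	    read_seqcount_retry(&last->link.dentry->d_seq, last->seq))
		err = -ECHILD;

	do_delayed_call(&last->done);
	if (!(nd->flags & LOOKUP_RCU))
		path_put(&last->link);
	return err;
}
```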
On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote: > On Mon, Jan 17, 2022 at 09:35:58AM -0500, Brian Foster wrote: > > > To Al's question, at the end of the day there is no rcu delay involved > > with inode reuse in XFS. We do use call_rcu() for eventual freeing of > > inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that > > have been put into a "reclaim" state before getting to the point of > > freeing the struct inode memory. This lead to the long discussion [1] > > Ian references around ways to potentially deal with that. I think the > > TLDR of that thread is there are various potential options for > > improvement, such as to rcu wait on inode creation/reuse (either > > explicitly or via more open coded grace period cookie tracking), to rcu > > wait somewhere in the destroy sequence before inodes become reuse > > candidates, etc., but none of them seemingly agreeable for varying > > reasons (IIRC mostly stemming from either performance or compexity) [2]. > > > > The change that has been made so far in XFS is to turn rcuwalk for > > symlinks off once again, which looks like landed in Linus' tree as > > commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata > > buffers to the vfs"). The hope is that between that patch and this > > prospective vfs tweak, we can have a couple incremental fixes that at > > least address the practical problem users have been running into (which > > is a crash due to a NULL ->get_link() callback pointer due to inode > > reuse). The inode reuse vs. rcu thing might still be a broader problem, > > but AFAIA that mechanism has been in place in XFS on Linux pretty much > > forever. > > My problem with that is that pathname resolution very much relies upon > the assumption that any inode it observes will *not* change its nature > until the final rcu_read_unlock(). Papering over ->i_op->get_link reads > in symlink case might be sufficient at the moment (I'm still not certain > about that, though), but that's rather brittle. E.g. if some XFS change > down the road adds ->permission() on some inodes, you'll get the same > problem in do_inode_permission(). We also have places where we rely upon > sample ->d_seq > fetch ->d_flags > fetch ->d_inode > validate ->d_seq > ... > assume that inode type matches the information in flags > > How painful would it be to make xfs_destroy_inode() a ->free_inode() instance? > IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU > callback context? Note that ->destroy_inode() is called via > As discussed on IRC, this was brought up in the earlier discussion by Miklos. Dave expressed some concern around locking, but I'm not sure I grok the details from reading back [1]. The implication seems to be the lookup side would have to rcu wait on associated inodes in the destroy side, which might be more of a concern about unconditional use of free_inode() as opposed to more selective rcu waiting for unlinked (or inactive) inodes. Dave would need to chime in further on that.. As it is, it looks to me that unlinked inodes unconditionally go to the inactive queues and thus the create side (xfs_iget_cache_hit(), if we're considering inode reuse) basically pushes on the queue and waits on the inode state to clear. Given that, ISTM it shouldn't be that much functional pain to introduce an rcu delay somewhere before an inactive inode becomes reclaimable (and thus reusable). I think the impediment to something like this has been more performance related. 
An inode alloc/free workload can turn into a continuous reuse of the same
batch of inodes, over and over. Therefore an rcu wait on iget reuse can
become effectively unconditional and slow things down quite a bit (hence
my previous, untested thought around making it conditional and potentially
amortizing the cost). I had played with a more selective grace period in
the teardown side for inactive inodes via queue_rcu_work(), since that's
an easy place to inject an rcu delay/callback, but that had some
performance impact on sustained file removal that might require retuning
other bits..

Brian

[1] https://lore.kernel.org/linux-fsdevel/20211114231834.GM449541@dread.disaster.area/#t

> static void destroy_inode(struct inode *inode)
> {
>	const struct super_operations *ops = inode->i_sb->s_op;
>
>	BUG_ON(!list_empty(&inode->i_lru));
>	__destroy_inode(inode);
>	if (ops->destroy_inode) {
>		ops->destroy_inode(inode);
>		if (!ops->free_inode)
>			return;
>	}
>	inode->free_inode = ops->free_inode;
>	call_rcu(&inode->i_rcu, i_callback);
> }
>
> with
>
> static void i_callback(struct rcu_head *head)
> {
>	struct inode *inode = container_of(head, struct inode, i_rcu);
>	if (inode->free_inode)
>		inode->free_inode(inode);
>	else
>		free_inode_nonrcu(inode);
> }
>
> IOW, ->free_inode() is RCU-delayed part of ->destroy_inode(). If both
> are present, ->destroy_inode() will be called synchronously, followed
> by ->free_inode() from RCU callback, so you can have both - moving just
> the "finally mark for reuse" part into ->free_inode() would be OK.
> Any blocking stuff (if any) can be left in ->destroy_inode()...
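In outline, the queue_rcu_work() experiment Brian mentions above would look something like the sketch below. The names are hypothetical and the real XFS queues are per-cpu and throttled (see Dave's description later in the thread); the point is that queue_rcu_work() interposes a full grace period between queueing and the worker running:

```c
#include <linux/llist.h>
#include <linux/workqueue.h>

/* hypothetical inactivation queue, trimmed to the relevant parts */
struct foo_inodegc {
	struct llist_head	list;	/* inodes awaiting inactivation */
	struct rcu_work		rwork;	/* work_struct + RCU grace period */
};

static void foo_inodegc_worker(struct work_struct *work)
{
	/*
	 * By the time this runs, a grace period has elapsed since
	 * queue_rcu_work() was called, so rcu-walk lookups that sampled the
	 * old inode identity have finished.  Inactivate the queued inodes.
	 */
}

static void foo_inodegc_init(struct foo_inodegc *gc)
{
	init_llist_head(&gc->list);
	INIT_RCU_WORK(&gc->rwork, foo_inodegc_worker);
}

static void foo_inodegc_queue(struct foo_inodegc *gc)
{
	/* no-op if already queued; the worker runs after a grace period */
	queue_rcu_work(system_unbound_wq, &gc->rwork);
}
```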
On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote: > On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote: > > > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode(). If both > > are present, ->destroy_inode() will be called synchronously, followed > > by ->free_inode() from RCU callback, so you can have both - moving just > > the "finally mark for reuse" part into ->free_inode() would be OK. > > Any blocking stuff (if any) can be left in ->destroy_inode()... > > BTW, we *do* have a problem with ext4 fast symlinks. Pathwalk assumes that > strings it parses are not changing under it. There are rather delicate > dances in dcache lookups re possibility of ->d_name contents changing under > it, but the search key is assumed to be stable. > > What's more, there's a correctness issue even if we do not oops. Currently > we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from > the stack. After all, we'd just finished traversing what used to be the > contents of a symlink that used to be in the right place. It might have been > unlinked while we'd been traversing it, but that's not a correctness issue. > > But that critically depends upon the contents not getting mangled. If it > *can* be screwed by such unlink, we risk successful lookup leading to the > wrong place, with nothing to tell us that it's happening. We could handle > that by adding a check to fs/namei.c:put_link(), and propagating the error > to callers. It's not impossible, but it won't be pretty. > > And that assumes we avoid oopsen on string changing under us in the first > place. Which might or might not be true - I hadn't finished the audit yet. > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods - > we need to make sure that e.g. everything called by ->d_hash() instances > is OK with strings changing right under them. Including utf8_to_utf32(), > crc32_le(), utf8_casefold_hash(), etc. And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and the call chains there are deep enough for me to miss something) have the "bugger the contents of string returned by RCU ->get_link() if unlink() happens" problem. I would very much prefer to have them deal with that crap, especially since I don't see why does ext4_evict_inode() need to do that memset() - can't we simply check ->i_op in ext4_can_truncate() and be done with that?
On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote: > > But that critically depends upon the contents not getting mangled. If it > > *can* be screwed by such unlink, we risk successful lookup leading to the > > wrong place, with nothing to tell us that it's happening. We could handle > > that by adding a check to fs/namei.c:put_link(), and propagating the error > > to callers. It's not impossible, but it won't be pretty. > > > > And that assumes we avoid oopsen on string changing under us in the first > > place. Which might or might not be true - I hadn't finished the audit yet. > > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods - > > we need to make sure that e.g. everything called by ->d_hash() instances > > is OK with strings changing right under them. Including utf8_to_utf32(), > > crc32_le(), utf8_casefold_hash(), etc. > > And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and > the call chains there are deep enough for me to miss something) have the > "bugger the contents of string returned by RCU ->get_link() if unlink() > happens" problem. > > I would very much prefer to have them deal with that crap, especially > since I don't see why does ext4_evict_inode() need to do that memset() - > can't we simply check ->i_op in ext4_can_truncate() and be done with > that? This reuse-without-delay has another fun side, AFAICS. Suppose the new use for inode comes with the same ->i_op (i.e. it's a symlink again) and it happens right after ->get_link() has returned the pointer to body. We are already past whatever checks we might add in pick_link(). And the pointer is still valid. So we end up quietly traversing the body of completely unrelated symlink that never had been anywhere near any directory we might be looking at. With no indication of anything going wrong - just a successful resolution with bogus result. Could XFS folks explain what exactly goes wrong if we make actual marking inode as ready for reuse RCU-delayed, by shifting just that into ->free_inode()? Why would we need any extra synchronize_rcu() anywhere?
On Tue, 2022-01-18 at 01:32 +0000, Al Viro wrote: > On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote: > > > But that critically depends upon the contents not getting > > > mangled. If it > > > *can* be screwed by such unlink, we risk successful lookup > > > leading to the > > > wrong place, with nothing to tell us that it's happening. We > > > could handle > > > that by adding a check to fs/namei.c:put_link(), and propagating > > > the error > > > to callers. It's not impossible, but it won't be pretty. > > > > > > And that assumes we avoid oopsen on string changing under us in > > > the first > > > place. Which might or might not be true - I hadn't finished the > > > audit yet. > > > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs > > > methods - > > > we need to make sure that e.g. everything called by ->d_hash() > > > instances > > > is OK with strings changing right under them. Including > > > utf8_to_utf32(), > > > crc32_le(), utf8_casefold_hash(), etc. > > > > And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that > > one and > > the call chains there are deep enough for me to miss something) > > have the > > "bugger the contents of string returned by RCU ->get_link() if > > unlink() > > happens" problem. > > > > I would very much prefer to have them deal with that crap, > > especially > > since I don't see why does ext4_evict_inode() need to do that > > memset() - > > can't we simply check ->i_op in ext4_can_truncate() and be done > > with > > that? > > This reuse-without-delay has another fun side, AFAICS. Suppose the > new use > for inode comes with the same ->i_op (i.e. it's a symlink again) and > it > happens right after ->get_link() has returned the pointer to body. > > We are already past whatever checks we might add in pick_link(). And > the > pointer is still valid. So we end up quietly traversing the body of > completely unrelated symlink that never had been anywhere near any > directory > we might be looking at. With no indication of anything going wrong - > just > a successful resolution with bogus result. Wouldn't that case be caught by the unlazy call since ->get_link() needs to return -ECHILD for the rcu case now (in xfs anyway)? > > Could XFS folks explain what exactly goes wrong if we make actual > marking > inode as ready for reuse RCU-delayed, by shifting just that into > ->free_inode()? Why would we need any extra synchronize_rcu() > anywhere?
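For reference, the fallback Ian is pointing at looks roughly like this in fs/namei.c:pick_link() (simplified; `get`, `res` and `last` are pick_link() locals):

```c
	if (nd->flags & LOOKUP_RCU) {
		res = get(NULL, inode, &last->done);
		/*
		 * The filesystem refused rcu-walk: leave lazy mode and
		 * retry with a properly referenced dentry.
		 */
		if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
			res = get(link->dentry, inode, &last->done);
	} else {
		res = get(link->dentry, inode, &last->done);
	}
```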
On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote: > On Mon, Jan 17, 2022 at 09:35:58AM -0500, Brian Foster wrote: > > > To Al's question, at the end of the day there is no rcu delay involved > > with inode reuse in XFS. We do use call_rcu() for eventual freeing of > > inodes (see __xfs_inode_free()), but inode reuse occurs for inodes that > > have been put into a "reclaim" state before getting to the point of > > freeing the struct inode memory. This lead to the long discussion [1] > > Ian references around ways to potentially deal with that. I think the > > TLDR of that thread is there are various potential options for > > improvement, such as to rcu wait on inode creation/reuse (either > > explicitly or via more open coded grace period cookie tracking), to rcu > > wait somewhere in the destroy sequence before inodes become reuse > > candidates, etc., but none of them seemingly agreeable for varying > > reasons (IIRC mostly stemming from either performance or compexity) [2]. > > > > The change that has been made so far in XFS is to turn rcuwalk for > > symlinks off once again, which looks like landed in Linus' tree as > > commit 7b7820b83f23 ("xfs: don't expose internal symlink metadata > > buffers to the vfs"). The hope is that between that patch and this > > prospective vfs tweak, we can have a couple incremental fixes that at > > least address the practical problem users have been running into (which > > is a crash due to a NULL ->get_link() callback pointer due to inode > > reuse). The inode reuse vs. rcu thing might still be a broader problem, > > but AFAIA that mechanism has been in place in XFS on Linux pretty much > > forever. > > My problem with that is that pathname resolution very much relies upon > the assumption that any inode it observes will *not* change its nature > until the final rcu_read_unlock(). Papering over ->i_op->get_link reads > in symlink case might be sufficient at the moment (I'm still not certain > about that, though), but that's rather brittle. E.g. if some XFS change > down the road adds ->permission() on some inodes, you'll get the same > problem in do_inode_permission(). We also have places where we rely upon > sample ->d_seq > fetch ->d_flags > fetch ->d_inode > validate ->d_seq > ... > assume that inode type matches the information in flags > > How painful would it be to make xfs_destroy_inode() a ->free_inode() instance? > IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU > callback context? AIUI, not very close at all, I'm pretty sure we can't put it under RCU callback context at all because xfs_fs_destroy_inode() can take sleeping locks, perform transactions, do IO, run rcu_read_lock() critical sections, etc. This means that needs to run an a full task context and so can't run from RCU callback context at all. > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode(). If both > are present, ->destroy_inode() will be called synchronously, followed > by ->free_inode() from RCU callback, so you can have both - moving just > the "finally mark for reuse" part into ->free_inode() would be OK. > Any blocking stuff (if any) can be left in ->destroy_inode()... Yup, that's pretty much what we already do, except that we run the RCU-delayed part of freeing the inode once XFS has finished with the inode internally and the background inode GC reclaims it. Cheers, Dave.
On Tue, Jan 18, 2022 at 10:31:53AM +0800, Ian Kent wrote: > Wouldn't that case be caught by the unlazy call since ->get_link() > needs to return -ECHILD for the rcu case now (in xfs anyway)? *shrug* that'd solve the problem, all right, but it's a serious overkill in several respects.
On Tue, Jan 18, 2022 at 02:00:41PM +1100, Dave Chinner wrote:
> > IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU
> > callback context?
>
> AIUI, not very close at all,
>
> I'm pretty sure we can't put it under RCU callback context at all
> because xfs_fs_destroy_inode() can take sleeping locks, perform
> transactions, do IO, run rcu_read_lock() critical sections, etc.
> This means that needs to run an a full task context and so can't run
> from RCU callback context at all.

Umm... AFAICS, this

	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
	spin_lock(&pag->pag_ici_lock);
	spin_lock(&ip->i_flags_lock);

	trace_xfs_inode_set_reclaimable(ip);
	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
	ip->i_flags |= XFS_IRECLAIMABLE;
	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
			XFS_ICI_RECLAIM_TAG);

	spin_unlock(&ip->i_flags_lock);
	spin_unlock(&pag->pag_ici_lock);
	xfs_perag_put(pag);

in the end of xfs_inodegc_set_reclaimable() could go into ->free_inode()
just fine. It's xfs_inodegc_queue() I'm not sure about - the part
about flush_work() in there...

I'm not familiar with that code; could you point me towards some
docs/old postings/braindump/whatnot?
On Tue, Jan 18, 2022 at 03:17:13AM +0000, Al Viro wrote: > On Tue, Jan 18, 2022 at 02:00:41PM +1100, Dave Chinner wrote: > > > IOW, how far is xfs_inode_mark_reclaimable() from being callable in RCU > > > callback context? > > > > AIUI, not very close at all, > > > > I'm pretty sure we can't put it under RCU callback context at all > > because xfs_fs_destroy_inode() can take sleeping locks, perform > > transactions, do IO, run rcu_read_lock() critical sections, etc. > > This means that needs to run an a full task context and so can't run > > from RCU callback context at all. > > Umm... AFAICS, this > pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); > spin_lock(&pag->pag_ici_lock); > spin_lock(&ip->i_flags_lock); > > trace_xfs_inode_set_reclaimable(ip); > ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING); > ip->i_flags |= XFS_IRECLAIMABLE; > xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino), > XFS_ICI_RECLAIM_TAG); > > spin_unlock(&ip->i_flags_lock); > spin_unlock(&pag->pag_ici_lock); > xfs_perag_put(pag); > > in the end of xfs_inodegc_set_reclaimable() could go into ->free_inode() > just fine. No, that just creates a black hole where the VFS inode has been destroyed but the XFS inode cache doesn't know it's been trashed. Hence setting XFS_IRECLAIMABLE needs to remain in the during ->destroy_inode, otherwise the ->lookup side of the cache will think that are currently still in use by the VFS and hand them straight back out without going through the inode recycling code. i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS part of the inode has been torn down, and that it must go back through VFS re-initialisation before it can be re-instantiated as a VFS inode. It would also mean that the inode will need to go through two RCU grace periods before it gets reclaimed, because XFS uses RCU protected inode cache lookups internally (e.g. for clustering dirty inode writeback) and so freeing the inode from the internal XFS inode cache requires RCU freeing... > It's xfs_inodegc_queue() I'm not sure about - the part > about flush_work() in there... That's the bit that prevents unbound queuing of inodes that require inactivation because of the problems that causes memory reclaim and performance. It blocks waiting for xfs_inactive() calls to complete via xfs_inodegc_worker(), and it's the xfs_inactive() calls that do all the IO, transactions, locking, etc. So if we block waiting for them to complete in RCU callback context, we're effectively inverting the current locking order for entire filesystem w.r.t. RCU... Also worth considering: if we are processing the unlink of an inode with tens or hundreds of millions of extents in xfs_inactive(), that flush_work() call could block for *minutes* waiting on inactivation to run the millions of transactions needed to free that inode. Regardless of lock ordering, we don't want to block RCU grace period callback work for that long.... > I'm not familiar with that code; could you point me towards some > docs/old postings/braindump/whatnot? Commit ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") explains the per-cpu queuing mechanism that is being used in this code. Cheers, Dave.
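In outline, the throttling behaviour Dave describes looks like the sketch below. This is a hypothetical illustration, not the actual code from commit ab23a7768739; names, structure and the threshold are invented:

```c
#include <linux/llist.h>
#include <linux/workqueue.h>
#include <linux/atomic.h>

#define FOO_GC_MAX_BACKLOG	32	/* invented threshold */

struct foo_inode {
	struct llist_node	i_gclist;
	/* ... */
};

struct foo_inodegc {
	struct llist_head	list;	/* inodes awaiting inactivation */
	struct work_struct	work;	/* runs xfs_inactive()-style work */
	struct workqueue_struct	*wq;
	atomic_t		backlog;
};

static void foo_inodegc_queue(struct foo_inodegc *gc, struct foo_inode *ip)
{
	llist_add(&ip->i_gclist, &gc->list);
	queue_work(gc->wq, &gc->work);

	/*
	 * Unbounded backlog would mean unbounded memory demand, so past a
	 * threshold the queuer blocks until the worker catches up.  This
	 * flush_work() can sleep for a very long time (transactions, IO,
	 * locks - minutes for a huge unlinked inode), which is why none of
	 * this can be moved into RCU callback context.
	 */
	if (atomic_inc_return(&gc->backlog) > FOO_GC_MAX_BACKLOG)
		flush_work(&gc->work);
}
```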
On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote: > No, that just creates a black hole where the VFS inode has been > destroyed but the XFS inode cache doesn't know it's been trashed. > Hence setting XFS_IRECLAIMABLE needs to remain in the during > ->destroy_inode, otherwise the ->lookup side of the cache will think > that are currently still in use by the VFS and hand them straight > back out without going through the inode recycling code. > > i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS > part of the inode has been torn down, and that it must go back > through VFS re-initialisation before it can be re-instantiated as a > VFS inode. OK... > It would also mean that the inode will need to go through two RCU > grace periods before it gets reclaimed, because XFS uses RCU > protected inode cache lookups internally (e.g. for clustering dirty > inode writeback) and so freeing the inode from the internal > XFS inode cache requires RCU freeing... Wait a minute. Where is that RCU delay of yours, relative to xfs_vn_unlink() and xfs_vn_rename() (for target)? And where does it happen in case of e.g. open() + unlink() + close()?
On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote: > On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote: > > > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode(). If both > > are present, ->destroy_inode() will be called synchronously, followed > > by ->free_inode() from RCU callback, so you can have both - moving just > > the "finally mark for reuse" part into ->free_inode() would be OK. > > Any blocking stuff (if any) can be left in ->destroy_inode()... > > BTW, we *do* have a problem with ext4 fast symlinks. Pathwalk assumes that > strings it parses are not changing under it. There are rather delicate > dances in dcache lookups re possibility of ->d_name contents changing under > it, but the search key is assumed to be stable. > > What's more, there's a correctness issue even if we do not oops. Currently > we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from > the stack. After all, we'd just finished traversing what used to be the > contents of a symlink that used to be in the right place. It might have been > unlinked while we'd been traversing it, but that's not a correctness issue. > > But that critically depends upon the contents not getting mangled. If it > *can* be screwed by such unlink, we risk successful lookup leading to the Out of curiosity: whether or not it can get mangled depends on the filesystem and how it implements fast symlinks or do fast symlinks currently guarantee that contents are mangled?
On Tue, Jan 18, 2022 at 01:32:23AM +0000, Al Viro wrote: > On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote: > > > But that critically depends upon the contents not getting mangled. If it > > > *can* be screwed by such unlink, we risk successful lookup leading to the > > > wrong place, with nothing to tell us that it's happening. We could handle > > > that by adding a check to fs/namei.c:put_link(), and propagating the error > > > to callers. It's not impossible, but it won't be pretty. > > > > > > And that assumes we avoid oopsen on string changing under us in the first > > > place. Which might or might not be true - I hadn't finished the audit yet. > > > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods - > > > we need to make sure that e.g. everything called by ->d_hash() instances > > > is OK with strings changing right under them. Including utf8_to_utf32(), > > > crc32_le(), utf8_casefold_hash(), etc. > > > > And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and > > the call chains there are deep enough for me to miss something) have the > > "bugger the contents of string returned by RCU ->get_link() if unlink() > > happens" problem. > > > > I would very much prefer to have them deal with that crap, especially > > since I don't see why does ext4_evict_inode() need to do that memset() - > > can't we simply check ->i_op in ext4_can_truncate() and be done with > > that? > > This reuse-without-delay has another fun side, AFAICS. Suppose the new use > for inode comes with the same ->i_op (i.e. it's a symlink again) and it > happens right after ->get_link() has returned the pointer to body. > Yep, I had reproduced this explicitly when playing around with some instrumented delays and whatnot in the code. This and the similar variant of just returning internal/non-string data fork metadata via ->get_link() is why I asked to restore old behavior of returning -ECHILD for inline symlinks. > We are already past whatever checks we might add in pick_link(). And the > pointer is still valid. So we end up quietly traversing the body of > completely unrelated symlink that never had been anywhere near any directory > we might be looking at. With no indication of anything going wrong - just > a successful resolution with bogus result. > > Could XFS folks explain what exactly goes wrong if we make actual marking > inode as ready for reuse RCU-delayed, by shifting just that into > ->free_inode()? Why would we need any extra synchronize_rcu() anywhere? > Dave already chimed in on why we probably don't want ->free_inode() across the board. I don't think there's a functional problem with a more selective injection of an rcu delay on the INACTIVE -> RECLAIMABLE transition, based on the reasoning specified earlier (i.e., the iget side already blocks on INACTIVE, so it's just a matter of a longer delay). Most of that long thread I previously linked to was us discussing pretty much how to do something like that with minimal performance impact. The experiment I ran to measure performance was use of queue_rcu_work() for inactive inode processing. That resulted in a performance hit to single threaded sequential file removal, but could be mitigated by increasing the queue size (which may or may not have other side effects). Dave suggested a more async approach to track the current grace period in the inode and refer to it at lookup/alloc time, but that is notably more involved and isn't clear if/how much it mitigates rcu delays. 
IIUC, your thought here is to introduce an rcu delay on the destroy side, but after the inactive processing rather than before it (as my previous experiment did). IOW, basically invoke xfs_inodegc_set_reclaimable() as an rcu callback via xfs_inodegc_worker(), yes? If so, that seems like a potentially reasonable option to me since it pulls the delay out of the inactivation processing pipeline. I suspect the tradeoff with that is it might be slightly less efficient than doing it earlier because we've lost any grace period transitions that have occurred since before the inode was queued and processed, but OTOH this might isolate the impact of that delay to the inode reuse path. Maybe there's room for a simple optimization there in cases where a gp may have expired already since the inode was first queued. Hmm.. maybe I'll give that a try to see if/how much impact there may be on an inode alloc/free workload.. Brian
On Tue, Jan 18, 2022 at 09:29:11AM +0100, Christian Brauner wrote: > On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote: > > On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote: > > > > > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode(). If both > > > are present, ->destroy_inode() will be called synchronously, followed > > > by ->free_inode() from RCU callback, so you can have both - moving just > > > the "finally mark for reuse" part into ->free_inode() would be OK. > > > Any blocking stuff (if any) can be left in ->destroy_inode()... > > > > BTW, we *do* have a problem with ext4 fast symlinks. Pathwalk assumes that > > strings it parses are not changing under it. There are rather delicate > > dances in dcache lookups re possibility of ->d_name contents changing under > > it, but the search key is assumed to be stable. > > > > What's more, there's a correctness issue even if we do not oops. Currently > > we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from > > the stack. After all, we'd just finished traversing what used to be the > > contents of a symlink that used to be in the right place. It might have been > > unlinked while we'd been traversing it, but that's not a correctness issue. > > > > But that critically depends upon the contents not getting mangled. If it > > *can* be screwed by such unlink, we risk successful lookup leading to the > > Out of curiosity: whether or not it can get mangled depends on the > filesystem and how it implements fast symlinks or do fast symlinks > currently guarantee that contents are mangled? Not sure if I understand your question correctly... Filesystems should guarantee that the contents of string returned by ->get_link() (or pointed to by ->i_link) remains unchanged for as long as we are looking at it (until fs/namei.c:put_link() that drops it or fs/namei.c:drop_links() in the end of pathwalk). Fast symlinks or not - doesn't matter. The only cases where that does not hold (there are two of them in the entire kernel) happen to be fast symlinks. Both cases are bugs. ext4 case is actually easy to fix - the only reason it ends up mangling the string is the way ext4_truncate() implements its check for victim being a fast symlink (and thus needing no work). It gets disrupted by zeroing ->i_size, which we need to do in this case (inode removal). That's not hard to get right. A plenty of other fast symlink variants (starting with ext2 ones, BTW) do not step into anything of that sort. IIRC, we used to have at least some cases (orangefs, perhaps?) where revalidate on a symlink might (with confused or malicious server) end up modifying the contents, possibly right under somebody else walking that symlink. Also a bug...
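A sketch of the direction Al suggests for ext4, purely illustrative: decide "this is a fast symlink, there is nothing to truncate" from the inode_operations, which never change over the inode's VFS lifetime, rather than from size/block fields that eviction wants to zero. ext4_can_truncate() could then skip such inodes without ext4_evict_inode() having to scribble over the inline body that RCU walkers may still be reading:

```c
#include "ext4.h"	/* ext4_fast_symlink_inode_operations */

/* illustrative only - not a proposed patch */
static bool ext4_is_fast_symlink_iop(struct inode *inode)
{
	return S_ISLNK(inode->i_mode) &&
	       inode->i_op == &ext4_fast_symlink_inode_operations;
}
```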
On Tue, Jan 18, 2022 at 08:47:53AM -0500, Brian Foster wrote: > On Tue, Jan 18, 2022 at 01:32:23AM +0000, Al Viro wrote: > > On Mon, Jan 17, 2022 at 07:48:49PM +0000, Al Viro wrote: > > > > But that critically depends upon the contents not getting mangled. If it > > > > *can* be screwed by such unlink, we risk successful lookup leading to the > > > > wrong place, with nothing to tell us that it's happening. We could handle > > > > that by adding a check to fs/namei.c:put_link(), and propagating the error > > > > to callers. It's not impossible, but it won't be pretty. > > > > > > > > And that assumes we avoid oopsen on string changing under us in the first > > > > place. Which might or might not be true - I hadn't finished the audit yet. > > > > Note that it's *NOT* just fs/namei.c + fs/dcache.c + some fs methods - > > > > we need to make sure that e.g. everything called by ->d_hash() instances > > > > is OK with strings changing right under them. Including utf8_to_utf32(), > > > > crc32_le(), utf8_casefold_hash(), etc. > > > > > > And AFAICS, ext4, xfs and possibly ubifs (I'm unfamiliar with that one and > > > the call chains there are deep enough for me to miss something) have the > > > "bugger the contents of string returned by RCU ->get_link() if unlink() > > > happens" problem. > > > > > > I would very much prefer to have them deal with that crap, especially > > > since I don't see why does ext4_evict_inode() need to do that memset() - > > > can't we simply check ->i_op in ext4_can_truncate() and be done with > > > that? > > > > This reuse-without-delay has another fun side, AFAICS. Suppose the new use > > for inode comes with the same ->i_op (i.e. it's a symlink again) and it > > happens right after ->get_link() has returned the pointer to body. > > > > Yep, I had reproduced this explicitly when playing around with some > instrumented delays and whatnot in the code. This and the similar > variant of just returning internal/non-string data fork metadata via > ->get_link() is why I asked to restore old behavior of returning -ECHILD > for inline symlinks. > > > We are already past whatever checks we might add in pick_link(). And the > > pointer is still valid. So we end up quietly traversing the body of > > completely unrelated symlink that never had been anywhere near any directory > > we might be looking at. With no indication of anything going wrong - just > > a successful resolution with bogus result. > > > > Could XFS folks explain what exactly goes wrong if we make actual marking > > inode as ready for reuse RCU-delayed, by shifting just that into > > ->free_inode()? Why would we need any extra synchronize_rcu() anywhere? > > > > Dave already chimed in on why we probably don't want ->free_inode() > across the board. I don't think there's a functional problem with a more > selective injection of an rcu delay on the INACTIVE -> RECLAIMABLE > transition, based on the reasoning specified earlier (i.e., the iget > side already blocks on INACTIVE, so it's just a matter of a longer > delay). > > Most of that long thread I previously linked to was us discussing pretty > much how to do something like that with minimal performance impact. The > experiment I ran to measure performance was use of queue_rcu_work() for > inactive inode processing. That resulted in a performance hit to single > threaded sequential file removal, but could be mitigated by increasing > the queue size (which may or may not have other side effects). 
> Dave suggested a more async approach to track the current grace period
> in the inode and refer to it at lookup/alloc time, but that is notably
> more involved and isn't clear if/how much it mitigates rcu delays.
>
> IIUC, your thought here is to introduce an rcu delay on the destroy
> side, but after the inactive processing rather than before it (as my
> previous experiment did). IOW, basically invoke
> xfs_inodegc_set_reclaimable() as an rcu callback via
> xfs_inodegc_worker(), yes? If so, that seems like a potentially
> reasonable option to me since it pulls the delay out of the inactivation
> processing pipeline. I suspect the tradeoff with that is it might be
> slightly less efficient than doing it earlier because we've lost any
> grace period transitions that have occurred since before the inode was
> queued and processed, but OTOH this might isolate the impact of that
> delay to the inode reuse path. Maybe there's room for a simple
> optimization there in cases where a gp may have expired already since
> the inode was first queued. Hmm.. maybe I'll give that a try to see
> if/how much impact there may be on an inode alloc/free workload..

Here are results from a few quick tests running a tight inode alloc/free
loop against an XFS filesystem with some concurrent kernel source tree
recursive ls activity running on a separate (XFS) filesystem in the
background. The loop runs for 60s and reports how many create/unlink
cycles it was able to perform in that time:

	Baseline (xfs for-next, v5.16.0-rc5):	~34k cycles
	inactive -> reclaimable grace period:	~29k cycles
	unconditional iget rcu sync:		~4400 cycles

Note that I did get a lockdep splat from _set_reclaimable() in rcu context
that I've ignored for now because the test completed. That aside, that
looks like about a ~15% or so hit from baseline. The last test inserts an
unconditional synchronize_rcu() in the iget recycle path to provide a
reference for the likely worst case implementation.

If I go back to the inactive -> reclaimable grace period variant and also
insert a start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
pair across the inactive processing sequence, I start seeing numbers
closer to ~36k cycles. IOW, the xfs_inodegc_inactivate() helper looks
something like this:

	if (poll_state_synchronize_rcu(ip->i_destroy_gp))
		xfs_inodegc_set_reclaimable(ip);
	else
		call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);

... to skip the rcu grace period if one had already passed while the
inode sat on the inactivation queue and was processed. However my box
went haywire shortly after with rcu stall reports or something if I let
that continue to run longer, so I'll probably have to look into that
lockdep splat (complaining about &pag->pag_ici_lock in rcu context,
perhaps needs to become irq safe?) or see if something else is busted..

Brian

> Brian
On Tue, Jan 18, 2022 at 01:25:15PM -0500, Brian Foster wrote:
> If I go back to the inactive -> reclaimable grace period variant and
> also insert a start_poll_synchronize_rcu() and
> poll_state_synchronize_rcu() pair across the inactive processing
> sequence, I start seeing numbers closer to ~36k cycles. IOW, the
> xfs_inodegc_inactivate() helper looks something like this:
>
>	if (poll_state_synchronize_rcu(ip->i_destroy_gp))
>		xfs_inodegc_set_reclaimable(ip);
>	else
>		call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback);
>
> ... to skip the rcu grace period if one had already passed while the
> inode sat on the inactivation queue and was processed.
>
> However my box went haywire shortly after with rcu stall reports or
> something if I let that continue to run longer, so I'll probably have to
> look into that lockdep splat (complaining about &pag->pag_ici_lock in
> rcu context, perhaps needs to become irq safe?) or see if something else
> is busted..

Umm... Could you (or Dave) describe where does the mainline do the
RCU delay mentioned several times in these threads, in case of

	* unlink()
	* overwriting rename()
	* open() + unlink() + close() (that one, obviously, for regular files)

The thing is, if we already do have an RCU delay in there, there might be
a better solution - making sure it happens downstream of d_drop() (in case
when dentry has other references) or dentry going negative (in case we are
holding the sole reference to it).

If we can do that, RCU'd dcache lookups won't run into inode reuse at all.
Sure, right now d_delete() comes last *and* we want the target inode to stay
around past the call of ->unlink(). But if you look at do_unlinkat() you'll
see an interesting-looking wart with ihold/iput around vfs_unlink(). Not
sure if you remember the story on that one; basically, it's "we'd rather
have possible IO on inode freeing to happen outside of the lock on parent".

nfsd and mqueue do the same thing; ksmbd does not. OTOH, ksmbd appears to
force the "make victim go unhashed, sole reference or not". ecryptfs
definitely does that forcing (deliberately so).

That area could use some rethinking, and if we can deal with the inode reuse
delay while we are at it...

Overwriting rename() is also interesting in that respect, of course.

I can go and try to RTFS my way through xfs iget-related code, but I'd
obviously prefer to do so with at least some overview of that thing
from the folks familiar with it. Seeing that it's a lockless search
structure, missing something subtle there is all too easy, especially
with the lookup-by-fhandle stuff in the mix...
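The do_unlinkat() wart in question, heavily trimmed (error paths, delegation retry and audit omitted; the mnt_userns argument reflects the kernels of this era):

```c
	/* inside do_unlinkat(), with the parent directory lock held */
	inode = dentry->d_inode;
	if (inode)
		ihold(inode);	/* keep the victim alive across vfs_unlink() */
	error = vfs_unlink(mnt_userns, path.dentry->d_inode, dentry,
			   &delegated_inode);
	dput(dentry);
	inode_unlock(path.dentry->d_inode);
	if (inode)
		iput(inode);	/* possible final iput: any IO from eviction
				 * happens outside the lock on the parent */
```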
On Tue, Jan 18, 2022 at 07:20:35PM +0000, Al Viro wrote: > On Tue, Jan 18, 2022 at 01:25:15PM -0500, Brian Foster wrote: > > > If I go back to the inactive -> reclaimable grace period variant and > > also insert a start_poll_synchronize_rcu() and > > poll_state_synchronize_rcu() pair across the inactive processing > > sequence, I start seeing numbers closer to ~36k cycles. IOW, the > > xfs_inodegc_inactivate() helper looks something like this: > > > > if (poll_state_synchronize_rcu(ip->i_destroy_gp)) > > xfs_inodegc_set_reclaimable(ip); > > else > > call_rcu(&VFS_I(ip)->i_rcu, xfs_inodegc_set_reclaimable_callback); > > > > ... to skip the rcu grace period if one had already passed while the > > inode sat on the inactivation queue and was processed. > > > > However my box went haywire shortly after with rcu stall reports or > > something if I let that continue to run longer, so I'll probably have to > > look into that lockdep splat (complaining about &pag->pag_ici_lock in > > rcu context, perhaps needs to become irq safe?) or see if something else > > is busted.. > > Umm... Could you (or Dave) describe where does the mainline do the > RCU delay mentioned several times in these threads, in case of > * unlink() > * overwriting rename() > * open() + unlink() + close() (that one, obviously, for regular files) > If you're referring to the existing rcu delay in XFS, I suspect that refers to xfs_reclaim_inode(). The last bits of that function remove the inode from the perag tree and then calls __xfs_inode_free(), which frees the inode via rcu callback. BTW, I think the experiment above is either going to require an audit to make the various _set_reclaimable() locks irq safe or something a bit more ugly like putting the inode back on a workqueue after the rcu delay to make the state transition. Given the incremental improvement from using start_poll_synchronize_rcu(), I replaced the above destroy side code with a cond_synchronize_rcu(ip->i_destroy_gp) call on the iget/recycle side and see similar results (~36k cycles per 60s, but so far without any explosions). Brian > The thing is, if we already do have an RCU delay in there, there might be > a better solution - making sure it happens downstream of d_drop() (in case > when dentry has other references) or dentry going negative (in case we are > holding the sole reference to it). > > If we can do that, RCU'd dcache lookups won't run into inode reuse at all. > Sure, right now d_delete() comes last *and* we want the target inode to stay > around past the call of ->unlink(). But if you look at do_unlinkat() you'll > see an interesting-looking wart with ihold/iput around vfs_unlink(). Not > sure if you remember the story on that one; basically, it's "we'd rather > have possible IO on inode freeing to happen outside of the lock on parent". > > nfsd and mqueue do the same thing; ksmbd does not. OTOH, ksmbd appears to > force the "make victim go unhashed, sole reference or not". ecryptfs > definitely does that forcing (deliberately so). > > That area could use some rethinking, and if we can deal with the inode reuse > delay while we are at it... > > Overwriting rename() is also interesting in that respect, of course. > > I can go and try to RTFS my way through xfs iget-related code, but I'd > obviously prefer to do so with at least some overview of that thing > from the folks familiar with it. Seeing that it's a lockless search > structure, missing something subtle there is all too easy, especially > with the lookup-by-fhandle stuff in the mix... >
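The conditional variant Brian lands on can be summarised as the sketch below. i_destroy_gp is the cookie field from his earlier snippet (not a mainline field); start_poll_synchronize_rcu() and cond_synchronize_rcu() are the stock RCU grace-period polling APIs:

```c
/* teardown side: note the grace-period state when inactivation begins */
ip->i_destroy_gp = start_poll_synchronize_rcu();

/* ... inactivation runs, inode eventually becomes XFS_IRECLAIMABLE ... */

/*
 * recycle side (xfs_iget_cache_hit()): block only if no grace period has
 * elapsed since the cookie was taken; otherwise return immediately.
 */
cond_synchronize_rcu(ip->i_destroy_gp);
```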
On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote:
>
> > No, that just creates a black hole where the VFS inode has been
> > destroyed but the XFS inode cache doesn't know it's been trashed.
> > Hence setting XFS_IRECLAIMABLE needs to remain in the during
> > ->destroy_inode, otherwise the ->lookup side of the cache will think
> > that are currently still in use by the VFS and hand them straight
> > back out without going through the inode recycling code.
> >
> > i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS
> > part of the inode has been torn down, and that it must go back
> > through VFS re-initialisation before it can be re-instantiated as a
> > VFS inode.
>
> OK...
>
> > It would also mean that the inode will need to go through two RCU
> > grace periods before it gets reclaimed, because XFS uses RCU
> > protected inode cache lookups internally (e.g. for clustering dirty
> > inode writeback) and so freeing the inode from the internal
> > XFS inode cache requires RCU freeing...
>
> Wait a minute.  Where is that RCU delay of yours, relative to
> xfs_vn_unlink() and xfs_vn_rename() (for target)?

Both of those drop the inode on an on-disk unlinked list. When the
last reference goes away, ->destroy_inode then runs inactivation.

Inactivation then runs transactions to free all the space attached
to the inode and then removes the inode from the unlinked list and
frees it. It then goes into the XFS_IRECLAIMABLE state and is dirty
in memory. It can't be reclaimed until the inode is written to disk
or the whole inode cluster is freed and the inode marked XFS_ISTALE
(so won't get written back).

At that point, a background inode reclaim thread (runs every 5s)
does a RCU protected lockless radix tree walk to find
XFS_IRECLAIMABLE inodes (via radix tree tags). If they are clean, it
moves them to XFS_IRECLAIM state, deletes them from the radix tree
and frees them via a call_rcu() callback.

If memory reclaim comes along sooner than this, the
->free_cached_objects() superblock shrinker callback runs that RCU
protected lockless radix tree walk to find XFS_IRECLAIMABLE inodes.

> And where does
> it happen in case of e.g. open() + unlink() + close()?

Same thing - close() drops the last reference, the unlinked inode
goes through inactivation, then moves into the XFS_IRECLAIMABLE
state.

The problem is not -quite- open-unlink-close. The problem case is
the reallocation of an on-disk inode in the case of
unlink-close-open(O_CREATE) operations because of the on-disk inode
allocator policy of aggressive reuse of recently freed inodes. In
that case the xfs_iget() lookup will reinstantiate the inode via
xfs_iget_recycle() and the inode will change identity between VFS
instantiations.

This is where a RCU grace period is absolutely required, and we
don't currently have one. The bug was introduced with RCU freeing of
inodes (what, 15 years ago now?) and it's only recently that we've
realised this bug exists via code inspection. We really have no
evidence that it's actually been tripped over in the wild....

Unfortunately, the simple fix of adding synchronize_rcu() to
xfs_iget_recycle() causes significant performance regressions
because we hit this path quite frequently when workloads use lots of
temporary files - the on-disk inode allocator policy tends towards
aggressive re-use of inodes for small sets of temporary files.
The problem XFS is trying to address is that the VFS inode lifecycle
does not cater for filesystems that need to both dirty and then
clean unlinked inodes between iput_final() and ->destroy_inode. It's
too late to be able to put the inode back on the LRU once we've
decided to drop the inode if we need to dirty it again. And because
evict() is part of the non-blocking memory reclaim, we aren't
supposed to block for arbitrarily long periods of time or create
unbound memory demand processing inode eviction (both of which XFS
can do in inactivation).

IOWs, XFS can't free the inode until its journal releases the
internal reference on the dirty inode. ext4 doesn't track inodes in
its journal - it only tracks inode buffers that contain the changes
made to the inode, so once the transaction is committed in
ext4_evict_inode() the inode can be immediately freed via either
->destroy_inode or ->free_inode. That option does not exist for XFS
because we have to wait for the journal to finish with the inode
before it can be freed. Hence all the background reclaim stuff.

We've recently solved several of the problems we need to solve to
reduce the mismatch; avoiding blocking on inode writeback in reclaim
and background inactivation are two of the major pieces of work we
needed done before we could even consider more closely aligning XFS
to the VFS inode cache life cycle model.

The next step is to move the background inode inactivation triggers
up into ->drop_inode so we can catch inodes that need to be dirtied
by the filesystem before they have been marked for eviction by the
VFS. This will allow us to keep the inode on the VFS LRU (probably
marked with I_WILL_FREE so everyone else keeps away from it) whilst
we are waiting for the background inactivation work to be done, the
journal flushed and the metadata written back. Once clean, we can
directly evict the inode from the VFS ourselves.

This would mean we only get clean, reclaimable inodes hitting the
evict() path, and so at that point we can just remove the inode
directly from the XFS inode cache from either ->destroy_inode or
->free_inode and RCU free it. The recycling of in-memory inodes in
xfs_iget_cache_hit can go away entirely because no inodes will
linger in the XFS inode cache without being visible at the VFS
layer as they do now...

That's going to take a fair bit of work to realise, and I'm not sure
yet exactly what mods are going to be needed to either the VFS inode
infrastructure or the XFS inode cache.

Cheers,

Dave.
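[For reference, the RCU-protected, tag-based radix tree walk described in
the previous message - claim candidates while still inside the read-side
critical section, then free them via call_rcu() so that lockless lookups
never see freed memory - has roughly the shape below. All names are invented
for the example; this is not the actual xfs_icache.c code, and reference
counting, writeback and per-AG iteration details are elided.]

	#include <linux/radix-tree.h>
	#include <linux/rcupdate.h>
	#include <linux/spinlock.h>

	#define RECLAIM_BATCH	32

	struct cached_inode {
		unsigned long	index;	/* radix tree index, e.g. inode number */
		struct rcu_head	rcu;
		/* ... */
	};

	/* Assumed helpers: try_claim_for_reclaim() atomically moves a clean
	 * "reclaimable" object to "being reclaimed"; free_cached_inode_cb()
	 * is the RCU callback that actually frees the memory. */
	static bool try_claim_for_reclaim(struct cached_inode *ci);
	static void free_cached_inode_cb(struct rcu_head *head);

	static void reclaim_pass(struct radix_tree_root *tree,
				 spinlock_t *tree_lock, unsigned int reclaim_tag)
	{
		struct cached_inode *batch[RECLAIM_BATCH];
		unsigned long index = 0;
		int nr, i;

		do {
			rcu_read_lock();
			nr = radix_tree_gang_lookup_tag(tree, (void **)batch,
							index, RECLAIM_BATCH,
							reclaim_tag);
			/* Claim candidates inside the RCU read section:
			 * nothing in the batch can be freed from under us
			 * here, and the claim keeps other walkers away once
			 * the read-side lock is dropped. */
			for (i = 0; i < nr; i++) {
				index = batch[i]->index + 1;
				if (!try_claim_for_reclaim(batch[i]))
					batch[i] = NULL;  /* reused or busy */
			}
			rcu_read_unlock();

			for (i = 0; i < nr; i++) {
				struct cached_inode *ci = batch[i];

				if (!ci)
					continue;
				spin_lock(tree_lock);
				radix_tree_delete(tree, ci->index);
				spin_unlock(tree_lock);
				/* memory is released only after a grace period */
				call_rcu(&ci->rcu, free_cached_inode_cb);
			}
		} while (nr);
	}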
On Tue, Jan 18, 2022 at 04:04:50PM +0000, Al Viro wrote:
> On Tue, Jan 18, 2022 at 09:29:11AM +0100, Christian Brauner wrote:
> > On Mon, Jan 17, 2022 at 06:10:36PM +0000, Al Viro wrote:
> > > On Mon, Jan 17, 2022 at 04:28:52PM +0000, Al Viro wrote:
> > >
> > > > IOW, ->free_inode() is RCU-delayed part of ->destroy_inode().  If both
> > > > are present, ->destroy_inode() will be called synchronously, followed
> > > > by ->free_inode() from RCU callback, so you can have both - moving just
> > > > the "finally mark for reuse" part into ->free_inode() would be OK.
> > > > Any blocking stuff (if any) can be left in ->destroy_inode()...
> > >
> > > BTW, we *do* have a problem with ext4 fast symlinks.  Pathwalk assumes that
> > > strings it parses are not changing under it.  There are rather delicate
> > > dances in dcache lookups re possibility of ->d_name contents changing under
> > > it, but the search key is assumed to be stable.
> > >
> > > What's more, there's a correctness issue even if we do not oops.  Currently
> > > we do not recheck ->d_seq of symlink dentry when we dismiss the symlink from
> > > the stack.  After all, we'd just finished traversing what used to be the
> > > contents of a symlink that used to be in the right place.  It might have been
> > > unlinked while we'd been traversing it, but that's not a correctness issue.
> > >
> > > But that critically depends upon the contents not getting mangled.  If it
> > > *can* be screwed by such unlink, we risk successful lookup leading to the
> >
> > Out of curiosity: whether or not it can get mangled depends on the
> > filesystem and how it implements fast symlinks or do fast symlinks
> > currently guarantee that contents are mangled?
>
> Not sure if I understand your question correctly...
>
> Filesystems should guarantee that the contents of string returned
> by ->get_link() (or pointed to by ->i_link) remains unchanged for as long
> as we are looking at it (until fs/namei.c:put_link() that drops it or
> fs/namei.c:drop_links() in the end of pathwalk).  Fast symlinks or not -
> doesn't matter.

Yep, got that.

> The only cases where that does not hold (there are two of them in
> the entire kernel) happen to be fast symlinks.  Both cases are bugs.

Ok, that's what I was essentially after: whether or not they were bugs
in the filesystems or it's a generic bug.

> ext4 case is actually easy to fix - the only reason it ends up mangling
> the string is the way ext4_truncate() implements its check for victim
> being a fast symlink (and thus needing no work).  It gets disrupted
> by zeroing ->i_size, which we need to do in this case (inode removal).
> That's not hard to get right.

Oh, I see, it zeroes i_size and erases i_data which obviously tramples
the fast symlink contents.

Given that ext4 makes use of i_flags for their ext4 inode containers,
why couldn't this just be something like

	#define EXT4_FAST_SYMLINK 0x<some-free-value>

	if (EXT4_I(inode)->i_flags & EXT4_FAST_SYMLINK)
		return <im-a-fast-symlink>;

? Which seems simpler and more obvious to someone reading that code than
logic based on subtracting blocks or checking i_size.
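[To make the point above concrete in a generic way: record the "fast
symlink" property explicitly when the inode is created, rather than
re-deriving it from i_size/i_blocks at teardown time, when those fields may
already have been changed. The names below are invented for illustration
and are not the current ext4 code.]

	#include <linux/fs.h>

	struct myfs_inode_info {
		struct inode	vfs_inode;
		bool		fast_symlink;	/* set once at symlink creation */
		char		inline_link[60];	/* body of a fast symlink */
	};

	static void myfs_truncate_on_evict(struct myfs_inode_info *mi)
	{
		if (mi->fast_symlink) {
			/* Leave the inline body intact: an RCU pathwalker may
			 * still be reading it through ->i_link until a grace
			 * period has passed. */
			return;
		}
		/* ... release the block-based symlink body ... */
	}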
On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote: > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote: > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote: > > > > > No, that just creates a black hole where the VFS inode has been > > > destroyed but the XFS inode cache doesn't know it's been trashed. > > > Hence setting XFS_IRECLAIMABLE needs to remain in the during > > > ->destroy_inode, otherwise the ->lookup side of the cache will think > > > that are currently still in use by the VFS and hand them straight > > > back out without going through the inode recycling code. > > > > > > i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS > > > part of the inode has been torn down, and that it must go back > > > through VFS re-initialisation before it can be re-instantiated as a > > > VFS inode. > > > > OK... > > > > > It would also mean that the inode will need to go through two RCU > > > grace periods before it gets reclaimed, because XFS uses RCU > > > protected inode cache lookups internally (e.g. for clustering dirty > > > inode writeback) and so freeing the inode from the internal > > > XFS inode cache requires RCU freeing... > > > > Wait a minute. Where is that RCU delay of yours, relative to > > xfs_vn_unlink() and xfs_vn_rename() (for target)? > > Both of those drop the inode on an on-disk unlinked list. When the > last reference goes away, ->destroy_inode then runs inactivation. > > Inactivation then runs transactions to free all the space attached > to the inode and then removes the inode from the unlinked list and > frees it. It then goes into the XFS_IRECLAIMABLE state and is dirty > in memory. It can't be reclaimed until the inode is written to disk > or the whole inode cluster is freed and the inode marked XFS_ISTALE > (so won't get written back). > > At that point, a background inode reclaim thread (runs every 5s) > does a RCU protected lockless radix tree walk to find > XFS_IRECLAIMABLE inodes (via radix tree tags). If they are clean, it > moves them to XFS_IRECLAIM state, deletes them from the radix tree > and frees them via a call_rcu() callback. > > If memory reclaim comes along sooner than this, the > ->free_cached_objects() superblock shrinker callback runs that RCU > protected lockless radix tree walk to find XFS_IRECLAIMABLE inodes. > > > And where does > > it happen in case of e.g. open() + unlink() + close()? > > Same thing - close() drops the last reference, the unlinked inode > goes through inactivation, then moves into the XFS_IRECLAIMABLE > state. > > The problem is not -quite- open-unlink-close. The problem case is > the reallocation of an on-disk inode in the case of > unlink-close-open(O_CREATE) operations because of the on-disk inode > allocator policy of aggressive reuse of recently freed inodes. In > that case the xfs_iget() lookup will reinstantiate the inode via > xfs_iget_recycle() and the inode will change identity between VFS > instantiations. > > This is where a RCU grace period is absolutely required, and we > don't currently have one. The bug was introduced with RCU freeing of > inodes (what, 15 years ago now?) and it's only recently that we've > realised this bug exists via code inspection. We really have no > evidence that it's actually been tripped over in the wild.... 
> To be fair, we have multiple reports of the NULL ->get_link() variant and my tests to this point to induce "unexpected" returns of that function don't manifest in as catastrophic a side effect, so might not be as immediately noticeable by users. I.e., returning non-string data doesn't seem to necessarily cause a crash in the vfs and the symlink to symlink variant is more of an unexpected redirection of a lookup. IOW, I think it's fairly logical to assume that if users are hitting the originally reported problem, they're likely dangerously close to the subsequent problems that have been identified from further inspection of the related code. I don't think this is purely a case of a "theoretical" problem that doesn't warrant some form of prioritized fix. > Unfortunately, the simple fix of adding syncronize_rcu() to > xfs_iget_recycle() causes significant performance regressions > because we hit this path quite frequently when workloads use lots of > temporary files - the on-disk inode allocator policy tends towards > aggressive re-use of inodes for small sets of temporary files. > > The problem XFS is trying to address is that the VFS inode lifecycle > does not cater for filesystems that need to both dirty and then > clean unlinked inodes between iput_final() and ->destroy_inode. It's > too late to be able to put the inode back on the LRU once we've > decided to drop the inode if we need to dirty it again. ANd because > evict() is part of the non-blocking memory reclaim, we aren't > supposed to block for arbitrarily long periods of time or create > unbound memory demand processing inode eviction (both of which XFS > can do in inactivation). > > IOWs, XFS can't free the inode until it's journal releases the > internal reference on the dirty inode. ext4 doesn't track inodes in > it's journal - it only tracks inode buffers that contain the changes > made to the inode, so once the transaction is committed in > ext4_evict_inode() the inode can be immediately freed via either > ->destroy_inode or ->free_inode. That option does not exist for XFS > because we have to wait for the journal to finish with the inode > before it can be freed. Hence all the background reclaim stuff. > > We've recently solved several of the problems we need to solve to > reduce the mismatch; avoiding blocking on inode writeback in reclaim > and background inactivation are two of the major pieces of work we > needed done before we could even consider more closely aligning XFS > to the VFS inode cache life cycle model. > The background inactivation work facilitates an incremental improvement by nature because destroyed inodes go directly to a queue instead of being processed synchronously. My most recent test to stamp the grace period info at inode destroy time and conditionally sync at reuse time shows pretty much no major cost because the common case is that a grace period has already expired by the time the queue populates, is processed and said inodes become reclaimable and reallocated. To go beyond just the performance result, if I open code the conditional sync for tracking purposes I only see something like 10-15 rcu waits out of the 36k allocation cycles. If I increase the background workload 4x, the allocation rate drops to ~33k cycles (which is still pretty much in line with baseline) and the rcu sync count increases to 70, which again is relatively nominal over tens of thousands of cycles. 
This all requires some more thorough testing, but I'm sure it won't be absolutely free for every possible workload or environment. But given that we know this infrastructure is fundamentally broken (by subtle compatibilities between XFS and the VFS that have evolved over time), will require some thought and time to fix properly in the filesystem, that users are running into problems very closely related to it, why not try to address the fundamental breakage if we can do so with an isolated change with minimal (but probably not zero) performance impact? I agree that the unconditional synchronize_rcu() on reuse approach is just not viable, but so far tests using cond_synchronize_rcu() seem fairly reasonable. Is there some other problem or concern with such an approach? Brian > The next step is to move the background inode inactivation triggers > up into ->drop_inode so we can catch inodes that need to be dirtied > by the filesysetm before they have been marked for eviction by the > VFS. This will allow us to keep the inode on the VFS LRU (probably > marked with I_WILL_FREE so everyone else keeps away from it) whilst > we are waiting for the background inactivation work to be done, the > journal flushed and the metadata written back. Once clean, we can > directly evict the inode from the VFS ourselves. > > This would mean we only get clean, reclaimable inodes hitting the > evict() path, and so at that point we can just remove the inode > directly from the XFS inode cache from either ->destroy_inode or > ->free_inode and RCU free it. The recycling of in-memory inodes in > xfs_iget_cache_hit can go away entirely because no inodes will > linger in the XFS inode cache without being visible at the VFS > layer as they do now... > > That's going to take a fair bit of work to realise, and I'm not sure > yet exactly what mods are going to be needed to either the VFS inode > infrastructure or the XFS inode cache. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Wed, Jan 19, 2022 at 09:08:22AM -0500, Brian Foster wrote: > On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote: > > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote: > > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote: > > Unfortunately, the simple fix of adding syncronize_rcu() to > > xfs_iget_recycle() causes significant performance regressions > > because we hit this path quite frequently when workloads use lots of > > temporary files - the on-disk inode allocator policy tends towards > > aggressive re-use of inodes for small sets of temporary files. > > > > The problem XFS is trying to address is that the VFS inode lifecycle > > does not cater for filesystems that need to both dirty and then > > clean unlinked inodes between iput_final() and ->destroy_inode. It's > > too late to be able to put the inode back on the LRU once we've > > decided to drop the inode if we need to dirty it again. ANd because > > evict() is part of the non-blocking memory reclaim, we aren't > > supposed to block for arbitrarily long periods of time or create > > unbound memory demand processing inode eviction (both of which XFS > > can do in inactivation). > > > > IOWs, XFS can't free the inode until it's journal releases the > > internal reference on the dirty inode. ext4 doesn't track inodes in > > it's journal - it only tracks inode buffers that contain the changes > > made to the inode, so once the transaction is committed in > > ext4_evict_inode() the inode can be immediately freed via either > > ->destroy_inode or ->free_inode. That option does not exist for XFS > > because we have to wait for the journal to finish with the inode > > before it can be freed. Hence all the background reclaim stuff. > > > > We've recently solved several of the problems we need to solve to > > reduce the mismatch; avoiding blocking on inode writeback in reclaim > > and background inactivation are two of the major pieces of work we > > needed done before we could even consider more closely aligning XFS > > to the VFS inode cache life cycle model. > > > > The background inactivation work facilitates an incremental improvement > by nature because destroyed inodes go directly to a queue instead of > being processed synchronously. My most recent test to stamp the grace > period info at inode destroy time and conditionally sync at reuse time > shows pretty much no major cost because the common case is that a grace > period has already expired by the time the queue populates, is processed > and said inodes become reclaimable and reallocated. Yup. Remember that I suggested these conditional variants in the first place - I do understand what this code does... > To go beyond just > the performance result, if I open code the conditional sync for tracking > purposes I only see something like 10-15 rcu waits out of the 36k > allocation cycles. If I increase the background workload 4x, the > allocation rate drops to ~33k cycles (which is still pretty much in line > with baseline) and the rcu sync count increases to 70, which again is > relatively nominal over tens of thousands of cycles. Yup. But that doesn't mean that the calls that trigger are free from impact. The cost and latency of waiting for an RCU grace period to expire goes up as the CPU count goes up. e.g. it requires every CPU running a task goes through a context switch before it returns. Hence if we end up with situations like, say, the ioend completion scheduling holdoffs, then that will prevent the RCU sync from returning for seconds. 
it requires every CPU
running a task goes through a context switch before it returns.
Hence if we end up with situations like, say, the ioend completion
scheduling holdoffs, then that will prevent the RCU sync from
returning for seconds.

IOWs, we're effectively adding unpredictable and non-deterministic
latency into the recycle path that is visible to userspace
applications, and those latencies can be caused by subsystem
functionality not related to XFS. Hence we need to carefully
consider unexpected side-effects of adding a kernel global
synchronisation point into an XFS icache lookup fast path, and these
issues may not be immediately obvious from testing...

> This all requires some more thorough testing, but I'm sure it won't be
> absolutely free for every possible workload or environment. But given
> that we know this infrastructure is fundamentally broken (by subtle
> compatibilities between XFS and the VFS that have evolved over time),
> will require some thought and time to fix properly in the filesystem,
> that users are running into problems very closely related to it, why not
> try to address the fundamental breakage if we can do so with an isolated
> change with minimal (but probably not zero) performance impact?
>
> I agree that the unconditional synchronize_rcu() on reuse approach is
> just not viable, but so far tests using cond_synchronize_rcu() seem
> fairly reasonable. Is there some other problem or concern with such an
> approach?

Just that the impact of adding RCU sync points means that bad
behaviour outside XFS has a new point where it can adversely
impact applications doing filesystem operations.

As a temporary mitigation strategy I think it will probably be fine,
but I'd much prefer we get rid of the need for such an RCU sync
point rather than try to maintain a mitigation like this in fast
path code forever.

Cheers,

Dave.
On Thu, Jan 20, 2022 at 09:07:47AM +1100, Dave Chinner wrote: > On Wed, Jan 19, 2022 at 09:08:22AM -0500, Brian Foster wrote: > > On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote: > > > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote: > > > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote: > > > Unfortunately, the simple fix of adding syncronize_rcu() to > > > xfs_iget_recycle() causes significant performance regressions > > > because we hit this path quite frequently when workloads use lots of > > > temporary files - the on-disk inode allocator policy tends towards > > > aggressive re-use of inodes for small sets of temporary files. > > > > > > The problem XFS is trying to address is that the VFS inode lifecycle > > > does not cater for filesystems that need to both dirty and then > > > clean unlinked inodes between iput_final() and ->destroy_inode. It's > > > too late to be able to put the inode back on the LRU once we've > > > decided to drop the inode if we need to dirty it again. ANd because > > > evict() is part of the non-blocking memory reclaim, we aren't > > > supposed to block for arbitrarily long periods of time or create > > > unbound memory demand processing inode eviction (both of which XFS > > > can do in inactivation). > > > > > > IOWs, XFS can't free the inode until it's journal releases the > > > internal reference on the dirty inode. ext4 doesn't track inodes in > > > it's journal - it only tracks inode buffers that contain the changes > > > made to the inode, so once the transaction is committed in > > > ext4_evict_inode() the inode can be immediately freed via either > > > ->destroy_inode or ->free_inode. That option does not exist for XFS > > > because we have to wait for the journal to finish with the inode > > > before it can be freed. Hence all the background reclaim stuff. > > > > > > We've recently solved several of the problems we need to solve to > > > reduce the mismatch; avoiding blocking on inode writeback in reclaim > > > and background inactivation are two of the major pieces of work we > > > needed done before we could even consider more closely aligning XFS > > > to the VFS inode cache life cycle model. > > > > > > > The background inactivation work facilitates an incremental improvement > > by nature because destroyed inodes go directly to a queue instead of > > being processed synchronously. My most recent test to stamp the grace > > period info at inode destroy time and conditionally sync at reuse time > > shows pretty much no major cost because the common case is that a grace > > period has already expired by the time the queue populates, is processed > > and said inodes become reclaimable and reallocated. > > Yup. Remember that I suggested these conditional variants in the > first place - I do understand what this code does... > > > To go beyond just > > the performance result, if I open code the conditional sync for tracking > > purposes I only see something like 10-15 rcu waits out of the 36k > > allocation cycles. If I increase the background workload 4x, the > > allocation rate drops to ~33k cycles (which is still pretty much in line > > with baseline) and the rcu sync count increases to 70, which again is > > relatively nominal over tens of thousands of cycles. > > Yup. But that doesn't mean that the calls that trigger are free from > impact. The cost and latency of waiting for an RCU grace period to > expire goes up as the CPU count goes up. e.g. 
it requires every CPU > running a task goes through a context switch before it returns. > Hence if we end up with situations like, say, the ioend completion > scheduling holdoffs, then that will prevent the RCU sync from > returning for seconds. > Sure... this is part of the reason the tests I've run so far have all tried to incorporate background rcuwalk activity, run on a higher cpu count box, etc. And from the XFS side of the coin, the iget code can invoke xfs_inodegc_queue_all() in the needs_inactive case before reclaimable state is a possibility, which queues a work on every cpu with pending inactive inodes. That is probably unlikely in the free/alloc case (since needs_inactive inodes are not yet free on disk), but the broader points are that the inactive processing work has to complete one way or another before reclaimable state is possible and that we can probably accommodate a synchronization point here if it's reasonably filtered. Otherwise... > IOWs, we're effectively adding unpredictable and non-deterministic > latency into the recycle path that is visible to userspace > applications, and those latencies can be caused by subsystem > functionality not related to XFS. Hence we need to carefully > consider unexpected side-effects of adding a kernel global > synchronisation point into a XFS icache lookup fast path, and these > issues may not be immediately obvious from testing... > ... agreed. I think at this point we've also discussed various potential ways to shift or minimize latency/cost further, so there's probably still room for refinement if such unexpected things crop up before... > > This all requires some more thorough testing, but I'm sure it won't be > > absolutely free for every possible workload or environment. But given > > that we know this infrastructure is fundamentally broken (by subtle > > compatibilities between XFS and the VFS that have evolved over time), > > will require some thought and time to fix properly in the filesystem, > > that users are running into problems very closely related to it, why not > > try to address the fundamental breakage if we can do so with an isolated > > change with minimal (but probably not zero) performance impact? > > > > I agree that the unconditional synchronize_rcu() on reuse approach is > > just not viable, but so far tests using cond_synchronize_rcu() seem > > fairly reasonable. Is there some other problem or concern with such an > > approach? > > Just that the impact of adding RCU sync points means that bad > behaviour outside XFS have a new point where they can adversely > impact on applications doing filesystem operations. > > As a temporary mitigation strategy I think it will probably be fine, > but I'd much prefer we get rid of the need for such an RCU sync > point rather than try to maintain a mitigation like this in fast > path code forever. > ... we end up here. All in all, this is intended to be a practical/temporary step toward functional correctness that minimizes performance impact and disruption (i.e. just as easy to remove when made unnecessary). Al, The caveat to this is I think the practicality of a conditional sync in the iget recycle code sort of depends on the queueing/batching nature of inactive inode processing in XFS. If you look at xfs_fs_destroy_inode() for example, you'll see this is all fairly recent feature/infrastructure code and that historically we completed most of this transition to reclaimable state before ->destroy_inode() returns. 
Therefore, the concern I have is that on older/stable kernels (where users are hitting this NULL ->get_link() BUG) the reuse code is far more likely to stall and slow down here with this change alone (see my earlier numbers on the unconditional sync_rcu() test for prospective worst case). For that reason, I'm not sure this is really a backportable solution. So getting back to your concern around Ian's patch being a stopgap/bandaid type solution, would you be willing to pull something like Ian's patch to the vfs if upstream XFS adopts this conditional rcu sync in the iget reuse code? I think that would ensure that no further bandaid fixes will be required in the vfs due to XFS inode reuse, but would also provide an isolated variant of the fix in the VFS that is more easily backported to stable kernels. Thoughts? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Thu, Jan 20, 2022 at 11:03:28AM -0500, Brian Foster wrote: > On Thu, Jan 20, 2022 at 09:07:47AM +1100, Dave Chinner wrote: > > On Wed, Jan 19, 2022 at 09:08:22AM -0500, Brian Foster wrote: > > > On Wed, Jan 19, 2022 at 10:25:47AM +1100, Dave Chinner wrote: > > > > On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote: > > > > > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote: > > > > Unfortunately, the simple fix of adding syncronize_rcu() to > > > > xfs_iget_recycle() causes significant performance regressions > > > > because we hit this path quite frequently when workloads use lots of > > > > temporary files - the on-disk inode allocator policy tends towards > > > > aggressive re-use of inodes for small sets of temporary files. > > > > > > > > The problem XFS is trying to address is that the VFS inode lifecycle > > > > does not cater for filesystems that need to both dirty and then > > > > clean unlinked inodes between iput_final() and ->destroy_inode. It's > > > > too late to be able to put the inode back on the LRU once we've > > > > decided to drop the inode if we need to dirty it again. ANd because > > > > evict() is part of the non-blocking memory reclaim, we aren't > > > > supposed to block for arbitrarily long periods of time or create > > > > unbound memory demand processing inode eviction (both of which XFS > > > > can do in inactivation). > > > > > > > > IOWs, XFS can't free the inode until it's journal releases the > > > > internal reference on the dirty inode. ext4 doesn't track inodes in > > > > it's journal - it only tracks inode buffers that contain the changes > > > > made to the inode, so once the transaction is committed in > > > > ext4_evict_inode() the inode can be immediately freed via either > > > > ->destroy_inode or ->free_inode. That option does not exist for XFS > > > > because we have to wait for the journal to finish with the inode > > > > before it can be freed. Hence all the background reclaim stuff. > > > > > > > > We've recently solved several of the problems we need to solve to > > > > reduce the mismatch; avoiding blocking on inode writeback in reclaim > > > > and background inactivation are two of the major pieces of work we > > > > needed done before we could even consider more closely aligning XFS > > > > to the VFS inode cache life cycle model. > > > > > > > > > > The background inactivation work facilitates an incremental improvement > > > by nature because destroyed inodes go directly to a queue instead of > > > being processed synchronously. My most recent test to stamp the grace > > > period info at inode destroy time and conditionally sync at reuse time > > > shows pretty much no major cost because the common case is that a grace > > > period has already expired by the time the queue populates, is processed > > > and said inodes become reclaimable and reallocated. > > > > Yup. Remember that I suggested these conditional variants in the > > first place - I do understand what this code does... > > > > > To go beyond just > > > the performance result, if I open code the conditional sync for tracking > > > purposes I only see something like 10-15 rcu waits out of the 36k > > > allocation cycles. If I increase the background workload 4x, the > > > allocation rate drops to ~33k cycles (which is still pretty much in line > > > with baseline) and the rcu sync count increases to 70, which again is > > > relatively nominal over tens of thousands of cycles. > > > > Yup. 
But that doesn't mean that the calls that trigger are free from > > impact. The cost and latency of waiting for an RCU grace period to > > expire goes up as the CPU count goes up. e.g. it requires every CPU > > running a task goes through a context switch before it returns. > > Hence if we end up with situations like, say, the ioend completion > > scheduling holdoffs, then that will prevent the RCU sync from > > returning for seconds. > > > > Sure... this is part of the reason the tests I've run so far have all > tried to incorporate background rcuwalk activity, run on a higher cpu > count box, etc. And from the XFS side of the coin, the iget code can > invoke xfs_inodegc_queue_all() in the needs_inactive case before > reclaimable state is a possibility, which queues a work on every cpu > with pending inactive inodes. That is probably unlikely in the > free/alloc case (since needs_inactive inodes are not yet free on disk), > but the broader points are that the inactive processing work has to > complete one way or another before reclaimable state is possible and > that we can probably accommodate a synchronization point here if it's > reasonably filtered. Otherwise... > > > IOWs, we're effectively adding unpredictable and non-deterministic > > latency into the recycle path that is visible to userspace > > applications, and those latencies can be caused by subsystem > > functionality not related to XFS. Hence we need to carefully > > consider unexpected side-effects of adding a kernel global > > synchronisation point into a XFS icache lookup fast path, and these > > issues may not be immediately obvious from testing... > > > > ... agreed. I think at this point we've also discussed various potential > ways to shift or minimize latency/cost further, so there's probably > still room for refinement if such unexpected things crop up before... > > > > This all requires some more thorough testing, but I'm sure it won't be > > > absolutely free for every possible workload or environment. But given > > > that we know this infrastructure is fundamentally broken (by subtle > > > compatibilities between XFS and the VFS that have evolved over time), > > > will require some thought and time to fix properly in the filesystem, > > > that users are running into problems very closely related to it, why not > > > try to address the fundamental breakage if we can do so with an isolated > > > change with minimal (but probably not zero) performance impact? > > > > > > I agree that the unconditional synchronize_rcu() on reuse approach is > > > just not viable, but so far tests using cond_synchronize_rcu() seem > > > fairly reasonable. Is there some other problem or concern with such an > > > approach? > > > > Just that the impact of adding RCU sync points means that bad > > behaviour outside XFS have a new point where they can adversely > > impact on applications doing filesystem operations. > > > > As a temporary mitigation strategy I think it will probably be fine, > > but I'd much prefer we get rid of the need for such an RCU sync > > point rather than try to maintain a mitigation like this in fast > > path code forever. > > > > ... we end up here. All in all, this is intended to be a > practical/temporary step toward functional correctness that minimizes > performance impact and disruption (i.e. just as easy to remove when made > unnecessary). 
>
> Al,
>
> The caveat to this is I think the practicality of a conditional sync in
> the iget recycle code sort of depends on the queueing/batching nature of
> inactive inode processing in XFS. If you look at xfs_fs_destroy_inode()
> for example, you'll see this is all fairly recent feature/infrastructure
> code and that historically we completed most of this transition to
> reclaimable state before ->destroy_inode() returns. Therefore, the
> concern I have is that on older/stable kernels (where users are hitting
> this NULL ->get_link() BUG) the reuse code is far more likely to stall
> and slow down here with this change alone (see my earlier numbers on the
> unconditional sync_rcu() test for prospective worst case). For that
> reason, I'm not sure this is really a backportable solution.
>
> So getting back to your concern around Ian's patch being a
> stopgap/bandaid type solution, would you be willing to pull something
> like Ian's patch to the vfs if upstream XFS adopts this conditional rcu
> sync in the iget reuse code? I think that would ensure that no further
> bandaid fixes will be required in the vfs due to XFS inode reuse, but
> would also provide an isolated variant of the fix in the VFS that is
> more easily backported to stable kernels. Thoughts?
>
> Brian
>

Oh and I meant to throw up a diff of what I was testing for reference.
My test code was significantly more hacked up with debug code and such,
but if I clean it up a bit the change reduces to the diff below.

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d019c98eb839..8a24ced4d73a 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -349,6 +349,10 @@ xfs_iget_recycle(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
+	ASSERT(ip->i_destroy_gp != 0);
+	cond_synchronize_rcu(ip->i_destroy_gp);
+	ip->i_destroy_gp = 0;
+
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	error = xfs_reinit_inode(mp, inode);
 	if (error) {
@@ -2019,6 +2023,7 @@ xfs_inodegc_queue(
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags |= XFS_NEED_INACTIVE;
+	ip->i_destroy_gp = start_poll_synchronize_rcu();
 	spin_unlock(&ip->i_flags_lock);
 
 	gc = get_cpu_ptr(mp->m_inodegc);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c447bf04205a..a978da2332a0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -75,6 +75,8 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	unsigned long		i_destroy_gp;
 } xfs_inode_t;
 
 /* Convert from vfs inode to xfs inode */
diff --git a/fs/namei.c b/fs/namei.c
index 1f9d2187c765..37a7dba3083b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1760,8 +1760,11 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 	if (!res) {
 		const char * (*get)(struct dentry *, struct inode *,
 				    struct delayed_call *);
-		get = inode->i_op->get_link;
+		get = READ_ONCE(inode->i_op->get_link);
 		if (nd->flags & LOOKUP_RCU) {
+			/* Does the inode still match the associated dentry? */
+			if (unlikely(read_seqcount_retry(&link->dentry->d_seq, last->seq)))
+				return ERR_PTR(-ECHILD);
 			res = get(NULL, inode, &last->done);
 			if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
 				res = get(link->dentry, inode, &last->done);
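[For context on what the added check buys: d_seq is a seqcount, and the
generic read-side pattern looks like the sketch below - an illustration,
not namei.c itself. The hunk above reuses a sequence value sampled earlier
in the walk (last->seq) and simply re-validates it before calling
->get_link(), falling back to ref-walk with -ECHILD if the dentry changed
in the meantime.]

	#include <linux/seqlock.h>

	struct guarded {
		seqcount_t	seq;	/* writers bump this around updates */
		int		a, b;
	};

	/* Reader: retry if a writer ran concurrently with the reads. */
	static void read_pair(struct guarded *g, int *a, int *b)
	{
		unsigned int start;

		do {
			start = read_seqcount_begin(&g->seq);
			*a = g->a;
			*b = g->b;
		} while (read_seqcount_retry(&g->seq, start));
	}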
When following a trailing symlink in rcu-walk mode it's possible for
the dentry to become invalid between the last dentry seq lock check
and getting the link (eg. an unlink) leading to a backtrace similar
to this:

crash> bt
PID: 10964  TASK: ffff951c8aa92f80  CPU: 3  COMMAND: "TaniumCX"
…
 #7 [ffffae44d0a6fbe0] page_fault at ffffffff8d6010fe
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffffae44d0a6fc90  RFLAGS: 00010246
    RAX: ffffffff8da3cc80  RBX: ffffae44d0a6fd30  RCX: 0000000000000000
    RDX: ffffae44d0a6fd98  RSI: ffff951aa9af3008  RDI: 0000000000000000
    RBP: 0000000000000000   R8: ffffae44d0a6fb94   R9: 0000000000000000
    R10: ffff951c95d8c318  R11: 0000000000080000  R12: ffffae44d0a6fd98
    R13: ffff951aa9af3008  R14: ffff951c8c9eb840  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffae44d0a6fc90] trailing_symlink at ffffffff8cf24e61
 #9 [ffffae44d0a6fcc8] path_lookupat at ffffffff8cf261d1
#10 [ffffae44d0a6fd28] filename_lookup at ffffffff8cf2a700
#11 [ffffae44d0a6fe40] vfs_statx at ffffffff8cf1dbc4
#12 [ffffae44d0a6fe98] __do_sys_newstat at ffffffff8cf1e1f9
#13 [ffffae44d0a6ff38] do_syscall_64 at ffffffff8cc0420b

Most of the time this is not a problem because the inode is unchanged
while the rcu read lock is held.

But xfs can re-use inodes which can result in the inode ->get_link()
method becoming invalid (or NULL).

This case needs to be checked for in fs/namei.c:get_link() and if
detected the walk re-started.

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/namei.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)