Message ID: 20191031234618.15403-29-david@fromorbit.com (mailing list archive)
State: New, archived
Series: mm, xfs: non-blocking inode reclaim
On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Looking up an unreferenced inode in the inode cache is a bit hairy. > We do this for inode invalidation and writeback clustering purposes, > which is all invisible to the VFS. Hence we can't take reference > counts to the inode and so must be very careful how we do it. > > There are several different places that all do the lookups and > checks slightly differently. Fundamentally, though, they are all > racy and inode reclaim has to block waiting for the inode lock if it > loses the race. This is not very optimal given all the work we've > already done to make reclaim non-blocking. > > We can make the reclaim process nonblocking with a couple of simple > changes. If we define the unreferenced lookup process in a way that > will either always grab an inode in a way that reclaim will notice > and skip, or will notice a reclaim has grabbed the inode so it can > skip the inode, then there is no need for reclaim to need to cycle > the inode ILOCK at all. > > Selecting an inode for reclaim is already non-blocking, so if the > ILOCK is held the inode will be skipped. If we ensure that reclaim > holds the ILOCK until the inode is freed, then we can do the same > thing in the unreferenced lookup to avoid inodes in reclaim. We can > do this simply by holding the ILOCK until the RCU grace period > expires and the inode freeing callback is run. As all unreferenced > lookups have to hold the rcu_read_lock(), we are guaranteed that > a reclaimed inode will be noticed as the trylock will fail. > ...
> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/mrlock.h | 27 +++++++++ > fs/xfs/xfs_icache.c | 88 +++++++++++++++++++++-------- > fs/xfs/xfs_inode.c | 131 +++++++++++++++++++++----------------------- > 3 files changed, 153 insertions(+), 93 deletions(-) > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h > index 79155eec341b..1752a2592bcc 100644 > --- a/fs/xfs/mrlock.h > +++ b/fs/xfs/mrlock.h ... > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 11bf4768d491..45ee3b5cd873 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -106,6 +106,7 @@ xfs_inode_free_callback( > ip->i_itemp = NULL; > } > > + mrunlock_excl_non_owner(&ip->i_lock); > kmem_zone_free(xfs_inode_zone, ip); > } > > @@ -132,6 +133,7 @@ xfs_inode_free( > * free state. The ip->i_flags_lock provides the barrier against lookup > * races. > */ > + mrupdate_non_owner(&ip->i_lock); Can we tie these into the proper locking interface using flags? For example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER) or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps? > spin_lock(&ip->i_flags_lock); > ip->i_flags = XFS_IRECLAIM; > ip->i_ino = 0; > @@ -295,11 +297,24 @@ xfs_iget_cache_hit( > } > > /* > - * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode > - * from stomping over us while we recycle the inode. Remove it > - * from the LRU straight away so we can re-init the VFS inode. > + * Before we reinitialise the inode, we need to make sure > + * reclaim does not pull it out from underneath us. We already > + * hold the i_flags_lock, and because the XFS_IRECLAIM is not > + * set we know the inode is still on the LRU. However, the LRU > + * code may have just selected this inode to reclaim, so we need > + * to ensure we hold the i_flags_lock long enough for the > + * trylock in xfs_inode_reclaim_isolate() to fail. 
We do this by > + * removing the inode from the LRU, which will spin on the LRU > + * list locks until reclaim stops walking, at which point we > + * know there is no possible race between reclaim isolation and > + * this lookup. > + * Somewhat related to my question about the lru_lock on the earlier patch. > + * We also set the XFS_IRECLAIM flag here while trying to do the > + * re-initialisation to prevent multiple racing lookups on this > + * inode from all landing here at the same time. > */ > ip->i_flags |= XFS_IRECLAIM; > + list_lru_del(&mp->m_inode_lru, &inode->i_lru); > spin_unlock(&ip->i_flags_lock); > rcu_read_unlock(); > ... > @@ -1022,19 +1076,7 @@ xfs_dispose_inode( > spin_unlock(&pag->pag_ici_lock); > xfs_perag_put(pag); > > - /* > - * Here we do an (almost) spurious inode lock in order to coordinate > - * with inode cache radix tree lookups. This is because the lookup > - * can reference the inodes in the cache without taking references. > - * > - * We make that OK here by ensuring that we wait until the inode is > - * unlocked after the lookup before we go ahead and free it. > - * > - * XXX: need to check this is still true. Not sure it is. > - */ > - xfs_ilock(ip, XFS_ILOCK_EXCL); > xfs_qm_dqdetach(ip); > - xfs_iunlock(ip, XFS_ILOCK_EXCL); Ok, so I'm staring at this a bit more and think I'm missing something. If we put aside the change to hold ilock until the inode is freed, we basically have the following (simplified) flow as the inode goes from isolation to disposal: ilock (isolate) iflock set XFS_IRECLAIM ifunlock (disposal) iunlock radix delete ilock cycle (drain) rcu free What we're trying to eliminate is the ilock cycle to drain any concurrent unreferenced lookups from accessing the inode once it is freed. The free itself is still RCU protected. 
Looking over at the ifree path, we now have something like this: rcu_read_lock() radix lookup check XFS_IRECLAIM ilock if XFS_ISTALE, skip set XFS_ISTALE rcu_read_unlock() iflock /* return locked down inode */ Given that we set XFS_IRECLAIM under ilock, would we still need either the ilock cycle or to hold ilock through the RCU free if the ifree side (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the rcu read lock)? ISTM we should either have a non-reclaim inode with ilock protection or a reclaim inode with RCU protection (so we can skip it before it frees), but I could easily be missing something here.. > > __xfs_inode_free(ip); > } > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 33edb18098ca..5c0be82195fc 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -2538,60 +2538,63 @@ xfs_ifree_get_one_inode( > if (!ip) > goto out_rcu_unlock; > > + Extra whitespace here. > + spin_lock(&ip->i_flags_lock); > + if (!ip->i_ino || ip->i_ino != inum || > + __xfs_iflags_test(ip, XFS_IRECLAIM)) > + goto out_iflags_unlock; > + > /* > - * because this is an RCU protected lookup, we could find a recently > - * freed or even reallocated inode during the lookup. We need to check > - * under the i_flags_lock for a valid inode here. Skip it if it is not > - * valid, the wrong inode or stale. > + * We've got the right inode and it isn't in reclaim but it might be > + * locked by someone else. In that case, we retry the inode rather than > + * skipping it completely as we have to process it with the cluster > + * being freed. 
> */ > - spin_lock(&ip->i_flags_lock); > - if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) { > + if (ip != free_ip && !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > spin_unlock(&ip->i_flags_lock); > - goto out_rcu_unlock; > + rcu_read_unlock(); > + delay(1); > + goto retry; > } > - spin_unlock(&ip->i_flags_lock); > > /* > - * Don't try to lock/unlock the current inode, but we _cannot_ skip the > - * other inodes that we did not find in the list attached to the buffer > - * and are not already marked stale. If we can't lock it, back off and > - * retry. > + * Inode is now pinned against reclaim until we unlock it. If the inode > + * is already marked stale, then it has already been flush locked and > + * attached to the buffer so we don't need to do anything more here. > */ > - if (ip != free_ip) { > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > - rcu_read_unlock(); > - delay(1); > - goto retry; > - } > - > - /* > - * Check the inode number again in case we're racing with > - * freeing in xfs_reclaim_inode(). See the comments in that > - * function for more information as to why the initial check is > - * not sufficient. > - */ > - if (ip->i_ino != inum) { > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { Is there a correctness reason for why we move the stale check to under ilock (in both iflush/ifree)? > + if (ip != free_ip) > xfs_iunlock(ip, XFS_ILOCK_EXCL); > - goto out_rcu_unlock; > - } > + goto out_iflags_unlock; > } > + __xfs_iflags_set(ip, XFS_ISTALE); > + spin_unlock(&ip->i_flags_lock); > rcu_read_unlock(); > > + /* > + * The flush lock will now hold off inode reclaim until the buffer > + * completion routine runs the xfs_istale_done callback and unlocks the > + * flush lock. Hence the caller can safely drop the ILOCK when it is > + * done attaching the inode to the cluster buffer. 
> + */ > xfs_iflock(ip); > - xfs_iflags_set(ip, XFS_ISTALE); > > /* > - * We don't need to attach clean inodes or those only with unlogged > - * changes (which we throw away, anyway). > + * We don't need to attach clean inodes to the buffer - they are marked > + * stale in memory now and will need to be re-initialised by inode > + * allocation before they can be reused. > */ > if (!ip->i_itemp || xfs_inode_clean(ip)) { > ASSERT(ip != free_ip); > xfs_ifunlock(ip); > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > + if (ip != free_ip) > + xfs_iunlock(ip, XFS_ILOCK_EXCL); There's an assert against this case just above, though I suppose there's nothing wrong with just keeping it and making the functional code more cautious. Brian > goto out_no_inode; > } > return ip; > > +out_iflags_unlock: > + spin_unlock(&ip->i_flags_lock); > out_rcu_unlock: > rcu_read_unlock(); > out_no_inode: > @@ -3519,44 +3522,40 @@ xfs_iflush_cluster( > continue; > > /* > - * because this is an RCU protected lookup, we could find a > - * recently freed or even reallocated inode during the lookup. > - * We need to check under the i_flags_lock for a valid inode > - * here. Skip it if it is not valid or the wrong inode. > + * See xfs_dispose_inode() for an explanation of the > + * tests here to avoid inode reclaim races. > */ > spin_lock(&cip->i_flags_lock); > if (!cip->i_ino || > - __xfs_iflags_test(cip, XFS_ISTALE)) { > + __xfs_iflags_test(cip, XFS_IRECLAIM)) { > spin_unlock(&cip->i_flags_lock); > continue; > } > > - /* > - * Once we fall off the end of the cluster, no point checking > - * any more inodes in the list because they will also all be > - * outside the cluster. 
> - */ > + /* ILOCK will pin the inode against reclaim */ > + if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) { > + spin_unlock(&cip->i_flags_lock); > + continue; > + } > + > + if (__xfs_iflags_test(cip, XFS_ISTALE)) { > + xfs_iunlock(cip, XFS_ILOCK_SHARED); > + spin_unlock(&cip->i_flags_lock); > + continue; > + } > + > + /* Lookup can find inodes outside the cluster being flushed. */ > if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) { > + xfs_iunlock(cip, XFS_ILOCK_SHARED); > spin_unlock(&cip->i_flags_lock); > break; > } > spin_unlock(&cip->i_flags_lock); > > /* > - * Do an un-protected check to see if the inode is dirty and > - * is a candidate for flushing. These checks will be repeated > - * later after the appropriate locks are acquired. > - */ > - if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0) > - continue; > - > - /* > - * Try to get locks. If any are unavailable or it is pinned, > + * If we can't get the flush lock now or the inode is pinned, > * then this inode cannot be flushed and is skipped. > */ > - > - if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) > - continue; > if (!xfs_iflock_nowait(cip)) { > xfs_iunlock(cip, XFS_ILOCK_SHARED); > continue; > @@ -3567,22 +3566,9 @@ xfs_iflush_cluster( > continue; > } > > - > /* > - * Check the inode number again, just to be certain we are not > - * racing with freeing in xfs_reclaim_inode(). See the comments > - * in that function for more information as to why the initial > - * check is not sufficient. > - */ > - if (!cip->i_ino) { > - xfs_ifunlock(cip); > - xfs_iunlock(cip, XFS_ILOCK_SHARED); > - continue; > - } > - > - /* > - * arriving here means that this inode can be flushed. First > - * re-check that it's dirty before flushing. > + * Arriving here means that this inode can be flushed. First > + * check that it's dirty before flushing. 
> */ > if (!xfs_inode_clean(cip)) { > int error; > @@ -3596,6 +3582,7 @@ xfs_iflush_cluster( > xfs_ifunlock(cip); > } > xfs_iunlock(cip, XFS_ILOCK_SHARED); > + /* unsafe to reference cip from here */ > } > > if (clcount) { > @@ -3634,7 +3621,11 @@ xfs_iflush_cluster( > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); > > - /* abort the corrupt inode, as it was not attached to the buffer */ > + /* > + * Abort the corrupt inode, as it was not attached to the buffer. It is > + * unlocked, but still pinned against reclaim by the flush lock so it is > + * safe to reference here until after the flush abort completes. > + */ > xfs_iflush_abort(cip, false); > kmem_free(cilist); > xfs_perag_put(pag); > -- > 2.24.0.rc0 >
On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote: > On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Looking up an unreferenced inode in the inode cache is a bit hairy. > > We do this for inode invalidation and writeback clustering purposes, > > which is all invisible to the VFS. Hence we can't take reference > > counts to the inode and so must be very careful how we do it. > > > > There are several different places that all do the lookups and > > checks slightly differently. Fundamentally, though, they are all > > racy and inode reclaim has to block waiting for the inode lock if it > > loses the race. This is not very optimal given all the work we;ve > > already done to make reclaim non-blocking. > > > > We can make the reclaim process nonblocking with a couple of simple > > changes. If we define the unreferenced lookup process in a way that > > will either always grab an inode in a way that reclaim will notice > > and skip, or will notice a reclaim has grabbed the inode so it can > > skip the inode, then there is no need for reclaim to need to cycle > > the inode ILOCK at all. > > > > Selecting an inode for reclaim is already non-blocking, so if the > > ILOCK is held the inode will be skipped. If we ensure that reclaim > > holds the ILOCK until the inode is freed, then we can do the same > > thing in the unreferenced lookup to avoid inodes in reclaim. We can > > do this simply by holding the ILOCK until the RCU grace period > > expires and the inode freeing callback is run. As all unreferenced > > lookups have to hold the rcu_read_lock(), we are guaranteed that > > a reclaimed inode will be noticed as the trylock will fail. > > > ... 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/mrlock.h | 27 +++++++++ > > fs/xfs/xfs_icache.c | 88 +++++++++++++++++++++-------- > > fs/xfs/xfs_inode.c | 131 +++++++++++++++++++++----------------------- > > 3 files changed, 153 insertions(+), 93 deletions(-) > > > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h > > index 79155eec341b..1752a2592bcc 100644 > > --- a/fs/xfs/mrlock.h > > +++ b/fs/xfs/mrlock.h > ... > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index 11bf4768d491..45ee3b5cd873 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -106,6 +106,7 @@ xfs_inode_free_callback( > > ip->i_itemp = NULL; > > } > > > > + mrunlock_excl_non_owner(&ip->i_lock); > > kmem_zone_free(xfs_inode_zone, ip); > > } > > > > @@ -132,6 +133,7 @@ xfs_inode_free( > > * free state. The ip->i_flags_lock provides the barrier against lookup > > * races. > > */ > > + mrupdate_non_owner(&ip->i_lock); > > Can we tie these into the proper locking interface using flags? For > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER) > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps? I'd prefer not to make this part of the common locking interface - it's a one off special use case, not something we want to progate elsewhere into the code. Now that I think over it, I probably should have tagged this with patch with [RFC]. I think we should just get rid of the mrlock wrappers rather than add more, and that would simplify this a lot. > > spin_lock(&ip->i_flags_lock); > > ip->i_flags = XFS_IRECLAIM; > > ip->i_ino = 0; > > @@ -295,11 +297,24 @@ xfs_iget_cache_hit( > > } > > > > /* > > - * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode > > - * from stomping over us while we recycle the inode. Remove it > > - * from the LRU straight away so we can re-init the VFS inode. > > + * Before we reinitialise the inode, we need to make sure > > + * reclaim does not pull it out from underneath us. 
We already > > + * hold the i_flags_lock, and because the XFS_IRECLAIM is not > > + * set we know the inode is still on the LRU. However, the LRU > > + * code may have just selected this inode to reclaim, so we need > > + * to ensure we hold the i_flags_lock long enough for the > > + * trylock in xfs_inode_reclaim_isolate() to fail. We do this by > > + * removing the inode from the LRU, which will spin on the LRU > > + * list locks until reclaim stops walking, at which point we > > + * know there is no possible race between reclaim isolation and > > + * this lookup. > > + * > > Somewhat related to my question about the lru_lock on the earlier patch. *nod* The caveat here is that this is the slow path so spinning for a while doesn't really matter. > > @@ -1022,19 +1076,7 @@ xfs_dispose_inode( > > spin_unlock(&pag->pag_ici_lock); > > xfs_perag_put(pag); > > > > - /* > > - * Here we do an (almost) spurious inode lock in order to coordinate > > - * with inode cache radix tree lookups. This is because the lookup > > - * can reference the inodes in the cache without taking references. > > - * > > - * We make that OK here by ensuring that we wait until the inode is > > - * unlocked after the lookup before we go ahead and free it. > > - * > > - * XXX: need to check this is still true. Not sure it is. > > - */ > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > xfs_qm_dqdetach(ip); > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > Ok, so I'm staring at this a bit more and think I'm missing something. > If we put aside the change to hold ilock until the inode is freed, we > basically have the following (simplified) flow as the inode goes from > isolation to disposal: > > ilock (isolate) > iflock > set XFS_IRECLAIM > ifunlock (disposal) > iunlock > radix delete > ilock cycle (drain) > rcu free > > What we're trying to eliminate is the ilock cycle to drain any > concurrent unreferenced lookups from accessing the inode once it is > freed. The free itself is still RCU protected. 
> > Looking over at the ifree path, we now have something like this: > > rcu_read_lock() > radix lookup > check XFS_IRECLAIM > ilock > if XFS_ISTALE, skip > set XFS_ISTALE > rcu_read_unlock() > iflock > /* return locked down inode */ You missed a lock. rcu_read_lock() radix lookup >>> i_flags_lock check XFS_IRECLAIM ilock if XFS_ISTALE, skip set XFS_ISTALE >>> i_flags_unlock rcu_read_unlock() iflock > Given that we set XFS_IRECLAIM under ilock, would we still need either > the ilock cycle or to hold ilock through the RCU free if the ifree side > (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the > rcu read lock)? We set XFS_IRECLAIM under the i_flags_lock. It is the combination of rcu_read_lock() and i_flags_lock() that provides the RCU lookup state barriers - the ILOCK is not part of that at all. The key point here is that once we've validated the inode we found in the radix tree under the i_flags_lock, we then take the ILOCK, thereby serialising the taking of the ILOCK here with the taking of the ILOCK in the reclaim isolation code. i.e. all the reclaim state serialisation is actually based around holding the i_flags_lock, not the ILOCK. Once we have grabbed the ILOCK under the i_flags_lock, we can drop the i_flags_lock knowing that reclaim will not be able to isolate this inode and set XFS_IRECLAIM.
I need to walk myself through the lookup state machine again and determine if ordering the XFS_IRECLAIM flag check after grabbing the ILOCK is sufficient to prevent ifree/iflush lookups from accessing the inode outside the rcu_read_lock() context. If so, most of this patch will go away.... > > + * attached to the buffer so we don't need to do anything more here. > > */ > > - if (ip != free_ip) { > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > > - rcu_read_unlock(); > > - delay(1); > > - goto retry; > > - } > > - > > - /* > > - * Check the inode number again in case we're racing with > > - * freeing in xfs_reclaim_inode(). See the comments in that > > - * function for more information as to why the initial check is > > - * not sufficient. > > - */ > > - if (ip->i_ino != inum) { > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { > > Is there a correctness reason for why we move the stale check to under > ilock (in both iflush/ifree)? It's under the i_flags_lock, and so I moved it up under the lookup hold of the i_flags_lock so we don't need to cycle it again. > > /* > > - * We don't need to attach clean inodes or those only with unlogged > > - * changes (which we throw away, anyway). > > + * We don't need to attach clean inodes to the buffer - they are marked > > + * stale in memory now and will need to be re-initialised by inode > > + * allocation before they can be reused. > > */ > > if (!ip->i_itemp || xfs_inode_clean(ip)) { > > ASSERT(ip != free_ip); > > xfs_ifunlock(ip); > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > + if (ip != free_ip) > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > There's an assert against this case just above, though I suppose there's > nothing wrong with just keeping it and making the functional code more cautious. *nod* It follows Darrick's lead of making sure that production kernels don't do something stupid because of some whacky corruption we didn't expect to ever see. Cheers, Dave.
On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote: > > Can we tie these into the proper locking interface using flags? For > > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER) > > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps? > > I'd prefer not to make this part of the common locking interface - > it's a one off special use case, not something we want to progate > elsewhere into the code. > > Now that I think over it, I probably should have tagged this with > patch with [RFC]. I think we should just get rid of the mrlock > wrappers rather than add more, and that would simplify this a lot. Yes, killing off the mrlock wrappers would be very helpful. The only thing we use them for is asserts on the locking state. We could either switch to lockdep_assert_held*, or just open code the write locked bit. While it is a little more ugly I'd tend towards the latter given that the locking asserts are too useful to require lockdep builds with their performance impact.
On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote: > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote: > > On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > Looking up an unreferenced inode in the inode cache is a bit hairy. > > > We do this for inode invalidation and writeback clustering purposes, > > > which is all invisible to the VFS. Hence we can't take reference > > > counts to the inode and so must be very careful how we do it. > > > > > > There are several different places that all do the lookups and > > > checks slightly differently. Fundamentally, though, they are all > > > racy and inode reclaim has to block waiting for the inode lock if it > > > loses the race. This is not very optimal given all the work we;ve > > > already done to make reclaim non-blocking. > > > > > > We can make the reclaim process nonblocking with a couple of simple > > > changes. If we define the unreferenced lookup process in a way that > > > will either always grab an inode in a way that reclaim will notice > > > and skip, or will notice a reclaim has grabbed the inode so it can > > > skip the inode, then there is no need for reclaim to need to cycle > > > the inode ILOCK at all. > > > > > > Selecting an inode for reclaim is already non-blocking, so if the > > > ILOCK is held the inode will be skipped. If we ensure that reclaim > > > holds the ILOCK until the inode is freed, then we can do the same > > > thing in the unreferenced lookup to avoid inodes in reclaim. We can > > > do this simply by holding the ILOCK until the RCU grace period > > > expires and the inode freeing callback is run. As all unreferenced > > > lookups have to hold the rcu_read_lock(), we are guaranteed that > > > a reclaimed inode will be noticed as the trylock will fail. > > > > > ... 
> > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/mrlock.h | 27 +++++++++ > > > fs/xfs/xfs_icache.c | 88 +++++++++++++++++++++-------- > > > fs/xfs/xfs_inode.c | 131 +++++++++++++++++++++----------------------- > > > 3 files changed, 153 insertions(+), 93 deletions(-) > > > > > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h > > > index 79155eec341b..1752a2592bcc 100644 > > > --- a/fs/xfs/mrlock.h > > > +++ b/fs/xfs/mrlock.h > > ... > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index 11bf4768d491..45ee3b5cd873 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -106,6 +106,7 @@ xfs_inode_free_callback( > > > ip->i_itemp = NULL; > > > } > > > > > > + mrunlock_excl_non_owner(&ip->i_lock); > > > kmem_zone_free(xfs_inode_zone, ip); > > > } > > > > > > @@ -132,6 +133,7 @@ xfs_inode_free( > > > * free state. The ip->i_flags_lock provides the barrier against lookup > > > * races. > > > */ > > > + mrupdate_non_owner(&ip->i_lock); > > > > Can we tie these into the proper locking interface using flags? For > > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER) > > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps? > > I'd prefer not to make this part of the common locking interface - > it's a one off special use case, not something we want to progate > elsewhere into the code. > What urks me about this is that it obfuscates rather than highlights that fact because I have no idea what mrtryupdate_non_owner() actually does without looking it up. We could easily name a flag XFS_ILOCK_PENDING_RECLAIM or something similarly ridiculous to make it blindingly obvious it should only be used in a special context. > Now that I think over it, I probably should have tagged this with > patch with [RFC]. I think we should just get rid of the mrlock > wrappers rather than add more, and that would simplify this a lot. 
> Yeah, FWIW I've been reviewing this patch as a WIP on top of all of the nonblocking bits as opposed to being some fundamental part of that work. That aside, I also agree that cleaning up these wrappers might address that concern because something like: /* open code ilock because ... */ down_write_trylock_non_owner(&ip->i_lock); ... is at least readable code. > > > > spin_lock(&ip->i_flags_lock); > > > ip->i_flags = XFS_IRECLAIM; > > > ip->i_ino = 0; > > > @@ -295,11 +297,24 @@ xfs_iget_cache_hit( > > > } > > > > > > /* > > > - * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode > > > - * from stomping over us while we recycle the inode. Remove it > > > - * from the LRU straight away so we can re-init the VFS inode. > > > + * Before we reinitialise the inode, we need to make sure > > > + * reclaim does not pull it out from underneath us. We already > > > + * hold the i_flags_lock, and because the XFS_IRECLAIM is not > > > + * set we know the inode is still on the LRU. However, the LRU > > > + * code may have just selected this inode to reclaim, so we need > > > + * to ensure we hold the i_flags_lock long enough for the > > > + * trylock in xfs_inode_reclaim_isolate() to fail. We do this by > > > + * removing the inode from the LRU, which will spin on the LRU > > > + * list locks until reclaim stops walking, at which point we > > > + * know there is no possible race between reclaim isolation and > > > + * this lookup. > > > + * > > > > Somewhat related to my question about the lru_lock on the earlier patch. > > *nod* > > The caveat here is that this is the slow path so spinning for a > while doesn't really matter. > > > > @@ -1022,19 +1076,7 @@ xfs_dispose_inode( > > > spin_unlock(&pag->pag_ici_lock); > > > xfs_perag_put(pag); > > > > > > - /* > > > - * Here we do an (almost) spurious inode lock in order to coordinate > > > - * with inode cache radix tree lookups. 
This is because the lookup > > > - * can reference the inodes in the cache without taking references. > > > - * > > > - * We make that OK here by ensuring that we wait until the inode is > > > - * unlocked after the lookup before we go ahead and free it. > > > - * > > > - * XXX: need to check this is still true. Not sure it is. > > > - */ > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > xfs_qm_dqdetach(ip); > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > Ok, so I'm staring at this a bit more and think I'm missing something. > > If we put aside the change to hold ilock until the inode is freed, we > > basically have the following (simplified) flow as the inode goes from > > isolation to disposal: > > > > ilock (isolate) > > iflock > > set XFS_IRECLAIM > > ifunlock (disposal) > > iunlock > > radix delete > > ilock cycle (drain) > > rcu free > > > > What we're trying to eliminate is the ilock cycle to drain any > > concurrent unreferenced lookups from accessing the inode once it is > > freed. The free itself is still RCU protected. > > > > Looking over at the ifree path, we now have something like this: > > > > rcu_read_lock() > > radix lookup > > check XFS_IRECLAIM > > ilock > > if XFS_ISTALE, skip > > set XFS_ISTALE > > rcu_read_unlock() > > iflock > > /* return locked down inode */ > > You missed a lock. > > rcu_read_lock() > radix lookup > >>> i_flags_lock > check XFS_IRECLAIM > ilock > if XFS_ISTALE, skip > set XFS_ISTALE > >>> i_flags_unlock > rcu_read_unlock() > iflock > > > Given that we set XFS_IRECLAIM under ilock, would we still need either > > the ilock cycle or to hold ilock through the RCU free if the ifree side > > (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the > > rcu read lock)? > > We set XFS_IRECLAIM under the i_flags_lock. > > It is the combination of rcu_read_lock() and i_flags_lock() that > provides the RCU lookup state barriers - the ILOCK is not part of > that at all. 
> > The key point here is that once we've validated the inode we found > in the radix tree under the i_flags_lock, we then take the ILOCK, > thereby serialising the taking of the ILOCK here with the taking of > the ILOCK in the reclaim isolation code. > > i.e. all the reclaim state serialisation is actually based around > holding the i_flags_lock, not the ILOCK. > > Once we have grabbed the ILOCK under the i_flags_lock, we can > drop the i_flags_lock knowing that reclaim will not be able isolate > this inode and set XFS_IRECLAIM. > Hmm, Ok. I knew i_flags_lock was in there when I wrote this up. I intentionally left it out as a simplification because it wasn't clear to me that it was a critical part of the lookup. I'll keep this in mind the next time I walk through this code. > > ISTM we should either have a non-reclaim inode with > > ilock protection or a reclaim inode with RCU protection (so we can skip > > it before it frees), but I could easily be missing something here.. > > Heh. Yeah, it's a complex dance, and it's all based around how > RCU lookups and the i_flags_lock interact to provide coherent > detection of freed inodes. > > I have a nagging feeling that this whole ILOCK-held-to-rcu-free game > can be avoided. I need to walk myself through the lookup state > machine again and determine if ordering the XFS_IRECLAIM flag check > after greabbing the ILOCK is sufficient to prevent ifree/iflush > lookups from accessing the inode outside the rcu_read_lock() > context. > That's pretty much what I was wondering... > If so, most of this patch will go away.... > > > > + * attached to the buffer so we don't need to do anything more here. > > > */ > > > - if (ip != free_ip) { > > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > > > - rcu_read_unlock(); > > > - delay(1); > > > - goto retry; > > > - } > > > - > > > - /* > > > - * Check the inode number again in case we're racing with > > > - * freeing in xfs_reclaim_inode(). 
See the comments in that > > > - * function for more information as to why the initial check is > > > - * not sufficient. > > > - */ > > > - if (ip->i_ino != inum) { > > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { > > > > Is there a correctness reason for why we move the stale check to under > > ilock (in both iflush/ifree)? > > It's under the i_flags_lock, and so I moved it up under the lookup > hold of the i_flags_lock so we don't need to cycle it again. > Yeah, but in both cases it looks like it moved to under the ilock as well, which comes after i_flags_lock. IOW, why grab ilock for stale inodes when we're just going to skip them? Brian > > > /* > > > - * We don't need to attach clean inodes or those only with unlogged > > > - * changes (which we throw away, anyway). > > > + * We don't need to attach clean inodes to the buffer - they are marked > > > + * stale in memory now and will need to be re-initialised by inode > > > + * allocation before they can be reused. > > > */ > > > if (!ip->i_itemp || xfs_inode_clean(ip)) { > > > ASSERT(ip != free_ip); > > > xfs_ifunlock(ip); > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > + if (ip != free_ip) > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > There's an assert against this case just above, though I suppose there's > > nothing wrong with just keeping it and making the functional code more > > cautious. > > *nod* > > It follows Darrick's lead of making sure that production kernels > don't do something stupid because of some whacky corruption we > didn't expect to ever see. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote: > On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote: > > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote: > > If so, most of this patch will go away.... > > > > > > + * attached to the buffer so we don't need to do anything more here. > > > > */ > > > > - if (ip != free_ip) { > > > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > > > > - rcu_read_unlock(); > > > > - delay(1); > > > > - goto retry; > > > > - } > > > > - > > > > - /* > > > > - * Check the inode number again in case we're racing with > > > > - * freeing in xfs_reclaim_inode(). See the comments in that > > > > - * function for more information as to why the initial check is > > > > - * not sufficient. > > > > - */ > > > > - if (ip->i_ino != inum) { > > > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { > > > > > > Is there a correctness reason for why we move the stale check to under > > > ilock (in both iflush/ifree)? > > > > It's under the i_flags_lock, and so I moved it up under the lookup > > hold of the i_flags_lock so we don't need to cycle it again. > > > > Yeah, but in both cases it looks like it moved to under the ilock as > well, which comes after i_flags_lock. IOW, why grab ilock for stale > inodes when we're just going to skip them? Because I was worrying about serialising against reclaim before changing the state of the inode. i.e. if the inode has already been isolated but not yet disposed of, we shouldn't touch the inode state at all. Serialisation against reclaim in this patch is via the ILOCK, hence we need to do that before setting ISTALE.... IOWs, ISTALE is not protected by ILOCK, we just can't modify the inode state until after we've gained the ILOCK to protect against reclaim.... Cheers, Dave.
On Mon, Nov 18, 2019 at 12:00:47PM +1100, Dave Chinner wrote: > On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote: > > On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote: > > > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote: > > > If so, most of this patch will go away.... > > > > > > > > + * attached to the buffer so we don't need to do anything more here. > > > > > */ > > > > > - if (ip != free_ip) { > > > > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > > > > > - rcu_read_unlock(); > > > > > - delay(1); > > > > > - goto retry; > > > > > - } > > > > > - > > > > > - /* > > > > > - * Check the inode number again in case we're racing with > > > > > - * freeing in xfs_reclaim_inode(). See the comments in that > > > > > - * function for more information as to why the initial check is > > > > > - * not sufficient. > > > > > - */ > > > > > - if (ip->i_ino != inum) { > > > > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { > > > > > > > > Is there a correctness reason for why we move the stale check to under > > > > ilock (in both iflush/ifree)? > > > > > > It's under the i_flags_lock, and so I moved it up under the lookup > > > hold of the i_flags_lock so we don't need to cycle it again. > > > > > > > Yeah, but in both cases it looks like it moved to under the ilock as > > well, which comes after i_flags_lock. IOW, why grab ilock for stale > > inodes when we're just going to skip them? > > Because I was worrying about serialising against reclaim before > changing the state of the inode. i.e. if the inode has already been > isolated by not yet disposed of, we shouldn't touch the inode state > at all. Serialisation against reclaim in this patch is via the > ILOCK, hence we need to do that before setting ISTALE.... > Yeah, I think my question still isn't clear... I'm not talking about setting ISTALE. The code I referenced above is where we test for it and skip the inode if it is already set. 
For example, the code referenced above in xfs_ifree_get_one_inode() currently does the following with respect to i_flags_lock, ILOCK and XFS_ISTALE: ... spin_lock(i_flags_lock) xfs_ilock_nowait(XFS_ILOCK_EXCL) if !XFS_ISTALE skip set XFS_ISTALE ... The reclaim isolate code does this, however: spin_trylock(i_flags_lock) if !XFS_ISTALE skip xfs_ilock(XFS_ILOCK_EXCL) ... So my question is why not do something like the following in the _get_one_inode() case? ... spin_lock(i_flags_lock) if !XFS_ISTALE skip xfs_ilock_nowait(XFS_ILOCK_EXCL) set XFS_ISTALE ... IOW, what is the need, if any, to acquire ilock in the iflush/ifree paths before testing for XFS_ISTALE? Is there some specific intermediate state I'm missing or is this just unintentional? The reason I ask is ilock failure triggers that ugly delay(1) and retry thing, so it seems slightly weird to allow that for a stale inode we're ultimately going to skip (regardless of whether that would actually ever occur). Brian > IOWs, ISTALE is not protected by ILOCK, we just can't modify the > inode state until after we've gained the ILOCK to protect against > reclaim.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
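For concreteness, the reordering Brian is asking about would look something like the following against the patched xfs_ifree_get_one_inode() shown further down. This is an untested, illustrative sketch of the ordering in question, not a proposed patch:

```c
	/* Sketch: ISTALE tested under i_flags_lock, before the ILOCK */
	spin_lock(&ip->i_flags_lock);
	if (!ip->i_ino || ip->i_ino != inum ||
	    __xfs_iflags_test(ip, XFS_IRECLAIM) ||
	    __xfs_iflags_test(ip, XFS_ISTALE))	/* already staled: skip */
		goto out_iflags_unlock;

	if (ip != free_ip && !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
		spin_unlock(&ip->i_flags_lock);
		rcu_read_unlock();
		delay(1);
		goto retry;
	}
	__xfs_iflags_set(ip, XFS_ISTALE);
	spin_unlock(&ip->i_flags_lock);
```

The point of the question is visible here: an already-stale inode never pays for the ILOCK trylock, and therefore can never trigger the delay(1)/retry path.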
On Tue, Nov 19, 2019 at 10:13:44AM -0500, Brian Foster wrote: > On Mon, Nov 18, 2019 at 12:00:47PM +1100, Dave Chinner wrote: > > On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote: > > > On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote: > > > > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote: > > > > If so, most of this patch will go away.... > > > > > > > > > > + * attached to the buffer so we don't need to do anything more here. > > > > > > */ > > > > > > - if (ip != free_ip) { > > > > > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > > > > > > - rcu_read_unlock(); > > > > > > - delay(1); > > > > > > - goto retry; > > > > > > - } > > > > > > - > > > > > > - /* > > > > > > - * Check the inode number again in case we're racing with > > > > > > - * freeing in xfs_reclaim_inode(). See the comments in that > > > > > > - * function for more information as to why the initial check is > > > > > > - * not sufficient. > > > > > > - */ > > > > > > - if (ip->i_ino != inum) { > > > > > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { > > > > > > > > > > Is there a correctness reason for why we move the stale check to under > > > > > ilock (in both iflush/ifree)? > > > > > > > > It's under the i_flags_lock, and so I moved it up under the lookup > > > > hold of the i_flags_lock so we don't need to cycle it again. > > > > > > > > > > Yeah, but in both cases it looks like it moved to under the ilock as > > > well, which comes after i_flags_lock. IOW, why grab ilock for stale > > > inodes when we're just going to skip them? > > > > Because I was worrying about serialising against reclaim before > > changing the state of the inode. i.e. if the inode has already been > > isolated by not yet disposed of, we shouldn't touch the inode state > > at all. Serialisation against reclaim in this patch is via the > > ILOCK, hence we need to do that before setting ISTALE.... > > > > Yeah, I think my question still isn't clear... 
I'm not talking about > setting ISTALE. The code I referenced above is where we test for it and > skip the inode if it is already set. For example, the code referenced > above in xfs_ifree_get_one_inode() currently does the following with > respect to i_flags_lock, ILOCK and XFS_ISTALE: > > ... > spin_lock(i_flags_lock) > xfs_ilock_nowait(XFS_ILOCK_EXCL) > if !XFS_ISTALE > skip > set XFS_ISTALE > ... There is another place in xfs_ifree_cluster that sets ISTALE without the ILOCK held, so the ILOCK is being used here for a different purpose... > The reclaim isolate code does this, however: > > spin_trylock(i_flags_lock) > if !XFS_ISTALE > skip > xfs_ilock(XFS_ILOCK_EXCL) > ... Which is fine, because we're not trying to avoid racing with reclaim here. :) i.e. all we need is the i_flags lock to check the ISTALE flag safely. > So my question is why not do something like the following in the > _get_one_inode() case? > > ... > spin_lock(i_flags_lock) > if !XFS_ISTALE > skip > xfs_ilock_nowait(XFS_ILOCK_EXCL) > set XFS_ISTALE > ... Because, like I said, I focussed on the lookup racing with reclaim first. The above code could be used, but it puts object internal state checks before we really know whether the object is safe to access and whether we can trust it. I'm just following a basic RCU/lockless lookup principle here: don't try to use object state before you've fully validated that the object is live and guaranteed that it can be safely referenced. > IOW, what is the need, if any, to acquire ilock in the iflush/ifree > paths before testing for XFS_ISTALE? Is there some specific intermediate > state I'm missing or is this just unintentional? It's entirely intentional - validate and claim the object we've found in the lockless lookup, then run the code that checks/changes the object state. Smashing state checks and lockless lookup validation together is a nasty landmine to leave behind... Cheers, Dave.
On Wed, Nov 20, 2019 at 08:18:34AM +1100, Dave Chinner wrote: > On Tue, Nov 19, 2019 at 10:13:44AM -0500, Brian Foster wrote: > > On Mon, Nov 18, 2019 at 12:00:47PM +1100, Dave Chinner wrote: > > > On Fri, Nov 15, 2019 at 12:26:00PM -0500, Brian Foster wrote: > > > > On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote: > > > > > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote: > > > > > If so, most of this patch will go away.... > > > > > > > > > > > > + * attached to the buffer so we don't need to do anything more here. > > > > > > > */ > > > > > > > - if (ip != free_ip) { > > > > > > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > > > > > > > - rcu_read_unlock(); > > > > > > > - delay(1); > > > > > > > - goto retry; > > > > > > > - } > > > > > > > - > > > > > > > - /* > > > > > > > - * Check the inode number again in case we're racing with > > > > > > > - * freeing in xfs_reclaim_inode(). See the comments in that > > > > > > > - * function for more information as to why the initial check is > > > > > > > - * not sufficient. > > > > > > > - */ > > > > > > > - if (ip->i_ino != inum) { > > > > > > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { > > > > > > > > > > > > Is there a correctness reason for why we move the stale check to under > > > > > > ilock (in both iflush/ifree)? > > > > > > > > > > It's under the i_flags_lock, and so I moved it up under the lookup > > > > > hold of the i_flags_lock so we don't need to cycle it again. > > > > > > > > > > > > > Yeah, but in both cases it looks like it moved to under the ilock as > > > > well, which comes after i_flags_lock. IOW, why grab ilock for stale > > > > inodes when we're just going to skip them? > > > > > > Because I was worrying about serialising against reclaim before > > > changing the state of the inode. i.e. if the inode has already been > > > isolated by not yet disposed of, we shouldn't touch the inode state > > > at all. 
Serialisation against reclaim in this patch is via the > > > ILOCK, hence we need to do that before setting ISTALE.... > > > > > > > Yeah, I think my question still isn't clear... I'm not talking about > > setting ISTALE. The code I referenced above is where we test for it and > > skip the inode if it is already set. For example, the code referenced > > above in xfs_ifree_get_one_inode() currently does the following with > > respect to i_flags_lock, ILOCK and XFS_ISTALE: > > > > ... > > spin_lock(i_flags_lock) > > xfs_ilock_nowait(XFS_ILOCK_EXCL) > > if !XFS_ISTALE > > skip > > set XFS_ISTALE > > ... > > There is another place in xfs_ifree_cluster that sets ISTALE without > the ILOCK held, so the ILOCK is being used here for a different > purpose... > > > The reclaim isolate code does this, however: > > > > spin_trylock(i_flags_lock) > > if !XFS_ISTALE > > skip > > xfs_ilock(XFS_ILOCK_EXCL) > > ... > > Which is fine, because we're not trying to avoid racing with reclaim > here. :) i.e. all we need is the i_flags lock to check the ISTALE > flag safely. > > > So my question is why not do something like the following in the > > _get_one_inode() case? > > > > ... > > spin_lock(i_flags_lock) > > if !XFS_ISTALE > > skip > > xfs_ilock_nowait(XFS_ILOCK_EXCL) > > set XFS_ISTALE > > ... > > Because, like I said, I focussed on the lookup racing with reclaim > first. The above code could be used, but it puts object internal > state checks before we really know whether the object is safe to > access and whether we can trust it. > > I'm just following a basic RCU/lockless lookup principle here: > don't try to use object state before you've fully validated that the > object is live and guaranteed that it can be safely referenced. > > > IOW, what is the need, if any, to acquire ilock in the iflush/ifree > > paths before testing for XFS_ISTALE? Is there some specific intermediate > > state I'm missing or is this just unintentional? 
> > It's entirely intentional - validate and claim the object we've > found in the lockless lookup, then run the code that checks/changes > the object state. Smashing state checks and lockless lookup > validation together is a nasty landmine to leave behind... > Ok, so this is intentional, but the purpose is simplification vs. technically being part of the lookup dance. I'm not sure I see the advantage given that IMO this trades off one landmine for another, but I'm not worried that much about it as long as the code is correct. I guess we'll see how things change after reevaluation of the whole holding ilock across contexts behavior, but if we do end up with a similar pattern in the iflush/ifree paths please document that explicitly in the comments. Otherwise in a patch that swizzles this code around and explicitly plays games with ilock, the intent of this particular change is not clear to somebody reading the code IMO. In fact, I think it might be interesting to see if we could define a couple helpers (located closer to the reclaim code) to perform an unreferenced lookup/release of an inode, but that is secondary to nailing down the fundamental rules. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
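For reference, the helper pair Brian suggests might look something like the following, simply encoding the unreferenced-lookup rules documented in the xfs_dispose_inode() comment in the patch below. The function names are invented and this is an untested, kernel-style sketch only:

```c
/*
 * Hypothetical unreferenced inode lookup: returns the inode pinned by
 * the ILOCK, or NULL if it is being reclaimed or cannot be locked.
 * The caller must not take a reference; the pin lasts only until
 * xfs_iput_unreferenced() drops the ILOCK.
 */
static struct xfs_inode *
xfs_iget_unreferenced(
	struct xfs_perag	*pag,
	xfs_ino_t		ino,
	uint			lock_flags)
{
	struct xfs_inode	*ip;

	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root,
			       XFS_INO_TO_AGINO(pag->pag_mount, ino));
	if (!ip)
		goto out_rcu_unlock;

	spin_lock(&ip->i_flags_lock);
	if (!ip->i_ino || ip->i_ino != ino ||
	    __xfs_iflags_test(ip, XFS_IRECLAIM) ||
	    !xfs_ilock_nowait(ip, lock_flags)) {
		/* reclaim owns it, or someone else holds the ILOCK */
		spin_unlock(&ip->i_flags_lock);
		ip = NULL;
		goto out_rcu_unlock;
	}
	spin_unlock(&ip->i_flags_lock);
out_rcu_unlock:
	rcu_read_unlock();
	return ip;		/* pinned against reclaim by the ILOCK */
}

static inline void
xfs_iput_unreferenced(
	struct xfs_inode	*ip,
	uint			lock_flags)
{
	xfs_iunlock(ip, lock_flags);
}
```

Centralising the dance like this would at least put the validation rules in one place next to the reclaim code, which is the documentation benefit Brian is after.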
diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h index 79155eec341b..1752a2592bcc 100644 --- a/fs/xfs/mrlock.h +++ b/fs/xfs/mrlock.h @@ -75,4 +75,31 @@ static inline void mrdemote(mrlock_t *mrp) downgrade_write(&mrp->mr_lock); } +/* special locking cases for inode reclaim */ +static inline void mrupdate_non_owner(mrlock_t *mrp) +{ + down_write_non_owner(&mrp->mr_lock); +#if defined(DEBUG) || defined(XFS_WARN) + mrp->mr_writer = 1; +#endif +} + +static inline int mrtryupdate_non_owner(mrlock_t *mrp) +{ + if (!down_write_trylock_non_owner(&mrp->mr_lock)) + return 0; +#if defined(DEBUG) || defined(XFS_WARN) + mrp->mr_writer = 1; +#endif + return 1; +} + +static inline void mrunlock_excl_non_owner(mrlock_t *mrp) +{ +#if defined(DEBUG) || defined(XFS_WARN) + mrp->mr_writer = 0; +#endif + up_write_non_owner(&mrp->mr_lock); +} + #endif /* __XFS_SUPPORT_MRLOCK_H__ */ diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 11bf4768d491..45ee3b5cd873 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -106,6 +106,7 @@ xfs_inode_free_callback( ip->i_itemp = NULL; } + mrunlock_excl_non_owner(&ip->i_lock); kmem_zone_free(xfs_inode_zone, ip); } @@ -132,6 +133,7 @@ xfs_inode_free( * free state. The ip->i_flags_lock provides the barrier against lookup * races. */ + mrupdate_non_owner(&ip->i_lock); spin_lock(&ip->i_flags_lock); ip->i_flags = XFS_IRECLAIM; ip->i_ino = 0; @@ -295,11 +297,24 @@ xfs_iget_cache_hit( } /* - * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode - * from stomping over us while we recycle the inode. Remove it - * from the LRU straight away so we can re-init the VFS inode. + * Before we reinitialise the inode, we need to make sure + * reclaim does not pull it out from underneath us. We already + * hold the i_flags_lock, and because the XFS_IRECLAIM is not + * set we know the inode is still on the LRU. 
However, the LRU + * code may have just selected this inode to reclaim, so we need + * to ensure we hold the i_flags_lock long enough for the + * trylock in xfs_inode_reclaim_isolate() to fail. We do this by + * removing the inode from the LRU, which will spin on the LRU + * list locks until reclaim stops walking, at which point we + * know there is no possible race between reclaim isolation and + * this lookup. + * + * We also set the XFS_IRECLAIM flag here while trying to do the + * re-initialisation to prevent multiple racing lookups on this + * inode from all landing here at the same time. */ ip->i_flags |= XFS_IRECLAIM; + list_lru_del(&mp->m_inode_lru, &inode->i_lru); spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); @@ -314,6 +329,7 @@ xfs_iget_cache_hit( spin_lock(&ip->i_flags_lock); wake = !!__xfs_iflags_test(ip, XFS_INEW); ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM); + list_lru_add(&mp->m_inode_lru, &inode->i_lru); if (wake) wake_up_bit(&ip->i_flags, __XFS_INEW_BIT); ASSERT(ip->i_flags & XFS_IRECLAIMABLE); @@ -330,7 +346,6 @@ xfs_iget_cache_hit( spin_lock(&ip->i_flags_lock); ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS; ip->i_flags |= XFS_INEW; - list_lru_del(&mp->m_inode_lru, &inode->i_lru); inode->i_state = I_NEW; ip->i_sick = 0; ip->i_checked = 0; @@ -610,8 +625,7 @@ xfs_icache_inode_is_allocated( /* * The inode lookup is done in batches to keep the amount of lock traffic and * radix tree lookups to a minimum. The batch size is a trade off between - * lookup reduction and stack usage. This is in the reclaim path, so we can't - * be too greedy. + * lookup reduction and stack usage. */ #define XFS_LOOKUP_BATCH 32 @@ -916,8 +930,13 @@ xfs_inode_reclaim_isolate( goto out_unlock_flags; } + /* + * If we are going to reclaim this inode, it will be unlocked by the + * RCU callback and so is in a different context. Hence we need to use + * special non-owner trylock semantics for XFS_ILOCK_EXCL here. 
+ */ ret = LRU_SKIP; - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) + if (!mrtryupdate_non_owner(&ip->i_lock)) goto out_unlock_flags; if (!__xfs_iflock_nowait(ip)) { @@ -960,7 +979,7 @@ xfs_inode_reclaim_isolate( out_ifunlock: __xfs_ifunlock(ip); out_unlock_inode: - xfs_iunlock(ip, XFS_ILOCK_EXCL); + mrunlock_excl_non_owner(&ip->i_lock); out_unlock_flags: spin_unlock(&ip->i_flags_lock); @@ -969,6 +988,41 @@ xfs_inode_reclaim_isolate( return ret; } +/* + * We are passed a locked inode to dispose of. + * + * To avoid race conditions with lookups that don't take references, we do + * not drop the XFS_ILOCK_EXCL until the RCU callback that frees the inode. + * This means that any attempt to lock the inode during the current RCU grace + * period will fail, and hence we do not need any synchronisation here to wait + * for code that pins unreferenced inodes with the XFS_ILOCK to drain. + * + * This requires code that needs such pins to do the following under a single + * rcu_read_lock() context: + * + * - rcu_read_lock + * - find the inode via radix tree lookup + * - take the ip->i_flags_lock + * - check ip->i_ino != 0 + * - check XFS_IRECLAIM is not set + * - call xfs_ilock_nowait(ip, XFS_ILOCK_[SHARED|EXCL]) to lock the inode + * - drop ip->i_flags_lock + * - rcu_read_unlock() + * + * Only if all this succeeds does the caller have the inode locked and + * protected against it being freed until the ilock is released. If the + * XFS_IRECLAIM flag + * is set or xfs_ilock_nowait() fails, then the caller must either skip the + * inode and move on to the next inode (gang lookup) or drop the rcu_read_lock + * and start the entire inode lookup process again (individual lookup). + * + * This works because i_flags_lock serialises against + * xfs_inode_reclaim_isolate() - if the lookup wins the race on i_flags_lock and + * XFS_IRECLAIM is not set, then it will be able to lock the inode and hold off + * reclaim. 
If the isolate function wins the race, it will lock the inode and + * set the XFS_IRECLAIM flag if it is going to free the inode and this will + * prevent the lookup callers from succeeding in getting unreferenced pin via + * the ILOCK. + */ static void xfs_dispose_inode( struct xfs_inode *ip) @@ -977,11 +1031,14 @@ xfs_dispose_inode( struct xfs_perag *pag; xfs_ino_t ino; + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); ASSERT(xfs_isiflocked(ip)); ASSERT(xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE) || XFS_FORCED_SHUTDOWN(mp)); ASSERT(ip->i_ino != 0); + XFS_STATS_INC(mp, xs_ig_reclaims); + /* * Process the shutdown reclaim work we deferred from the LRU isolation * callback before we go any further. @@ -1008,9 +1065,6 @@ xfs_dispose_inode( ip->i_flags = XFS_IRECLAIM; ip->i_ino = 0; spin_unlock(&ip->i_flags_lock); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - - XFS_STATS_INC(mp, xs_ig_reclaims); /* * Remove the inode from the per-AG radix tree. @@ -1022,19 +1076,7 @@ xfs_dispose_inode( spin_unlock(&pag->pag_ici_lock); xfs_perag_put(pag); - /* - * Here we do an (almost) spurious inode lock in order to coordinate - * with inode cache radix tree lookups. This is because the lookup - * can reference the inodes in the cache without taking references. - * - * We make that OK here by ensuring that we wait until the inode is - * unlocked after the lookup before we go ahead and free it. - * - * XXX: need to check this is still true. Not sure it is. 
- */ - xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_qm_dqdetach(ip); - xfs_iunlock(ip, XFS_ILOCK_EXCL); __xfs_inode_free(ip); } diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 33edb18098ca..5c0be82195fc 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2538,60 +2538,63 @@ xfs_ifree_get_one_inode( if (!ip) goto out_rcu_unlock; + + spin_lock(&ip->i_flags_lock); + if (!ip->i_ino || ip->i_ino != inum || + __xfs_iflags_test(ip, XFS_IRECLAIM)) + goto out_iflags_unlock; + /* - * because this is an RCU protected lookup, we could find a recently - * freed or even reallocated inode during the lookup. We need to check - * under the i_flags_lock for a valid inode here. Skip it if it is not - * valid, the wrong inode or stale. + * We've got the right inode and it isn't in reclaim but it might be + * locked by someone else. In that case, we retry the inode rather than + * skipping it completely as we have to process it with the cluster + * being freed. */ - spin_lock(&ip->i_flags_lock); - if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) { + if (ip != free_ip && !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { spin_unlock(&ip->i_flags_lock); - goto out_rcu_unlock; + rcu_read_unlock(); + delay(1); + goto retry; } - spin_unlock(&ip->i_flags_lock); /* - * Don't try to lock/unlock the current inode, but we _cannot_ skip the - * other inodes that we did not find in the list attached to the buffer - * and are not already marked stale. If we can't lock it, back off and - * retry. + * Inode is now pinned against reclaim until we unlock it. If the inode + * is already marked stale, then it has already been flush locked and + * attached to the buffer so we don't need to do anything more here. */ - if (ip != free_ip) { - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { - rcu_read_unlock(); - delay(1); - goto retry; - } - - /* - * Check the inode number again in case we're racing with - * freeing in xfs_reclaim_inode(). 
See the comments in that - * function for more information as to why the initial check is - * not sufficient. - */ - if (ip->i_ino != inum) { + if (__xfs_iflags_test(ip, XFS_ISTALE)) { + if (ip != free_ip) xfs_iunlock(ip, XFS_ILOCK_EXCL); - goto out_rcu_unlock; - } + goto out_iflags_unlock; } + __xfs_iflags_set(ip, XFS_ISTALE); + spin_unlock(&ip->i_flags_lock); rcu_read_unlock(); + /* + * The flush lock will now hold off inode reclaim until the buffer + * completion routine runs the xfs_istale_done callback and unlocks the + * flush lock. Hence the caller can safely drop the ILOCK when it is + * done attaching the inode to the cluster buffer. + */ xfs_iflock(ip); - xfs_iflags_set(ip, XFS_ISTALE); /* - * We don't need to attach clean inodes or those only with unlogged - * changes (which we throw away, anyway). + * We don't need to attach clean inodes to the buffer - they are marked + * stale in memory now and will need to be re-initialised by inode + * allocation before they can be reused. */ if (!ip->i_itemp || xfs_inode_clean(ip)) { ASSERT(ip != free_ip); xfs_ifunlock(ip); - xfs_iunlock(ip, XFS_ILOCK_EXCL); + if (ip != free_ip) + xfs_iunlock(ip, XFS_ILOCK_EXCL); goto out_no_inode; } return ip; +out_iflags_unlock: + spin_unlock(&ip->i_flags_lock); out_rcu_unlock: rcu_read_unlock(); out_no_inode: @@ -3519,44 +3522,40 @@ xfs_iflush_cluster( continue; /* - * because this is an RCU protected lookup, we could find a - * recently freed or even reallocated inode during the lookup. - * We need to check under the i_flags_lock for a valid inode - * here. Skip it if it is not valid or the wrong inode. + * See xfs_dispose_inode() for an explanation of the + * tests here to avoid inode reclaim races. 
*/ spin_lock(&cip->i_flags_lock); if (!cip->i_ino || - __xfs_iflags_test(cip, XFS_ISTALE)) { + __xfs_iflags_test(cip, XFS_IRECLAIM)) { spin_unlock(&cip->i_flags_lock); continue; } - /* - * Once we fall off the end of the cluster, no point checking - * any more inodes in the list because they will also all be - * outside the cluster. - */ + /* ILOCK will pin the inode against reclaim */ + if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) { + spin_unlock(&cip->i_flags_lock); + continue; + } + + if (__xfs_iflags_test(cip, XFS_ISTALE)) { + xfs_iunlock(cip, XFS_ILOCK_SHARED); + spin_unlock(&cip->i_flags_lock); + continue; + } + + /* Lookup can find inodes outside the cluster being flushed. */ if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) { + xfs_iunlock(cip, XFS_ILOCK_SHARED); spin_unlock(&cip->i_flags_lock); break; } spin_unlock(&cip->i_flags_lock); /* - * Do an un-protected check to see if the inode is dirty and - * is a candidate for flushing. These checks will be repeated - * later after the appropriate locks are acquired. - */ - if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0) - continue; - - /* - * Try to get locks. If any are unavailable or it is pinned, + * If we can't get the flush lock now or the inode is pinned, * then this inode cannot be flushed and is skipped. */ - - if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED)) - continue; if (!xfs_iflock_nowait(cip)) { xfs_iunlock(cip, XFS_ILOCK_SHARED); continue; @@ -3567,22 +3566,9 @@ xfs_iflush_cluster( continue; } - /* - * Check the inode number again, just to be certain we are not - * racing with freeing in xfs_reclaim_inode(). See the comments - * in that function for more information as to why the initial - * check is not sufficient. - */ - if (!cip->i_ino) { - xfs_ifunlock(cip); - xfs_iunlock(cip, XFS_ILOCK_SHARED); - continue; - } - - /* - * arriving here means that this inode can be flushed. First - * re-check that it's dirty before flushing. 
+ * Arriving here means that this inode can be flushed. First + * check that it's dirty before flushing. */ if (!xfs_inode_clean(cip)) { int error; @@ -3596,6 +3582,7 @@ xfs_iflush_cluster( xfs_ifunlock(cip); } xfs_iunlock(cip, XFS_ILOCK_SHARED); + /* unsafe to reference cip from here */ } if (clcount) { @@ -3634,7 +3621,11 @@ xfs_iflush_cluster( xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); - /* abort the corrupt inode, as it was not attached to the buffer */ + /* + * Abort the corrupt inode, as it was not attached to the buffer. It is + * unlocked, but still pinned against reclaim by the flush lock so it is + * safe to reference here until after the flush abort completes. + */ xfs_iflush_abort(cip, false); kmem_free(cilist); xfs_perag_put(pag);