xfs: fix incorrect i_nlink caused by inode racing

Message ID 20221107143648.GA2013250@ceph-admin (mailing list archive)
State Superseded, archived
Series xfs: fix incorrect i_nlink caused by inode racing

Commit Message

Long Li Nov. 7, 2022, 2:36 p.m. UTC
The following error occurred during the fsstress test:

XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2925

The problem was that an inode race condition caused an incorrect i_nlink to
be written to disk and then read back into memory. Consider the following
call graph: for an inode that is marked as both XFS_IFLUSHING and
XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk
may be set to 1.

  xfsaild
      xfs_inode_item_push
          xfs_iflush_cluster
              xfs_iflush
                  xfs_inode_to_disk

  xfs_iget
      xfs_iget_cache_hit
          xfs_iget_recycle
              xfs_reinit_inode
                  inode_init_always

So skip inodes that are being flushed and marked as XFS_IRECLAIMABLE, to
prevent concurrent reads and writes to the inode.

Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_icache.c | 5 +++++
 1 file changed, 5 insertions(+)
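
[For context, the transient window described above comes from
xfs_reinit_inode() calling inode_init_always(), which resets i_nlink to 1
before the saved value is put back. A condensed sketch of that path,
paraphrased from fs/xfs/xfs_icache.c of that era (field handling
abbreviated; exact contents vary by kernel version):

	static int
	xfs_reinit_inode(
		struct xfs_mount	*mp,
		struct inode		*inode)
	{
		int		error;
		uint32_t	nlink = inode->i_nlink;	/* save VFS state */
		uint32_t	generation = inode->i_generation;
		uint64_t	version = inode_peek_iversion(inode);

		/*
		 * inode_init_always() resets i_nlink to 1; until the
		 * set_nlink() below runs, a concurrent xfs_iflush() can
		 * copy that transient value to the on-disk inode.
		 */
		error = inode_init_always(mp->m_super, inode);

		set_nlink(inode, nlink);	/* restore saved state */
		inode->i_generation = generation;
		inode_set_iversion_queried(inode, version);
		return error;
	}]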

Comments

Darrick J. Wong Nov. 7, 2022, 4:38 p.m. UTC | #1
On Mon, Nov 07, 2022 at 10:36:48PM +0800, Long Li wrote:
> The following error occurred during the fsstress test:

> XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2925

What kernel is this?  xfs_inode.c line 2925 is in the middle of
xfs_rename and doesn't have any assertions on nlink.

The only assertion on nlink in the entire xfs codebase is in xfs_remove,
and that's not what's going on here.

<confused>

--D

> The problem was that an inode race condition caused an incorrect i_nlink to
> be written to disk and then read back into memory. Consider the following
> call graph: for an inode that is marked as both XFS_IFLUSHING and
> XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
> value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk
> may be set to 1.
> 
>   xfsaild
>       xfs_inode_item_push
>           xfs_iflush_cluster
>               xfs_iflush
>                   xfs_inode_to_disk
> 
>   xfs_iget
>       xfs_iget_cache_hit
>           xfs_iget_recycle
>               xfs_reinit_inode
>                   inode_init_always
> 
> So skip inodes that are being flushed and marked as XFS_IRECLAIMABLE, to
> prevent concurrent reads and writes to the inode.
> 
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_icache.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index eae7427062cf..cc68b0ff50ce 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -514,6 +514,11 @@ xfs_iget_cache_hit(
>  	    (ip->i_flags & XFS_IRECLAIMABLE))
>  		goto out_skip;
>  
> +	/* Skip inodes that are being flushed */
> +	if ((ip->i_flags & XFS_IFLUSHING) &&
> +	    (ip->i_flags & XFS_IRECLAIMABLE))
> +		goto out_skip;
> +
>  	/* The inode fits the selection criteria; process it. */
>  	if (ip->i_flags & XFS_IRECLAIMABLE) {
>  		/* Drops i_flags_lock and RCU read lock. */
> -- 
> 2.31.1
>
Long Li Nov. 10, 2022, 1:42 a.m. UTC | #2
On Mon, Nov 07, 2022 at 08:38:45AM -0800, Darrick J. Wong wrote:
> On Mon, Nov 07, 2022 at 10:36:48PM +0800, Long Li wrote:
> > The following error occurred during the fsstress test:
> 
> > XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2925
> 
> What kernel is this?  xfs_inode.c line 2925 is in the middle of
> xfs_rename and doesn't have any assertions on nlink.
> 
> The only assertion on nlink in the entire xfs codebase is in xfs_remove,
> and that's not what's going on here.
> 
> <confused>

Sorry for the confusion. I found this issue on Linux 5.10, where the assertion
on nlink is in xfs_remove(). I've also reproduced it on the mainline kernel;
the probability of this problem is very low and it is very difficult to
reproduce. The mainline kernel assertion failure is as follows:

XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452

Thanks,
Long Li

> 
> --D
> 
> > The problem was that an inode race condition caused an incorrect i_nlink to
> > be written to disk and then read back into memory. Consider the following
> > call graph: for an inode that is marked as both XFS_IFLUSHING and
> > XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
> > value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk
> > may be set to 1.
> > 
> >   xfsaild
> >       xfs_inode_item_push
> >           xfs_iflush_cluster
> >               xfs_iflush
> >                   xfs_inode_to_disk
> > 
> >   xfs_iget
> >       xfs_iget_cache_hit
> >           xfs_iget_recycle
> >               xfs_reinit_inode
> >                   inode_init_always
> > 
> > So skip inodes that are being flushed and marked as XFS_IRECLAIMABLE, to
> > prevent concurrent reads and writes to the inode.
> > 
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/xfs/xfs_icache.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index eae7427062cf..cc68b0ff50ce 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -514,6 +514,11 @@ xfs_iget_cache_hit(
> >  	    (ip->i_flags & XFS_IRECLAIMABLE))
> >  		goto out_skip;
> >  
> > +	/* Skip inodes that are being flushed */
> > +	if ((ip->i_flags & XFS_IFLUSHING) &&
> > +	    (ip->i_flags & XFS_IRECLAIMABLE))
> > +		goto out_skip;
> > +
> >  	/* The inode fits the selection criteria; process it. */
> >  	if (ip->i_flags & XFS_IRECLAIMABLE) {
> >  		/* Drops i_flags_lock and RCU read lock. */
> > -- 
> > 2.31.1
> >
Dave Chinner Nov. 11, 2022, 8:52 p.m. UTC | #3
On Mon, Nov 07, 2022 at 10:36:48PM +0800, Long Li wrote:
> The following error occurred during the fsstress test:
> 
> XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2925
> 
> The problem was that an inode race condition caused an incorrect i_nlink to
> be written to disk and then read back into memory. Consider the following
> call graph: for an inode that is marked as both XFS_IFLUSHING and
> XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
> value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk
> may be set to 1.
> 
>   xfsaild
>       xfs_inode_item_push
>           xfs_iflush_cluster
>               xfs_iflush
>                   xfs_inode_to_disk
> 
>   xfs_iget
>       xfs_iget_cache_hit
>           xfs_iget_recycle
>               xfs_reinit_inode
>                   inode_init_always
> 
> So skip inodes that are being flushed and marked as XFS_IRECLAIMABLE, to
> prevent concurrent reads and writes to the inode.

urk.

xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing
internal inode state and can race with other RCU protected inode
lookups. Have a look at what xfs_iflush_cluster() does - it
grabs the ILOCK_SHARED while under rcu + ip->i_flags_lock, and so
xfs_iflush/xfs_inode_to_disk() are protected from racing inode
updates (during transactions) by that lock.

Hence it looks to me that I_FLUSHING isn't the problem here - it's
that we have a transient modified inode state in xfs_reinit_inode()
that is externally visible...

Cheers,

Dave.
Long Li Nov. 14, 2022, 1:34 p.m. UTC | #4
On Sat, Nov 12, 2022 at 07:52:50AM +1100, Dave Chinner wrote:
> On Mon, Nov 07, 2022 at 10:36:48PM +0800, Long Li wrote:
> > The following error occurred during the fsstress test:
> > 
> > XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2925
> > 
> > The problem was that an inode race condition caused an incorrect i_nlink to
> > be written to disk and then read back into memory. Consider the following
> > call graph: for an inode that is marked as both XFS_IFLUSHING and
> > XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
> > value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk
> > may be set to 1.
> > 
> >   xfsaild
> >       xfs_inode_item_push
> >           xfs_iflush_cluster
> >               xfs_iflush
> >                   xfs_inode_to_disk
> > 
> >   xfs_iget
> >       xfs_iget_cache_hit
> >           xfs_iget_recycle
> >               xfs_reinit_inode
> >                   inode_init_always
> > 
> > So skip inodes that are being flushed and marked as XFS_IRECLAIMABLE, to
> > prevent concurrent reads and writes to the inode.
> 
> urk.
> 
> xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing
> internal inode state and can race with other RCU protected inode
> lookups. Have a look at what xfs_iflush_cluster() does - it
> grabs the ILOCK_SHARED while under rcu + ip->i_flags_lock, and so
> xfs_iflush/xfs_inode_to_disk() are protected from racing inode
> updates (during transactions) by that lock.
> 
> Hence it looks to me that I_FLUSHING isn't the problem here - it's
> that we have a transient modified inode state in xfs_reinit_inode()
> that is externally visible...

Before xfs_reinit_inode(), XFS_IRECLAIM is set in ip->i_flags, which looks
like it should prevent races with other RCU protected inode lookups.
Could we avoid modifying the on-disk values held in the VFS inode in
xfs_reinit_inode()? If so, the lock could be avoided.

Thanks,
Long Li
Dave Chinner Nov. 15, 2022, 12:23 a.m. UTC | #5
On Mon, Nov 14, 2022 at 09:34:17PM +0800, Long Li wrote:
> On Sat, Nov 12, 2022 at 07:52:50AM +1100, Dave Chinner wrote:
> > On Mon, Nov 07, 2022 at 10:36:48PM +0800, Long Li wrote:
> > > The following error occurred during the fsstress test:
> > > 
> > > XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2925
> > > 
> > > The problem was that an inode race condition caused an incorrect i_nlink to
> > > be written to disk and then read back into memory. Consider the following
> > > call graph: for an inode that is marked as both XFS_IFLUSHING and
> > > XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
> > > value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk
> > > may be set to 1.
> > > 
> > >   xfsaild
> > >       xfs_inode_item_push
> > >           xfs_iflush_cluster
> > >               xfs_iflush
> > >                   xfs_inode_to_disk
> > > 
> > >   xfs_iget
> > >       xfs_iget_cache_hit
> > >           xfs_iget_recycle
> > >               xfs_reinit_inode
> > >                   inode_init_always
> > > 
> > > So skip inodes that are being flushed and marked as XFS_IRECLAIMABLE, to
> > > prevent concurrent reads and writes to the inode.
> > 
> > urk.
> > 
> > xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing
> > internal inode state and can race with other RCU protected inode
> > lookups. Have a look at what xfs_iflush_cluster() does - it
> > grabs the ILOCK_SHARED while under rcu + ip->i_flags_lock, and so
> > xfs_iflush/xfs_inode_to_disk() are protected from racing inode
> > updates (during transactions) by that lock.
> > 
> > Hence it looks to me that I_FLUSHING isn't the problem here - it's
> > that we have a transient modified inode state in xfs_reinit_inode()
> > that is externally visible...
> 
> Before xfs_reinit_inode(), XFS_IRECLAIM is set in ip->i_flags, which looks
> like it should prevent races with other RCU protected inode lookups.

That only protects against new lookups - it does not protect against the
IRECLAIM flag being set *after* the lookup in xfs_iflush_cluster()
whilst the inode is being flushed to the cluster buffer. That's why
xfs_iflush_cluster() does:

	rcu_read_lock()
	lookup inode
	spinlock(ip->i_flags_lock);
	check IRECLAIM|IFLUSHING
>>>>>>	xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)     <<<<<<<<
	set IFLUSHING
	spin_unlock(ip->i_flags_lock)
	rcu_read_unlock()

At this point, the only lock that is held is XFS_ILOCK_SHARED, and
it's the only lock that protects the inode state outside the lookup
scope against concurrent changes.

Essentially, xfs_reinit_inode() needs to add a:

	xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)

before it sets IRECLAIM - if it fails to get the ILOCK_EXCL, then we
need to skip the inode, drop out of RCU scope, delay and retry the
lookup.
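
[A minimal sketch of what that might look like in the xfs_iget_cache_hit()
recycle path - hypothetical, not the merged patch; error handling
abbreviated:

	/* The inode fits the selection criteria; process it. */
	if (ip->i_flags & XFS_IRECLAIMABLE) {
		/*
		 * Serialise against xfs_iflush_cluster(), which holds
		 * ILOCK_SHARED while flushing the inode to the cluster
		 * buffer. The trylock is safe here because we are still
		 * under the RCU read lock and ip->i_flags_lock.
		 */
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
			goto out_skip;	/* drop RCU scope, delay, retry */

		/* Drops i_flags_lock and RCU read lock. */
		error = xfs_iget_recycle(pag, ip);
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		if (error)
			return error;
	}]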

> Could we avoid modifying the on-disk values held in the VFS inode in
> xfs_reinit_inode()? If so, the lock could be avoided.

We have to reinit the VFS inode because it has gone through
->destroy_inode and so the state has been trashed. We have to bring
it back as an I_NEW inode, which requires reinitialising everything.
The issue is that we store inode state information (like nlink) in
the VFS inode instead of the XFS inode portion of the structure (to
minimise memory footprint), and that means xfs_reinit_inode() has a
transient state where the VFS inode is not correct. We can avoid
that simply by holding the XFS_ILOCK_EXCL, guaranteeing nothing in
XFS should be trying to read/modify the internal metadata state
while we are reinitialising the VFS inode portion of the
structure...

Cheers,

Dave.
Long Li Nov. 15, 2022, 2:33 p.m. UTC | #6
On Tue, Nov 15, 2022 at 11:23:13AM +1100, Dave Chinner wrote:
> On Mon, Nov 14, 2022 at 09:34:17PM +0800, Long Li wrote:
> > On Sat, Nov 12, 2022 at 07:52:50AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 07, 2022 at 10:36:48PM +0800, Long Li wrote:
> > > > The following error occurred during the fsstress test:
> > > > 
> > > > XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2925
> > > > 
> > > > The problem was that an inode race condition caused an incorrect i_nlink to
> > > > be written to disk and then read back into memory. Consider the following
> > > > call graph: for an inode that is marked as both XFS_IFLUSHING and
> > > > XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
> > > > value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk
> > > > may be set to 1.
> > > > 
> > > >   xfsaild
> > > >       xfs_inode_item_push
> > > >           xfs_iflush_cluster
> > > >               xfs_iflush
> > > >                   xfs_inode_to_disk
> > > > 
> > > >   xfs_iget
> > > >       xfs_iget_cache_hit
> > > >           xfs_iget_recycle
> > > >               xfs_reinit_inode
> > > >                   inode_init_always
> > > > 
> > > > So skip inodes that are being flushed and marked as XFS_IRECLAIMABLE, to
> > > > prevent concurrent reads and writes to the inode.
> > > 
> > > urk.
> > > 
> > > xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing
> > > internal inode state and can race with other RCU protected inode
> > > lookups. Have a look at what xfs_iflush_cluster() does - it
> > > grabs the ILOCK_SHARED while under rcu + ip->i_flags_lock, and so
> > > xfs_iflush/xfs_inode_to_disk() are protected from racing inode
> > > updates (during transactions) by that lock.
> > > 
> > > Hence it looks to me that I_FLUSHING isn't the problem here - it's
> > > that we have a transient modified inode state in xfs_reinit_inode()
> > that is externally visible...
> > 
> > Before xfs_reinit_inode(), XFS_IRECLAIM is set in ip->i_flags, which looks
> > like it should prevent races with other RCU protected inode lookups.
> 
> That only protects against new lookups - it does not protect against the
> IRECLAIM flag being set *after* the lookup in xfs_iflush_cluster()
> whilst the inode is being flushed to the cluster buffer. That's why
> xfs_iflush_cluster() does:
> 
> 	rcu_read_lock()
> 	lookup inode
> 	spinlock(ip->i_flags_lock);
> 	check IRECLAIM|IFLUSHING
> >>>>>>	xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)     <<<<<<<<
> 	set IFLUSHING
> 	spin_unlock(ip->i_flags_lock)
> 	rcu_read_unlock()
> 
> At this point, the only lock that is held is XFS_ILOCK_SHARED, and
> it's the only lock that protects the inode state outside the lookup
> scope against concurrent changes.
> 
> Essentially, xfs_reinit_inode() needs to add a:
> 
> 	xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)
> 
> before it sets IRECLAIM - if it fails to get the ILOCK_EXCL, then we
> need to skip the inode, drop out of RCU scope, delay and retry the
> lookup.
> 
> > Could we avoid modifying the on-disk values held in the VFS inode in
> > xfs_reinit_inode()? If so, the lock could be avoided.
> 
> We have to reinit the VFS inode because it has gone through
> ->destroy_inode and so the state has been trashed. We have to bring
> it back as an I_NEW inode, which requires reinitialising everything.
> The issue is that we store inode state information (like nlink) in
> the VFS inode instead of the XFS inode portion of the structure (to
> minimise memory footprint), and that means xfs_reinit_inode() has a
> transient state where the VFS inode is not correct. We can avoid
> that simply by holding the XFS_ILOCK_EXCL, guaranteeing nothing in
> XFS should be trying to read/modify the internal metadata state
> while we are reinitialising the VFS inode portion of the
> structure...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Thanks for the detailed and clear explanation. Holding the ILOCK_EXCL lock
in xfs_reinit_inode() solves the problem simply; I will resend a
patch. :)

Thanks,
Long Li

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index eae7427062cf..cc68b0ff50ce 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -514,6 +514,11 @@ xfs_iget_cache_hit(
 	    (ip->i_flags & XFS_IRECLAIMABLE))
 		goto out_skip;
 
+	/* Skip inodes that are being flushed */
+	if ((ip->i_flags & XFS_IFLUSHING) &&
+	    (ip->i_flags & XFS_IRECLAIMABLE))
+		goto out_skip;
+
 	/* The inode fits the selection criteria; process it. */
 	if (ip->i_flags & XFS_IRECLAIMABLE) {
 		/* Drops i_flags_lock and RCU read lock. */