vfs: partially sanitize i_state zeroing on inode creation

Message ID	20240611041540.495840-1-mjguzik@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-wm1-f53.google.com (mail-wm1-f53.google.com [209.85.128.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 73DA92570; Tue, 11 Jun 2024 04:15:56 +0000 (UTC) From: Mateusz Guzik <mjguzik@gmail.com> To: brauner@kernel.org Cc: viro@zeniv.linux.org.uk, jack@suse.cz, dave@fromorbit.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Mateusz Guzik <mjguzik@gmail.com> Subject: [PATCH] vfs: partially sanitize i_state zeroing on inode creation Date: Tue, 11 Jun 2024 06:15:40 +0200 Message-ID: <20240611041540.495840-1-mjguzik@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	vfs: partially sanitize i_state zeroing on inode creation \| expand vfs: partially sanitize i_state zeroing on inode creation

Mateusz Guzik June 11, 2024, 4:15 a.m. UTC

new_inode used to have the following:
	spin_lock(&inode_lock);
	inodes_stat.nr_inodes++;
	list_add(&inode->i_list, &inode_in_use);
	list_add(&inode->i_sb_list, &sb->s_inodes);
	inode->i_ino = ++last_ino;
	inode->i_state = 0;
	spin_unlock(&inode_lock);

over time things disappeared, got moved around or got replaced (global
inode lock with a per-inode lock), eventually this got reduced to:
	spin_lock(&inode->i_lock);
	inode->i_state = 0;
	spin_unlock(&inode->i_lock);

But the lock acquire here does not synchronize against anyone.

Additionally iget5_locked performs i_state = 0 assignment without any
locks to begin with and the two combined look confusing at best.

It looks like the current state is a leftover which was not cleaned up.

Ideally it would be an invariant that i_state == 0 to begin with, but
achieving that would require dealing with all filesystem alloc handlers
one by one.

In the meantime drop the misleading locking and move i_state zeroing to
alloc_inode so that others don't need to deal with it by hand.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

I diffed this against fs-next + my inode hash patch as it adds one
i_state = 0 case. Should that patch not be accepted this bit can be
easily dropped from this one.

I brought the entire thing up quite some time ago [1] and Dave Chinner
noted that perhaps the lock has a side effect of providing memory
barriers which otherwise would not be there and which are needed by
someone.

For new_inode and alloc_inode consumers all fences are already there
anyway due to immediate lock usage.

Arguably new_inode_pseudo escape without it but I don't find the code at
hand to be affected in any meanignful way -- the only 2 consumers
(get_pipe_inode and sock_alloc) perform numerous other stores to the
inode immediately after. By the time it gets added to anything looking
at i_state, flushing that should be handled by whatever thing which adds
it. Mentioning this just in case.

[1] https://lore.kernel.org/all/CAGudoHF_Y0shcU+AMRRdN5RQgs9L_HHvBH8D4K=7_0X72kYy2g@mail.gmail.com/

 fs/inode.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

Jan Kara June 11, 2024, 10:02 a.m. UTC | #1

On Tue 11-06-24 06:15:40, Mateusz Guzik wrote:
> new_inode used to have the following:
> 	spin_lock(&inode_lock);
> 	inodes_stat.nr_inodes++;
> 	list_add(&inode->i_list, &inode_in_use);
> 	list_add(&inode->i_sb_list, &sb->s_inodes);
> 	inode->i_ino = ++last_ino;
> 	inode->i_state = 0;
> 	spin_unlock(&inode_lock);
> 
> over time things disappeared, got moved around or got replaced (global
> inode lock with a per-inode lock), eventually this got reduced to:
> 	spin_lock(&inode->i_lock);
> 	inode->i_state = 0;
> 	spin_unlock(&inode->i_lock);
> 
> But the lock acquire here does not synchronize against anyone.
> 
> Additionally iget5_locked performs i_state = 0 assignment without any
> locks to begin with and the two combined look confusing at best.
> 
> It looks like the current state is a leftover which was not cleaned up.
> 
> Ideally it would be an invariant that i_state == 0 to begin with, but
> achieving that would require dealing with all filesystem alloc handlers
> one by one.
> 
> In the meantime drop the misleading locking and move i_state zeroing to
> alloc_inode so that others don't need to deal with it by hand.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

Good point. But the initialization would seem more natural in
inode_init_always(), wouldn't it? And that will also address your "FIXME"
comment.

								Honza

> ---
> 
> I diffed this against fs-next + my inode hash patch as it adds one
> i_state = 0 case. Should that patch not be accepted this bit can be
> easily dropped from this one.
> 
> I brought the entire thing up quite some time ago [1] and Dave Chinner
> noted that perhaps the lock has a side effect of providing memory
> barriers which otherwise would not be there and which are needed by
> someone.
> 
> For new_inode and alloc_inode consumers all fences are already there
> anyway due to immediate lock usage.
> 
> Arguably new_inode_pseudo escape without it but I don't find the code at
> hand to be affected in any meanignful way -- the only 2 consumers
> (get_pipe_inode and sock_alloc) perform numerous other stores to the
> inode immediately after. By the time it gets added to anything looking
> at i_state, flushing that should be handled by whatever thing which adds
> it. Mentioning this just in case.
> 
> [1] https://lore.kernel.org/all/CAGudoHF_Y0shcU+AMRRdN5RQgs9L_HHvBH8D4K=7_0X72kYy2g@mail.gmail.com/
> 
>  fs/inode.c | 15 +++++----------
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 149adf8ab0ea..3967e68311a6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -276,6 +276,10 @@ static struct inode *alloc_inode(struct super_block *sb)
>  		return NULL;
>  	}
>  
> +	/*
> +	 * FIXME: the code should be able to assert i_state == 0 instead.
> +	 */
> +	inode->i_state = 0;
>  	return inode;
>  }
>  
> @@ -1023,14 +1027,7 @@ EXPORT_SYMBOL(get_next_ino);
>   */
>  struct inode *new_inode_pseudo(struct super_block *sb)
>  {
> -	struct inode *inode = alloc_inode(sb);
> -
> -	if (inode) {
> -		spin_lock(&inode->i_lock);
> -		inode->i_state = 0;
> -		spin_unlock(&inode->i_lock);
> -	}
> -	return inode;
> +	return alloc_inode(sb);
>  }
>  
>  /**
> @@ -1254,7 +1251,6 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
>  		struct inode *new = alloc_inode(sb);
>  
>  		if (new) {
> -			new->i_state = 0;
>  			inode = inode_insert5(new, hashval, test, set, data);
>  			if (unlikely(inode != new))
>  				destroy_inode(new);
> @@ -1297,7 +1293,6 @@ struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
>  
>  	new = alloc_inode(sb);
>  	if (new) {
> -		new->i_state = 0;
>  		inode = inode_insert5(new, hashval, test, set, data);
>  		if (unlikely(inode != new))
>  			destroy_inode(new);
> -- 
> 2.43.0
>

Mateusz Guzik June 11, 2024, 10:23 a.m. UTC | #2

On Tue, Jun 11, 2024 at 12:02:22PM +0200, Jan Kara wrote:
> On Tue 11-06-24 06:15:40, Mateusz Guzik wrote:
> > new_inode used to have the following:
> > 	spin_lock(&inode_lock);
> > 	inodes_stat.nr_inodes++;
> > 	list_add(&inode->i_list, &inode_in_use);
> > 	list_add(&inode->i_sb_list, &sb->s_inodes);
> > 	inode->i_ino = ++last_ino;
> > 	inode->i_state = 0;
> > 	spin_unlock(&inode_lock);
> > 
> > over time things disappeared, got moved around or got replaced (global
> > inode lock with a per-inode lock), eventually this got reduced to:
> > 	spin_lock(&inode->i_lock);
> > 	inode->i_state = 0;
> > 	spin_unlock(&inode->i_lock);
> > 
> > But the lock acquire here does not synchronize against anyone.
> > 
> > Additionally iget5_locked performs i_state = 0 assignment without any
> > locks to begin with and the two combined look confusing at best.
> > 
> > It looks like the current state is a leftover which was not cleaned up.
> > 
> > Ideally it would be an invariant that i_state == 0 to begin with, but
> > achieving that would require dealing with all filesystem alloc handlers
> > one by one.
> > 
> > In the meantime drop the misleading locking and move i_state zeroing to
> > alloc_inode so that others don't need to deal with it by hand.
> > 
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> 
> Good point. But the initialization would seem more natural in
> inode_init_always(), wouldn't it? And that will also address your "FIXME"
> comment.
> 

My point is that by the time the inode is destroyed some of the fields
like i_state should be set to a well-known value, this one preferably
plain 0.

I did not patch inode_init_always because it is exported and xfs uses it
in 2 spots, only one of which zeroing the thing immediately after.
Another one is a little more involved, it probably would not be a
problem as the value is set altered later anyway, but I don't want to
mess with semantics of the func if it can be easily avoided.

Dave Chinner June 11, 2024, 10:56 a.m. UTC | #3

On Tue, Jun 11, 2024 at 12:23:59PM +0200, Mateusz Guzik wrote:
> On Tue, Jun 11, 2024 at 12:02:22PM +0200, Jan Kara wrote:
> > On Tue 11-06-24 06:15:40, Mateusz Guzik wrote:
> > > new_inode used to have the following:
> > > 	spin_lock(&inode_lock);
> > > 	inodes_stat.nr_inodes++;
> > > 	list_add(&inode->i_list, &inode_in_use);
> > > 	list_add(&inode->i_sb_list, &sb->s_inodes);
> > > 	inode->i_ino = ++last_ino;
> > > 	inode->i_state = 0;
> > > 	spin_unlock(&inode_lock);
> > > 
> > > over time things disappeared, got moved around or got replaced (global
> > > inode lock with a per-inode lock), eventually this got reduced to:
> > > 	spin_lock(&inode->i_lock);
> > > 	inode->i_state = 0;
> > > 	spin_unlock(&inode->i_lock);
> > > 
> > > But the lock acquire here does not synchronize against anyone.
> > > 
> > > Additionally iget5_locked performs i_state = 0 assignment without any
> > > locks to begin with and the two combined look confusing at best.
> > > 
> > > It looks like the current state is a leftover which was not cleaned up.
> > > 
> > > Ideally it would be an invariant that i_state == 0 to begin with, but
> > > achieving that would require dealing with all filesystem alloc handlers
> > > one by one.
> > > 
> > > In the meantime drop the misleading locking and move i_state zeroing to
> > > alloc_inode so that others don't need to deal with it by hand.
> > > 
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > 
> > Good point. But the initialization would seem more natural in
> > inode_init_always(), wouldn't it? And that will also address your "FIXME"
> > comment.
> > 
> 
> My point is that by the time the inode is destroyed some of the fields
> like i_state should be set to a well-known value, this one preferably
> plain 0.
>
> I did not patch inode_init_always because it is exported and xfs uses it
> in 2 spots, only one of which zeroing the thing immediately after.
> Another one is a little more involved, it probably would not be a
> problem as the value is set altered later anyway, but I don't want to
> mess with semantics of the func if it can be easily avoided.

Better to move the zeroing to inode_init_always(), do the basic
save/restore mod to xfs_reinit_inode(), and let us XFS people worry
about whether inode_init_always() is the right thing to be calling
in their code...

All you'd need to do in xfs_reinit_inode() is this

+	unsigned long	state = inode->i_state;

	.....
	error = inode_init_always(mp->m_super, inode);
	.....
+	inode->i_state = state;
	.....

And it should behave as expected.

-Dave.

Jan Kara June 11, 2024, 11:05 a.m. UTC | #4

On Tue 11-06-24 12:23:59, Mateusz Guzik wrote:
> On Tue, Jun 11, 2024 at 12:02:22PM +0200, Jan Kara wrote:
> > On Tue 11-06-24 06:15:40, Mateusz Guzik wrote:
> > > new_inode used to have the following:
> > > 	spin_lock(&inode_lock);
> > > 	inodes_stat.nr_inodes++;
> > > 	list_add(&inode->i_list, &inode_in_use);
> > > 	list_add(&inode->i_sb_list, &sb->s_inodes);
> > > 	inode->i_ino = ++last_ino;
> > > 	inode->i_state = 0;
> > > 	spin_unlock(&inode_lock);
> > > 
> > > over time things disappeared, got moved around or got replaced (global
> > > inode lock with a per-inode lock), eventually this got reduced to:
> > > 	spin_lock(&inode->i_lock);
> > > 	inode->i_state = 0;
> > > 	spin_unlock(&inode->i_lock);
> > > 
> > > But the lock acquire here does not synchronize against anyone.
> > > 
> > > Additionally iget5_locked performs i_state = 0 assignment without any
> > > locks to begin with and the two combined look confusing at best.
> > > 
> > > It looks like the current state is a leftover which was not cleaned up.
> > > 
> > > Ideally it would be an invariant that i_state == 0 to begin with, but
> > > achieving that would require dealing with all filesystem alloc handlers
> > > one by one.
> > > 
> > > In the meantime drop the misleading locking and move i_state zeroing to
> > > alloc_inode so that others don't need to deal with it by hand.
> > > 
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > 
> > Good point. But the initialization would seem more natural in
> > inode_init_always(), wouldn't it? And that will also address your "FIXME"
> > comment.
> > 
> 
> My point is that by the time the inode is destroyed some of the fields
> like i_state should be set to a well-known value, this one preferably
> plain 0.

Well, i_state is set to a more or less well defined value but it is not
zero. I don't see a performance difference in whether set it to 0 on
freeing or on allocation and on allocation it is actually much easier to
find when reading the code.

> I did not patch inode_init_always because it is exported and xfs uses it
> in 2 spots, only one of which zeroing the thing immediately after.
> Another one is a little more involved, it probably would not be a
> problem as the value is set altered later anyway, but I don't want to
> mess with semantics of the func if it can be easily avoided.

Well, I'd consider that as another good reason to actually clean this up.
Look, inode_init_always() is used in bcachefs and xfs. bcachefs sets
i_state to 0 just before calling inode_init_always(), xfs just after one
call of inode_init_always() and the call in xfs_reinit_inode() is used
only from xfs_iget_recycle() which sets i_state to I_NEW. So I claim that
moving i_state clearing to inode_init_always() will not cause any issue and
is actually desirable.

								Honza

Mateusz Guzik June 11, 2024, 11:26 a.m. UTC | #5

On Tue, Jun 11, 2024 at 08:56:40PM +1000, Dave Chinner wrote:
> On Tue, Jun 11, 2024 at 12:23:59PM +0200, Mateusz Guzik wrote:
> > I did not patch inode_init_always because it is exported and xfs uses it
> > in 2 spots, only one of which zeroing the thing immediately after.
> > Another one is a little more involved, it probably would not be a
> > problem as the value is set altered later anyway, but I don't want to
> > mess with semantics of the func if it can be easily avoided.
> 
> Better to move the zeroing to inode_init_always(), do the basic
> save/restore mod to xfs_reinit_inode(), and let us XFS people worry
> about whether inode_init_always() is the right thing to be calling
> in their code...
> 
> All you'd need to do in xfs_reinit_inode() is this
> 
> +	unsigned long	state = inode->i_state;
> 
> 	.....
> 	error = inode_init_always(mp->m_super, inode);
> 	.....
> +	inode->i_state = state;
> 	.....
> 
> And it should behave as expected.
> 

Ok, so what would be the logistics of submitting this?

Can I submit one patch which includes the above + i_state moved to
inode_init_always?

Do I instead ship a two-part patchset, starting with the xfs change and
stating it was your idea?

Something else?

Fwiw inode_init_always consumer rundown is:
- fs/inode.c which is automagically covered
- bcachefs pre-zeroing state before even calling inode_init_always
- xfs with one spot which zeroes immediately after the call
- xfs with one spot which possibly avoids zeroing

Mateusz Guzik June 11, 2024, 11:34 a.m. UTC | #6

On Tue, Jun 11, 2024 at 01:05:05PM +0200, Jan Kara wrote:
> On Tue 11-06-24 12:23:59, Mateusz Guzik wrote:
> > On Tue, Jun 11, 2024 at 12:02:22PM +0200, Jan Kara wrote:
> > > On Tue 11-06-24 06:15:40, Mateusz Guzik wrote:
> > > > new_inode used to have the following:
> > > > 	spin_lock(&inode_lock);
> > > > 	inodes_stat.nr_inodes++;
> > > > 	list_add(&inode->i_list, &inode_in_use);
> > > > 	list_add(&inode->i_sb_list, &sb->s_inodes);
> > > > 	inode->i_ino = ++last_ino;
> > > > 	inode->i_state = 0;
> > > > 	spin_unlock(&inode_lock);
> > > > 
> > > > over time things disappeared, got moved around or got replaced (global
> > > > inode lock with a per-inode lock), eventually this got reduced to:
> > > > 	spin_lock(&inode->i_lock);
> > > > 	inode->i_state = 0;
> > > > 	spin_unlock(&inode->i_lock);
> > > > 
> > > > But the lock acquire here does not synchronize against anyone.
> > > > 
> > > > Additionally iget5_locked performs i_state = 0 assignment without any
> > > > locks to begin with and the two combined look confusing at best.
> > > > 
> > > > It looks like the current state is a leftover which was not cleaned up.
> > > > 
> > > > Ideally it would be an invariant that i_state == 0 to begin with, but
> > > > achieving that would require dealing with all filesystem alloc handlers
> > > > one by one.
> > > > 
> > > > In the meantime drop the misleading locking and move i_state zeroing to
> > > > alloc_inode so that others don't need to deal with it by hand.
> > > > 
> > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > > 
> > > Good point. But the initialization would seem more natural in
> > > inode_init_always(), wouldn't it? And that will also address your "FIXME"
> > > comment.
> > > 
> > 
> > My point is that by the time the inode is destroyed some of the fields
> > like i_state should be set to a well-known value, this one preferably
> > plain 0.
> 
> Well, i_state is set to a more or less well defined value but it is not
> zero. I don't see a performance difference in whether set it to 0 on
> freeing or on allocation and on allocation it is actually much easier to
> find when reading the code.
> 

I was thinking more about assertion potential, not anything
perf-related, but it is a moot subject now.

> > I did not patch inode_init_always because it is exported and xfs uses it
> > in 2 spots, only one of which zeroing the thing immediately after.
> > Another one is a little more involved, it probably would not be a
> > problem as the value is set altered later anyway, but I don't want to
> > mess with semantics of the func if it can be easily avoided.
> 
> Well, I'd consider that as another good reason to actually clean this up.
> Look, inode_init_always() is used in bcachefs and xfs. bcachefs sets
> i_state to 0 just before calling inode_init_always(), xfs just after one
> call of inode_init_always() and the call in xfs_reinit_inode() is used
> only from xfs_iget_recycle() which sets i_state to I_NEW. So I claim that
> moving i_state clearing to inode_init_always() will not cause any issue and
> is actually desirable.
> 

Ok, see my reply to Dave's e-mail.

Just tell me how to ship this and I'll do the needful(tm).

Jan Kara June 11, 2024, 11:40 a.m. UTC | #7

On Tue 11-06-24 13:26:45, Mateusz Guzik wrote:
> On Tue, Jun 11, 2024 at 08:56:40PM +1000, Dave Chinner wrote:
> > On Tue, Jun 11, 2024 at 12:23:59PM +0200, Mateusz Guzik wrote:
> > > I did not patch inode_init_always because it is exported and xfs uses it
> > > in 2 spots, only one of which zeroing the thing immediately after.
> > > Another one is a little more involved, it probably would not be a
> > > problem as the value is set altered later anyway, but I don't want to
> > > mess with semantics of the func if it can be easily avoided.
> > 
> > Better to move the zeroing to inode_init_always(), do the basic
> > save/restore mod to xfs_reinit_inode(), and let us XFS people worry
> > about whether inode_init_always() is the right thing to be calling
> > in their code...
> > 
> > All you'd need to do in xfs_reinit_inode() is this
> > 
> > +	unsigned long	state = inode->i_state;
> > 
> > 	.....
> > 	error = inode_init_always(mp->m_super, inode);
> > 	.....
> > +	inode->i_state = state;
> > 	.....
> > 
> > And it should behave as expected.
> > 
> 
> Ok, so what would be the logistics of submitting this?
> 
> Can I submit one patch which includes the above + i_state moved to
> inode_init_always?
> 
> Do I instead ship a two-part patchset, starting with the xfs change and
> stating it was your idea?
> 
> Something else?

Well, I'd do it as 4 patches actually:

1) xfs i_state workaround in xfs_reinit_inode()
2) add i_state zeroing to inode_init_always(), drop pointless zeroing from
VFS.
3) drop now pointless zeroing from xfs
4) drop now pointless zeroing from bcachefs

This way also respective maintainers can easily ack the bits they care about.
I can live with two as you suggest since the changes are tiny but four is
IMHO a "proper" way to do things ;).

								Honza

vfs: partially sanitize i_state zeroing on inode creation

Commit Message

Comments

Patch