
fs: prevent data-race due to missing inode_lock when calling vfs_getattr

Message ID 20241117163719.39750-1-aha310510@gmail.com (mailing list archive)
State New
Series fs: prevent data-race due to missing inode_lock when calling vfs_getattr

Commit Message

Jeongjun Park Nov. 17, 2024, 4:37 p.m. UTC
Many filesystems take the inode lock before calling vfs_getattr, so there
is no data race on those inodes. However, some functions in fs/stat.c call
vfs_getattr without locking the inode, so a data race occurs.

Therefore, apply this patch to remove the long-standing data race on
inodes in the functions that do not take the lock.

Cc: <stable@vger.kernel.org>
Fixes: da9aa5d96bfe ("fs: remove vfs_statx_fd")
Fixes: 0ef625bba6fb ("vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
---
 fs/stat.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

--

Comments

Al Viro Nov. 17, 2024, 4:55 p.m. UTC | #1
On Mon, Nov 18, 2024 at 01:37:19AM +0900, Jeongjun Park wrote:
> Many filesystems lock inodes before calling vfs_getattr, so there is no
> data-race for inodes. However, some functions in fs/stat.c that call
> vfs_getattr do not lock inodes, so the data-race occurs.
> 
> Therefore, we need to apply a patch to remove the long-standing data-race
> for inodes in some functions that do not lock inodes.

Why do we care?  Slapping even a shared lock on a _very_ hot path, with
possible considerable latency, would need more than "theoretically it's
a data race".
Dave Chinner Nov. 17, 2024, 10:52 p.m. UTC | #2
On Mon, Nov 18, 2024 at 01:37:19AM +0900, Jeongjun Park wrote:
> Many filesystems lock inodes before calling vfs_getattr, so there is no
> data-race for inodes. However, some functions in fs/stat.c that call
> vfs_getattr do not lock inodes, so the data-race occurs.
> 
> Therefore, we need to apply a patch to remove the long-standing data-race
> for inodes in some functions that do not lock inodes.

The lock does nothing useful here. The moment the lock is dropped,
the information in the stat buffer is out of date (i.e. stale) and
callers need to treat it as such. i.e. stat data is a point in time
snapshot of inode state and nothing more.

Holding the inode lock over the getattr call does not change this -
the information returned by getattr is not guaranteed to be up to
date by the time the caller reads it.

i.e. If a caller needs stat information to be serialised against
other operations on the inode, then it needs to hold the inode lock
itself....

-Dave.
Jeongjun Park Nov. 18, 2024, 6 a.m. UTC | #3
Hello,

> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> On Mon, Nov 18, 2024 at 01:37:19AM +0900, Jeongjun Park wrote:
>> Many filesystems lock inodes before calling vfs_getattr, so there is no
>> data-race for inodes. However, some functions in fs/stat.c that call
>> vfs_getattr do not lock inodes, so the data-race occurs.
>> 
>> Therefore, we need to apply a patch to remove the long-standing data-race
>> for inodes in some functions that do not lock inodes.
> 
> Why do we care?  Slapping even a shared lock on a _very_ hot path, with
> possible considerable latency, would need more than "theoretically it's
> a data race".

All the functions that take the lock in this patch are reached only via
syscalls, so in most cases there will be no noticeable performance impact.
And this data race is not a problem that occurs only in theory: it is
a bug that syzbot has been reporting for years. Many filesystems in the
kernel take inode_lock before calling vfs_getattr, so no data race occurs
there, but fs/stat.c alone has had this data race for years. This alone
shows that adding inode_lock to these functions is a good way to solve
the problem without much performance degradation.

Regards,

Jeongjun Park
Al Viro Nov. 18, 2024, 7:03 a.m. UTC | #4
On Mon, Nov 18, 2024 at 03:00:39PM +0900, Jeongjun Park wrote:
> 
> Hello,
> 
> > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > 
> > On Mon, Nov 18, 2024 at 01:37:19AM +0900, Jeongjun Park wrote:
> >> Many filesystems lock inodes before calling vfs_getattr, so there is no
> >> data-race for inodes. However, some functions in fs/stat.c that call
> >> vfs_getattr do not lock inodes, so the data-race occurs.
> >> 
> >> Therefore, we need to apply a patch to remove the long-standing data-race
> >> for inodes in some functions that do not lock inodes.
> > 
> > Why do we care?  Slapping even a shared lock on a _very_ hot path, with
> > possible considerable latency, would need more than "theoretically it's
> > a data race".
> 
> All the functions that added lock in this patch are called only via syscall,
> so in most cases there will be no noticeable performance issue.

Pardon me, but I am unable to follow your reasoning.

> And
> this data-race is not a problem that only occurs in theory. It is
> a bug that syzbot has been reporting for years. Many file systems that
> exist in the kernel lock inode_lock before calling vfs_getattr, so
> data-race does not occur, but only fs/stat.c has had a data-race
> for years. This alone shows that adding inode_lock to some
> functions is a good way to solve the problem without much 
> performance degradation.

Explain.  First of all, these are, by far, the most frequent callers
of vfs_getattr(); what "many filesystems" are doing around their calls
of the same is irrelevant.  Which filesystems, BTW?  And which call
chains are you talking about?  Most of the filesystems never call it
at all.

Furthermore, on a lot of userland loads stat(2) is a very hot path -
it is called a lot.  And the rwsem in question has plenty of takers -
both shared and exclusive.  The effect of piling a lot of threads
that grab it shared on top of the existing mix is not something
I am ready to predict without experiments - not beyond "likely to be
unpleasant, possibly very much so".

Finally, you have not offered any explanations of the reasons why
that data race matters - and "syzbot reporting" is not one.  It is
possible that actual observable bugs exist, but it would be useful
to have at least one of those described in details.

Please, spell your reasoning out.  Note that fetch overlapping with
store is *NOT* a bug in itself.  It may become such if you observe
an object in inconsistent state - e.g. on a 32bit architecture
reading a 64bit value in parallel with assignment to the same may
end up with a problem.  And yes, we do have just such a value
read there - inode size.  Which is why i_size_read() is used there,
with matching i_size_write() in the writers.

Details matter; what is and what is not an inconsistent state
really does depend upon the object you are talking about.
There's no way in hell for syzbot to be able to determine that.
Mateusz Guzik Nov. 20, 2024, 1:44 a.m. UTC | #5
On Mon, Nov 18, 2024 at 07:03:30AM +0000, Al Viro wrote:
> On Mon, Nov 18, 2024 at 03:00:39PM +0900, Jeongjun Park wrote:
> > All the functions that added lock in this patch are called only via syscall,
> > so in most cases there will be no noticeable performance issue.
> 
> Pardon me, but I am unable to follow your reasoning.
> 

I suspect the argument is that the overhead of issuing a syscall is big
enough that the extra cost of the lock round-trip won't be visible, but
that's not accurate -- atomics are measurable when added to syscalls,
even on modern CPUs.

> > And
> > this data-race is not a problem that only occurs in theory. It is
> > a bug that syzbot has been reporting for years. Many file systems that
> > exist in the kernel lock inode_lock before calling vfs_getattr, so
> > data-race does not occur, but only fs/stat.c has had a data-race
> > for years. This alone shows that adding inode_lock to some
> > functions is a good way to solve the problem without much 
> > performance degradation.
> 
> Explain.  First of all, these are, by far, the most frequent callers
> of vfs_getattr(); what "many filesystems" are doing around their calls
> of the same is irrelevant.  Which filesystems, BTW?  And which call
> chains are you talking about?  Most of the filesystems never call it
> at all.
> 
> Furthermore, on a lot of userland loads stat(2) is a very hot path -
> it is called a lot.  And the rwsem in question has a plenty of takers -
> both shared and exclusive.  The effect of piling a lot of threads
> that grab it shared on top of the existing mix is not something
> I am ready to predict without experiments - not beyond "likely to be
> unpleasant, possibly very much so".
> 
> Finally, you have not offered any explanations of the reasons why
> that data race matters - and "syzbot reporting" is not one.  It is
> possible that actual observable bugs exist, but it would be useful
> to have at least one of those described in details.
> 
[snip]

On the stock kernel it is at least theoretically possible to transiently
observe a state which is mid-update (as in not valid), but I was under
the impression this was known and considered not a problem.

Nonetheless, as an example say an inode is owned by 0:0 and is being
chowned to 1:1 and this is handled by setattr_copy.

The ids are updated one after another:
[snip]
        i_uid_update(idmap, attr, inode);
        i_gid_update(idmap, attr, inode);
[/snip]

So at least in principle it may be someone issuing getattr in parallel
will happen to spot 1:0 (as opposed to 0:0 or 1:1), which was never set
on the inode and is merely an artifact of hitting the timing.

This would be a bug, but I don't believe this is serious enough to
justify taking the inode lock to get out of. 

Worst case, if someone is to handle this, I guess the obvious approach
introduces a sequence counter to be modified around setattr.  Then
getattr could retry a few times in case of a seqc comparison failure and
only give up and take the lock afterwards. This would probably avoid
taking the lock in getattr for all real cases, even in face of racing
setattr.
Al Viro Nov. 20, 2024, 2:08 a.m. UTC | #6
On Wed, Nov 20, 2024 at 02:44:17AM +0100, Mateusz Guzik wrote:

> > Pardon me, but I am unable to follow your reasoning.
> > 
> 
> I suspect the argument is that the overhead of issuing a syscall is big
> enough that the extra cost of taking the lock trip wont be visible, but
> that's not accurate -- atomics are measurable when added to syscalls,
> even on modern CPUs.

Blocking is even more noticeable, and the sucker can be contended.  And not
just by chmod() et al. - write() will do it, for example.

> Nonetheless, as an example say an inode is owned by 0:0 and is being
> chowned to 1:1 and this is handled by setattr_copy.
> 
> The ids are updated one after another:
> [snip]
>         i_uid_update(idmap, attr, inode);
>         i_gid_update(idmap, attr, inode);
> [/snip]
> 
> So at least in principle it may be someone issuing getattr in parallel
> will happen to spot 1:0 (as opposed to 0:0 or 1:1), which was never set
> on the inode and is merely an artifact of hitting the timing.
> 
> This would be a bug, but I don't believe this is serious enough to
> justify taking the inode lock to get out of. 

If anything, such scenarios would be more interesting for permission checks...
Mateusz Guzik Nov. 20, 2024, 2:29 a.m. UTC | #7
On Wed, Nov 20, 2024 at 3:08 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Nov 20, 2024 at 02:44:17AM +0100, Mateusz Guzik wrote:
>
> > > Pardon me, but I am unable to follow your reasoning.
> > >
> >
> > I suspect the argument is that the overhead of issuing a syscall is big
> > enough that the extra cost of taking the lock trip wont be visible, but
> > that's not accurate -- atomics are measurable when added to syscalls,
> > even on modern CPUs.
>
> Blocking is even more noticable, and the sucker can be contended.  And not
> just by chmod() et.al. - write() will do it, for example.
>

Yeah, I was going for the best-case scenario.

> > Nonetheless, as an example say an inode is owned by 0:0 and is being
> > chowned to 1:1 and this is handled by setattr_copy.
> >
> > The ids are updated one after another:
> > [snip]
> >         i_uid_update(idmap, attr, inode);
> >         i_gid_update(idmap, attr, inode);
> > [/snip]
> >
> > So at least in principle it may be someone issuing getattr in parallel
> > will happen to spot 1:0 (as opposed to 0:0 or 1:1), which was never set
> > on the inode and is merely an artifact of hitting the timing.
> >
> > This would be a bug, but I don't believe this is serious enough to
> > justify taking the inode lock to get out of.
>
> If anything, such scenarios would be more interesting for permission checks...

This indeed came up in that context, I can't be arsed to find the
specific e-mail. Somewhere around looking at eliding lockref in favor
of rcu-only operation I noted that inodes can arbitrarily change
during permission checks (including LSMs) and currently there are no
means to detect that. If memory serves, Christian said this is known
and if LSMs want better, it's their business to do it. FWIW I think for
perms some machinery (maybe with sequence counters) is warranted, but
I have no interest in fighting about the subject.
Christian Brauner Nov. 20, 2024, 9:19 a.m. UTC | #8
On Wed, Nov 20, 2024 at 02:44:17AM +0100, Mateusz Guzik wrote:
> On Mon, Nov 18, 2024 at 07:03:30AM +0000, Al Viro wrote:
> > On Mon, Nov 18, 2024 at 03:00:39PM +0900, Jeongjun Park wrote:
> > > All the functions that added lock in this patch are called only via syscall,
> > > so in most cases there will be no noticeable performance issue.
> > 
> > Pardon me, but I am unable to follow your reasoning.
> > 
> 
> I suspect the argument is that the overhead of issuing a syscall is big
> enough that the extra cost of taking the lock trip wont be visible, but
> that's not accurate -- atomics are measurable when added to syscalls,
> even on modern CPUs.
> 
> > > And
> > > this data-race is not a problem that only occurs in theory. It is
> > > a bug that syzbot has been reporting for years. Many file systems that
> > > exist in the kernel lock inode_lock before calling vfs_getattr, so
> > > data-race does not occur, but only fs/stat.c has had a data-race
> > > for years. This alone shows that adding inode_lock to some
> > > functions is a good way to solve the problem without much 
> > > performance degradation.
> > 
> > Explain.  First of all, these are, by far, the most frequent callers
> > of vfs_getattr(); what "many filesystems" are doing around their calls
> > of the same is irrelevant.  Which filesystems, BTW?  And which call
> > chains are you talking about?  Most of the filesystems never call it
> > at all.
> > 
> > Furthermore, on a lot of userland loads stat(2) is a very hot path -
> > it is called a lot.  And the rwsem in question has a plenty of takers -
> > both shared and exclusive.  The effect of piling a lot of threads
> > that grab it shared on top of the existing mix is not something
> > I am ready to predict without experiments - not beyond "likely to be
> > unpleasant, possibly very much so".
> > 
> > Finally, you have not offered any explanations of the reasons why
> > that data race matters - and "syzbot reporting" is not one.  It is
> > possible that actual observable bugs exist, but it would be useful
> > to have at least one of those described in details.
> > 
> [snip]
> 
> On the stock kernel it is at least theoretically possible to transiently
> observe a state which is mid-update (as in not valid), but I was under
> the impression this was known and considered not a problem.

Exactly.

> 
> Nonetheless, as an example say an inode is owned by 0:0 and is being
> chowned to 1:1 and this is handled by setattr_copy.
> 
> The ids are updated one after another:
> [snip]
>         i_uid_update(idmap, attr, inode);
>         i_gid_update(idmap, attr, inode);
> [/snip]
> 
> So at least in principle it may be someone issuing getattr in parallel
> will happen to spot 1:0 (as opposed to 0:0 or 1:1), which was never set
> on the inode and is merely an artifact of hitting the timing.
> 
> This would be a bug, but I don't believe this is serious enough to
> justify taking the inode lock to get out of. 

I don't think this is a serious issue. We don't guarantee consistent
snapshots and I don't see a reason why we should complicate setattr()
for that.

Patch

diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e..da532b611aa3 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -220,13 +220,21 @@  EXPORT_SYMBOL(vfs_getattr);
  */
 int vfs_fstat(int fd, struct kstat *stat)
 {
+	const struct path *path;
+	struct inode *inode;
 	struct fd f;
 	int error;
 
 	f = fdget_raw(fd);
 	if (!fd_file(f))
 		return -EBADF;
-	error = vfs_getattr(&fd_file(f)->f_path, stat, STATX_BASIC_STATS, 0);
+
+	path = &fd_file(f)->f_path;
+	inode = d_backing_inode(path->dentry);
+
+	inode_lock_shared(inode);
+	error = vfs_getattr(path, stat, STATX_BASIC_STATS, 0);
+	inode_unlock_shared(inode);
 	fdput(f);
 	return error;
 }
@@ -248,7 +256,11 @@  int getname_statx_lookup_flags(int flags)
 static int vfs_statx_path(struct path *path, int flags, struct kstat *stat,
 			  u32 request_mask)
 {
+	struct inode *inode = d_backing_inode(path->dentry);
+
+	inode_lock_shared(inode);
 	int error = vfs_getattr(path, stat, request_mask, flags);
+	inode_unlock_shared(inode);
 
 	if (request_mask & STATX_MNT_ID_UNIQUE) {
 		stat->mnt_id = real_mount(path->mnt)->mnt_id_unique;