bcachefs: Fix sysfs warning in fstests generic/730,731

Message ID 20241012184239.3785089-1-kent.overstreet@linux.dev (mailing list archive)
State New
Series bcachefs: Fix sysfs warning in fstests generic/730,731

Commit Message

Kent Overstreet Oct. 12, 2024, 6:42 p.m. UTC
sysfs warns if we're removing a symlink from a directory that's no
longer in sysfs; this is triggered by fstests generic/730, which
simulates hot removal of a block device.

This patch is however not a correct fix, since checking
kobj->state_in_sysfs on a kobj owned by another subsystem is racy.

A better fix would be to add the appropriate check to
sysfs_remove_link() - and sysfs_create_link() as well.

But kobject_add_internal()/kobject_del() do not as of today have locking
that would support that.

Note that the block/holder.c code appears to be subject to this race as
well.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/super.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

Comments

Christoph Hellwig Oct. 14, 2024, 6:10 a.m. UTC | #1
On Sat, Oct 12, 2024 at 02:42:39PM -0400, Kent Overstreet wrote:
> sysfs warns if we're removing a symlink from a directory that's no
> longer in sysfs; this is triggered by fstests generic/730, which
> simulates hot removal of a block device.
> 
> This patch is however not a correct fix, since checking
> kobj->state_in_sysfs on a kobj owned by another subsystem is racy.
> 
> A better fix would be to add the appropriate check to
> sysfs_remove_link() - and sysfs_create_link() as well.

The proper fix is to not link to random other subsystems with
object lifetimes you can't know.  I'm not sure why you think adding
this link was ever allowed.

Greg Kroah-Hartman Oct. 14, 2024, 6:34 a.m. UTC | #2
On Mon, Oct 14, 2024 at 08:10:19AM +0200, Christoph Hellwig wrote:
> On Sat, Oct 12, 2024 at 02:42:39PM -0400, Kent Overstreet wrote:
> > sysfs warns if we're removing a symlink from a directory that's no
> > longer in sysfs; this is triggered by fstests generic/730, which
> > simulates hot removal of a block device.
> > 
> > This patch is however not a correct fix, since checking
> > kobj->state_in_sysfs on a kobj owned by another subsystem is racy.
> > 
> > A better fix would be to add the appropriate check to
> > sysfs_remove_link() - and sysfs_create_link() as well.
> 
> The proper fix is to not link to random other subsystems with
> object lifetimes you can't know.  I'm not sure why you think adding
> this link was ever allowed.
> 

Odd, I never got the original patch that was sent here in the first
place...

Anyway, Christoph is right, this patch isn't ok.  You can't link outside
of the subdirectory that you control in sysfs without a whole lot of
special cases and control.  The use of sysfs for filesystems is almost
always broken and tricky and full of race conditions (see many past
threads about this.)  Ideally we would fix this up by offering common
code for filesystems to use for sysfs (like we do for the driver
subsystems), but no one has gotten around to it for various reasons.

The only filesystem that I can see that attempts to do much like what
bcachefs does in sysfs is btrfs, but btrfs only seems to have one
symlink, while you have multiple ones pointing to the same block device.

I can't find any sysfs documentation in Documentation/ABI/ so I don't
really understand what it's attempting to do (and why aren't the tools
that check this screaming about that lack of documentation, that's
odd...)  Any hints as to what you are wishing to show here?

thanks,

greg k-h

Kent Overstreet Oct. 14, 2024, 6:51 a.m. UTC | #3
On Mon, Oct 14, 2024 at 08:34:06AM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 14, 2024 at 08:10:19AM +0200, Christoph Hellwig wrote:
> > On Sat, Oct 12, 2024 at 02:42:39PM -0400, Kent Overstreet wrote:
> > > sysfs warns if we're removing a symlink from a directory that's no
> > > longer in sysfs; this is triggered by fstests generic/730, which
> > > simulates hot removal of a block device.
> > > 
> > > This patch is however not a correct fix, since checking
> > > kobj->state_in_sysfs on a kobj owned by another subsystem is racy.
> > > 
> > > A better fix would be to add the appropriate check to
> > > sysfs_remove_link() - and sysfs_create_link() as well.
> > 
> > The proper fix is to not link to random other subsystems with
> > object lifetimes you can't know.  I'm not sure why you think adding
> > this link was ever allowed.
> > 
> 
> Odd, I never got the original patch that was sent here in the first
> place...
> 
> Anyway, Christoph is right, this patch isn't ok.  You can't link outside
> > of the subdirectory that you control in sysfs without a whole lot of
> special cases and control.  The use of sysfs for filesystems is almost
> always broken and tricky and full of race conditions (see many past
> threads about this.)  Ideally we would fix this up by offering common
> code for filesystems to use for sysfs (like we do for the driver
> subsystems), but no one has gotten around to it for various reasons.

There was already past precedent with the block/holder.c code, and
userspace does depend on that for determining the topology of virtual
block devices.

And that really is what sysfs is for, determining the actual topology
and relationships between various devices - so if there's a relationship
between devices we need to be able to expose that.

I don't know why bcache never used the block/holder.c code (predates it,
perhaps?) - but that code has been carried over basically unchanged, and
we likely still depend on it (I'd have to dig around in tools...).

Re: the safety issues, I don't agree - provided you have a stable
reference to the underlying kobject, which we do, since we have the
block device open. The race is only w.r.t. kobj->state_in_sysfs, and
that could be handled easily within the sysfs/kobject code.
 
> The only filesystem that I can see that attempts to do much like what
> bcachefs does in sysfs is btrfs, but btrfs only seems to have one
> symlink, while you have multiple ones pointing to the same block device.

Not sure where you're seeing that? It's just a single backreference from
the block device to the filesystem object.

> I can't find any sysfs documentation in Documentation/ABI/ so I don't
> really understand what it's attempting to do (and why aren't the tools
> that check this screaming about that lack of documentation, that's
> odd...)  Any hints as to what you are wishing to show here?

Basically, it's the cleanest way (by far) for userspace to look up the
filesystem from the block device: given a path to a block device, stat
it to get the major:minor, then try to open
/sys/dev/block/major:minor/bcachefs/.

The alternative would be scanning through /proc/mounts, which is really
nasty - the format isn't particularly cleanly specified, it's racy, and
with containers systems are getting into the thousands of mounts these
days.
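
The sysfs lookup described above can be sketched in userspace C roughly
as follows. This is only an illustration under stated assumptions: the
helper names bcachefs_sysfs_path() and bdev_has_bcachefs_link() are
hypothetical and not taken from any existing tool, and the result is a
point-in-time snapshot, since the link can appear or disappear right
after the check.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format "/sys/dev/block/<major>:<minor>/bcachefs"
 * for a device number. Returns 0 on success, -1 if buf is too small.
 */
static int bcachefs_sysfs_path(dev_t rdev, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/dev/block/%u:%u/bcachefs",
			 major(rdev), minor(rdev));

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: returns 1 if the block device at dev_path has a
 * "bcachefs" entry in sysfs, 0 if it does not, and -1 if dev_path cannot
 * be stat'ed or is not a block device.
 */
static int bdev_has_bcachefs_link(const char *dev_path)
{
	struct stat st;
	char path[64];

	/* stat the device node to get its major:minor */
	if (stat(dev_path, &st) < 0 || !S_ISBLK(st.st_mode))
		return -1;

	if (bcachefs_sysfs_path(st.st_rdev, path, sizeof(path)) < 0)
		return -1;

	/* the backreference exists only while the device is a member */
	return access(path, F_OK) == 0;
}
```

A caller would pass a device node path such as /dev/sdb1 (an example
path, not one from this thread) and treat a return of 1 as "this device
currently belongs to a bcachefs filesystem".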

Greg Kroah-Hartman Oct. 16, 2024, 7 a.m. UTC | #4
[meta comment, Kent, I'm not getting your emails sent to me at all, they
aren't even showing up in the gmail spam box, so something is really off
on your server such that google is just rejecting them all?]

On Mon, Oct 14, 2024 at 02:51:23AM -0400, Kent Overstreet wrote:
> On Mon, Oct 14, 2024 at 08:34:06AM +0200, Greg Kroah-Hartman wrote:
> > On Mon, Oct 14, 2024 at 08:10:19AM +0200, Christoph Hellwig wrote:
> > > On Sat, Oct 12, 2024 at 02:42:39PM -0400, Kent Overstreet wrote:
> > > > sysfs warns if we're removing a symlink from a directory that's no
> > > > longer in sysfs; this is triggered by fstests generic/730, which
> > > > simulates hot removal of a block device.
> > > > 
> > > > This patch is however not a correct fix, since checking
> > > > kobj->state_in_sysfs on a kobj owned by another subsystem is racy.
> > > > 
> > > > A better fix would be to add the appropriate check to
> > > > sysfs_remove_link() - and sysfs_create_link() as well.
> > > 
> > > The proper fix is to not link to random other subsystems with
> > > object lifetimes you can't know.  I'm not sure why you think adding
> > > this link was ever allowed.
> > > 
> > 
> > Odd, I never got the original patch that was sent here in the first
> > place...
> > 
> > Anyway, Christoph is right, this patch isn't ok.  You can't link outside
> > > of the subdirectory that you control in sysfs without a whole lot of
> > special cases and control.  The use of sysfs for filesystems is almost
> > always broken and tricky and full of race conditions (see many past
> > threads about this.)  Ideally we would fix this up by offering common
> > code for filesystems to use for sysfs (like we do for the driver
> > subsystems), but no one has gotten around to it for various reasons.
> 
> There was already past precedent with the block/holder.c code, and
> userspace does depend on that for determining the topology of virtual
> block devices.

What tools use that?  What sysfs links are being created for it?

And yes, filesystems do poke around in sysfs, but they almost always do
so in a racy way, see this old link for examples of common problems:
	https://lore.kernel.org/all/20230406120716.80980-1-frank.li@vivo.com/#r

> And that really is what sysfs is for, determining the actual topology
> and relationships between various devices - so if there's a relationship
> between devices we need to be able to expose that.

I totally agree, that is what sysfs is for, but at the filesystem layer
you all are having to deal with "raw" kobjects and doing that gets
tricky and is easy to get wrong.

> Re: the safety issues, I don't agree - provided you have a stable
> reference to the underlying kobject, which we do, since we have the
> block device open. The race is only w.r.t. kobj->state_in_sysfs, and
> that could be handled easily within the sysfs/kobject code.

Handled how?

> > The only filesystem that I can see that attempts to do much like what
> > bcachefs does in sysfs is btrfs, but btrfs only seems to have one
> > symlink, while you have multiple ones pointing to the same block device.
> 
> Not sure where you're seeing that? It's just a single backreference from
> the block device to the filesystem object.

I see multiple symlinks being created in the code, I don't know what it
looks like on a running system, sorry.

> > I can't find any sysfs documentation in Documentation/ABI/ so I don't
> > really understand what it's attempting to do (and why aren't the tools
> > that check this screaming about that lack of documentation, that's
> > odd...)  Any hints as to what you are wishing to show here?
> 
> Basically, it's the cleanest way (by far) for userspace to look up the
> filesystem from the block device: given a path to a block device, stat
> it to get the major:minor, then try to open
> /sys/dev/block/major:minor/bcachefs/.

Can you document this properly in Documentation/ABI/ which is where all
sysfs files and symlinks are supposed to be documented?  We have a tool
that you can run at runtime to show all missing documentation entries,
scripts/get_abi.pl

> The alternative would be scanning through /proc/mounts, which is really
> nasty - the format isn't particularly cleanly specified, it's racy, and
> with containers systems are getting into the thousands of mounts these
> days.

How do all other filesystems do this?  Surely we are not relying on
each filesystem to create these symlinks, that's just not going to
work...

thanks,

greg k-h

Kent Overstreet Oct. 16, 2024, 9:49 a.m. UTC | #5
On Wed, Oct 16, 2024 at 09:00:42AM +0200, Greg Kroah-Hartman wrote:
> [meta comment, Kent, I'm not getting your emails sent to me at all, they
> aren't even showing up in the gmail spam box, so something is really off
> on your server such that google is just rejecting them all?]

I'm just using Migadu, if it persists it'll have to be escalated with
both Google and Migadu to get anything done, most likely.

On Wed, Oct 16, 2024 at 09:00:42AM +0200, Greg Kroah-Hartman wrote:
> > There was already past precedent with the block/holder.c code, and
> > userspace does depend on that for determining the topology of virtual
> > block devices.
> 
> What tools use that?  What sysfs links are being created for it?
> 
> And yes, filesystems do poke around in sysfs, but they almost always do
> so in a racy way, see this old link for examples of common problems:
> 	https://lore.kernel.org/all/20230406120716.80980-1-frank.li@vivo.com/#r

That doesn't appear to be at all relevant to this discussion. Most/all
of the major filesystems today do have objects in sysfs under /sys/fs,
which is what that thread was describing, and I know some of those
people are going to take issue if you're calling their code buggy.

> > And that really is what sysfs is for, determining the actual topology
> > and relationships between various devices - so if there's a relationship
> > between devices we need to be able to expose that.
> 
> I totally agree, that is what sysfs is for, but at the filesystem layer
> you all are having to deal with "raw" kobjects and doing that gets
> tricky and is easy to get wrong.

Well, you're the person who created the API.

> > Re: the safety issues, I don't agree - provided you have a stable
> > reference to the underlying kobject, which we do, since we have the
> > block device open. The race is only w.r.t. kobj->state_in_sysfs, and
> > that could be handled easily within the sysfs/kobject code.
> 
> Handled how?

Per-kobject lock, taken by kobject_add() and kobject_del(), to
synchronize kobj->state_in_sysfs and the actual VFS state;
sysfs_create_link() and sysfs_remove_link() take the same lock. It's
not hard...
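
The locking scheme proposed above might look roughly like the following
userspace model. It is a sketch only, not kernel code: struct model_kobj
and the model_*() functions are hypothetical stand-ins for struct
kobject, kobject_add(), kobject_del(), sysfs_create_link() and
sysfs_remove_link(), and a pthread mutex stands in for whatever lock the
kernel would actually use. The model also assumes that deleting the
object drops any links still in its directory.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kobject; not the kernel type. */
struct model_kobj {
	pthread_mutex_t lock;	/* the proposed per-object lock */
	bool state_in_sysfs;	/* mirrors kobj->state_in_sysfs */
	int nr_links;		/* symlinks currently in this directory */
};

static void model_kobj_init(struct model_kobj *k)
{
	pthread_mutex_init(&k->lock, NULL);
	k->state_in_sysfs = false;
	k->nr_links = 0;
}

/* Models kobject_add(): the flag changes only under the lock. */
static void model_kobj_add(struct model_kobj *k)
{
	pthread_mutex_lock(&k->lock);
	k->state_in_sysfs = true;
	pthread_mutex_unlock(&k->lock);
}

/* Models kobject_del(): removal and the flag update are one atomic step. */
static void model_kobj_del(struct model_kobj *k)
{
	pthread_mutex_lock(&k->lock);
	k->state_in_sysfs = false;
	k->nr_links = 0;	/* assumption: removing the dir drops its links */
	pthread_mutex_unlock(&k->lock);
}

/* Models sysfs_create_link(): fails cleanly if the directory is gone. */
static int model_create_link(struct model_kobj *dir)
{
	int ret = 0;

	pthread_mutex_lock(&dir->lock);
	if (!dir->state_in_sysfs)
		ret = -ENOENT;
	else
		dir->nr_links++;
	pthread_mutex_unlock(&dir->lock);
	return ret;
}

/* Models sysfs_remove_link(): no warning path, just an error return. */
static int model_remove_link(struct model_kobj *dir)
{
	int ret = 0;

	pthread_mutex_lock(&dir->lock);
	if (!dir->state_in_sysfs || dir->nr_links == 0)
		ret = -ENOENT;
	else
		dir->nr_links--;
	pthread_mutex_unlock(&dir->lock);
	return ret;
}
```

Because the check of state_in_sysfs and the link operation happen under
the same lock as add/del, a caller from another subsystem no longer
needs to test the flag itself, which is the racy part of the patch.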

> > The alternative would be scanning through /proc/mounts, which is really
> > nasty - the format isn't particularly cleanly specified, it's racy, and
> > with containers systems are getting into the thousands of mounts these
> > days.
> 
> How do all other filesystems do this?  Surely we are not relying on
> each filesystem to create these symlinks, that's just not going to
> work...

Sysfs code is currently in no way standardized across filesystems. I
recently introduced standard vfs-layer ioctls for getting the UUID and
sysfs paths of mounted filesystems, but we're a long ways from any real
standardization.

Patch

diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 843431e58cf5..f96355ecb296 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -184,6 +184,7 @@  static DEFINE_MUTEX(bch_fs_list_lock);
 
 DECLARE_WAIT_QUEUE_HEAD(bch2_read_only_wait);
 
+static void bch2_dev_unlink(struct bch_dev *);
 static void bch2_dev_free(struct bch_dev *);
 static int bch2_dev_alloc(struct bch_fs *, unsigned);
 static int bch2_dev_sysfs_online(struct bch_fs *, struct bch_dev *);
@@ -620,9 +621,7 @@  void __bch2_fs_stop(struct bch_fs *c)
 	up_write(&c->state_lock);
 
 	for_each_member_device(c, ca)
-		if (ca->kobj.state_in_sysfs &&
-		    ca->disk_sb.bdev)
-			sysfs_remove_link(bdev_kobj(ca->disk_sb.bdev), "bcachefs");
+		bch2_dev_unlink(ca);
 
 	if (c->kobj.state_in_sysfs)
 		kobject_del(&c->kobj);
@@ -1188,9 +1187,7 @@  static void bch2_dev_free(struct bch_dev *ca)
 {
 	cancel_work_sync(&ca->io_error_work);
 
-	if (ca->kobj.state_in_sysfs &&
-	    ca->disk_sb.bdev)
-		sysfs_remove_link(bdev_kobj(ca->disk_sb.bdev), "bcachefs");
+	bch2_dev_unlink(ca);
 
 	if (ca->kobj.state_in_sysfs)
 		kobject_del(&ca->kobj);
@@ -1227,10 +1224,7 @@  static void __bch2_dev_offline(struct bch_fs *c, struct bch_dev *ca)
 	percpu_ref_kill(&ca->io_ref);
 	wait_for_completion(&ca->io_ref_completion);
 
-	if (ca->kobj.state_in_sysfs) {
-		sysfs_remove_link(bdev_kobj(ca->disk_sb.bdev), "bcachefs");
-		sysfs_remove_link(&ca->kobj, "block");
-	}
+	bch2_dev_unlink(ca);
 
 	bch2_free_super(&ca->disk_sb);
 	bch2_dev_journal_exit(ca);
@@ -1252,6 +1246,26 @@  static void bch2_dev_io_ref_complete(struct percpu_ref *ref)
 	complete(&ca->io_ref_completion);
 }
 
+static void bch2_dev_unlink(struct bch_dev *ca)
+{
+	struct kobject *b;
+
+	/*
+	 * This is racy w.r.t. the underlying block device being hot-removed,
+	 * which removes it from sysfs.
+	 *
+	 * It'd be lovely if we had a way to handle this race, but the sysfs
+	 * code doesn't appear to provide a good method and block/holder.c is
+	 * susceptible as well:
+	 */
+	if (ca->kobj.state_in_sysfs &&
+	    ca->disk_sb.bdev &&
+	    (b = bdev_kobj(ca->disk_sb.bdev))->state_in_sysfs) {
+		sysfs_delete_link(b, &ca->kobj, "bcachefs");
+		sysfs_delete_link(&ca->kobj, b, "block");
+	}
+}
+
 static int bch2_dev_sysfs_online(struct bch_fs *c, struct bch_dev *ca)
 {
 	int ret;