[0/2] btrfs-progs: forget removed devices

Message ID cover.1709231441.git.boris@bur.io (mailing list archive)

Message

Boris Burkov Feb. 29, 2024, 6:36 p.m. UTC
To fix bugs in multi-dev filesystems not handling a devt changing under
them when a device gets destroyed and re-created with a different devt,
we need to forget devices as they get removed.

Modify scan -u to take advantage of the kernel taking unvalidated block
dev names and modify udev to invoke this scan -u on device remove.

Boris Burkov (2):
  btrfs-progs: allow btrfs device scan -u on dead dev
  btrfs-progs: add udev rule to forget removed device

 64-btrfs-rm.rules | 7 +++++++
 Makefile          | 2 +-
 cmds/device.c     | 2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)
 create mode 100644 64-btrfs-rm.rules
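
For illustration, a remove-event rule of that shape could look roughly like
the sketch below. This is hypothetical, not the actual 64-btrfs-rm.rules
added by patch 2 (its contents are not reproduced here); the btrfs binary
path and the ID_FS_TYPE property match are assumptions.

  # Sketch only: on block device removal, ask btrfs-progs to drop the
  # stale entry for that device node from the kernel's device cache.
  ACTION=="remove", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", RUN+="/usr/bin/btrfs device scan -u $env{DEVNAME}"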

Comments

Anand Jain March 1, 2024, 2:31 a.m. UTC | #1
On 3/1/24 00:06, Boris Burkov wrote:
> To fix bugs in multi-dev filesystems not handling a devt changing under
> them when a device gets destroyed and re-created with a different devt,
> we need to forget devices as they get removed.
> 
> Modify scan -u to take advantage of the kernel taking unvalidated block
> dev names and modify udev to invoke this scan -u on device remove.
> 

Unless we have a specific bug still present after the patch
"[PATCH] btrfs: validate device maj:min during open," can we
hold off on using the external udev trigger to clean up stale
devices in the btrfs kernel?

IMO, these loopholes have to be fixed in the kernel regardless.

Thanks, Anand

> Boris Burkov (2):
>    btrfs-progs: allow btrfs device scan -u on dead dev
>    btrfs-progs: add udev rule to forget removed device
> 
>   64-btrfs-rm.rules | 7 +++++++
>   Makefile          | 2 +-
>   cmds/device.c     | 2 +-
>   3 files changed, 9 insertions(+), 2 deletions(-)
>   create mode 100644 64-btrfs-rm.rules
>
David Sterba March 1, 2024, 11:54 a.m. UTC | #2
On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> On 3/1/24 00:06, Boris Burkov wrote:
> > To fix bugs in multi-dev filesystems not handling a devt changing under
> > them when a device gets destroyed and re-created with a different devt,
> > we need to forget devices as they get removed.
> > 
> > Modify scan -u to take advantage of the kernel taking unvalidated block
> > dev names and modify udev to invoke this scan -u on device remove.
> > 
> 
> Unless we have a specific bug still present after the patch
> "[PATCH] btrfs: validate device maj:min during open," can we
> hold off on using the external udev trigger to clean up stale
> devices in the btrfs kernel?
> 
> IMO, these loopholes have to be fixed in the kernel regardless.

Agreed, spreading the solution to user space and depending on async
events would only make things harder to debug in the end.
Boris Burkov March 1, 2024, 3:44 p.m. UTC | #3
On Fri, Mar 01, 2024 at 12:54:42PM +0100, David Sterba wrote:
> On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> > On 3/1/24 00:06, Boris Burkov wrote:
> > > To fix bugs in multi-dev filesystems not handling a devt changing under
> > > them when a device gets destroyed and re-created with a different devt,
> > > we need to forget devices as they get removed.
> > > 
> > > Modify scan -u to take advantage of the kernel taking unvalidated block
> > > dev names and modify udev to invoke this scan -u on device remove.
> > > 
> > 
> > Unless we have a specific bug still present after the patch
> > "[PATCH] btrfs: validate device maj:min during open," can we
> > hold off on using the external udev trigger to clean up stale
> > devices in the btrfs kernel?
> > 
> > IMO, these loopholes have to be fixed in the kernel regardless.
> 
> Agreed, spreading the solution to user space and depending on async
> events would only make things harder to debug in the end.

In my opinion, leaving the incorrect cache around after a device is
destroyed is a footgun and a recipe for future error. Patching it up
with Anand's kernel-only fix will fix the problem for now, but I would
not be at all surprised if there is a different, more serious problem
waiting for us down the road. We're just looking at exactly devt and
temp_fsid, and I'm quite focused on breaking single device filesystems.
I spent a week or two hacking on that reproducer. Are we *confident* that
I couldn't do it again if I tried with all kinds of raid arrays, error
injection, races, etc? What about after we add three more features to
volume management and it's been two years?

Getting a notification on device destruction and removing our state for
that device is the correct fix. If things like this need to be
in-kernel, then why is udev the only mechanism for this kind of
communication? Can we get a callback from the block layer?

Boris
David Sterba March 4, 2024, 6:07 p.m. UTC | #4
On Fri, Mar 01, 2024 at 07:44:44AM -0800, Boris Burkov wrote:
> On Fri, Mar 01, 2024 at 12:54:42PM +0100, David Sterba wrote:
> > On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> > > On 3/1/24 00:06, Boris Burkov wrote:
> > > > To fix bugs in multi-dev filesystems not handling a devt changing under
> > > > them when a device gets destroyed and re-created with a different devt,
> > > > we need to forget devices as they get removed.
> > > > 
> > > > Modify scan -u to take advantage of the kernel taking unvalidated block
> > > > dev names and modify udev to invoke this scan -u on device remove.
> > > > 
> > > 
> > > Unless we have a specific bug still present after the patch
> > > "[PATCH] btrfs: validate device maj:min during open," can we
> > > hold off on using the external udev trigger to clean up stale
> > > devices in the btrfs kernel?
> > > 
> > > IMO, these loopholes have to be fixed in the kernel regardless.
> > 
> > Agreed, spreading the solution to user space and depending on async
> > events would only make things harder to debug in the end.
> 
> In my opinion, leaving the incorrect cache around after a device is
> destroyed is a footgun and a recipe for future error.

I partially agree here: leaving stale cached entries can lead to a bad
outcome, but including user space as part of the solution again makes
it fragile and less reliable. We're not targeting the default and common
use case, which would work with or without the fixes or udev rules.

The udev event is inherently racy, so if something tries to mount before
the remove udev rule runs, nothing changes other than that we now have
yet another factor to take into account when debugging.

> Patching it up
> with Anand's kernel-only fix will fix the problem for now,

This is the general stance on kernel vs. user space: we can't rely on
state correctness in user space, so the kernel can take hints (like
device scanning) but refuses to mount if the devices are not present.

> but I would
> not be at all surprised if there is a different, more serious problem
> waiting for us down the road. We're just looking at exactly devt and
> temp_fsid, and I'm quite focused on breaking single device filesystems.
> I spent a week or two hacking on that reproducer. Are we *confident* that
> I couldn't do it again if I tried with all kinds of raid arrays, error
> injection, races, etc? What about after we add three more features to
> volume management and it's been two years?

We can only add more safety checks or enforce some features using flags
or options. Adding features to a layer means we need to revisit how they
interact with the current state, something that we did with temp-fsid,
and now we're chasing out the bugs. We did not have good test coverage
initially, so this is more than what could be expected from post-release
validation.

> Getting a notification on device destruction and removing our state for
> that device is the correct fix. If things like this need to be
> in-kernel, then why is udev the only mechanism for this kind of
> communication? Can we get a callback from the block layer?

I don't know; it sounds simple, but such things can get tricky to get
right once they span more layers and state changes. I've heard too many
horror stories from CPU or memory hotplug, and the unexpected changes
in block devices sound too similar.

The approach we have is to store some state and validate it when needed,
but we apparently lack enough information sources to make the validation
100% reliable. I think we gained a lot of knowledge just from debugging
that and prototyping the counterexamples, but I'm observing that we may
not have the means for fixes.
Boris Burkov March 4, 2024, 9:27 p.m. UTC | #5
On Mon, Mar 04, 2024 at 07:07:36PM +0100, David Sterba wrote:
> On Fri, Mar 01, 2024 at 07:44:44AM -0800, Boris Burkov wrote:
> > On Fri, Mar 01, 2024 at 12:54:42PM +0100, David Sterba wrote:
> > > On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> > > > On 3/1/24 00:06, Boris Burkov wrote:
> > > > > To fix bugs in multi-dev filesystems not handling a devt changing under
> > > > > them when a device gets destroyed and re-created with a different devt,
> > > > > we need to forget devices as they get removed.
> > > > > 
> > > > > Modify scan -u to take advantage of the kernel taking unvalidated block
> > > > > dev names and modify udev to invoke this scan -u on device remove.
> > > > > 
> > > > 
> > > > Unless we have a specific bug still present after the patch
> > > > "[PATCH] btrfs: validate device maj:min during open," can we
> > > > hold off on using the external udev trigger to clean up stale
> > > > devices in the btrfs kernel?
> > > > 
> > > > IMO, these loopholes have to be fixed in the kernel regardless.
> > > 
> > > Agreed, spreading the solution to user space and depending on async
> > > events would only make things harder to debug in the end.
> > 
> > In my opinion, leaving the incorrect cache around after a device is
> > destroyed is a footgun and a recipe for future error.
> 
> I partially agree here: leaving stale cached entries can lead to a bad
> outcome, but including user space as part of the solution again makes
> it fragile and less reliable. We're not targeting the default and common
> use case, which would work with or without the fixes or udev rules.
> 
> The udev event is inherently racy, so if something tries to mount before
> the remove udev rule runs, nothing changes other than that we now have
> yet another factor to take into account when debugging.

That's a good point. I do agree that this is a race we need a solution
for, and it's a good argument for an in-kernel fix.

> 
> > Patching it up
> > with Anand's kernel-only fix will fix the problem for now,
> 
> This is the general stance on kernel vs. user space: we can't rely on
> state correctness in user space, so the kernel can take hints (like
> device scanning) but refuses to mount if the devices are not present.
> 
> > but I would
> > not be at all surprised if there is a different, more serious problem
> > waiting for us down the road. We're just looking at exactly devt and
> > temp_fsid, and I'm quite focused on breaking single device filesystems.
> > I spent a week or two hacking on that reproducer. Are we *confident* that
> > I couldn't do it again if I tried with all kinds of raid arrays, error
> > injection, races, etc? What about after we add three more features to
> > volume management and it's been two years?
> 
> We can only add more safety checks or enforce some features using flags
> or options. Adding features to a layer means we need to revisit how they
> interact with the current state, something that we did with temp-fsid,
> and now we're chasing out the bugs. We did not have good test coverage
> initially, so this is more than what could be expected from post-release
> validation.

I think it's fine for new features to hit unexpected bugs; that's pretty
much normal. All I'm saying is that putting bandaids on bandaids is what
leads to a much higher likelihood of bugs any time something changes.

> 
> > Getting a notification on device destruction and removing our state for
> > that device is the correct fix. If things like this need to be
> > in-kernel, then why is udev the only mechanism for this kind of
> > communication? Can we get a callback from the block layer?
> 
> I don't know; it sounds simple, but such things can get tricky to get
> right once they span more layers and state changes. I've heard too many
> horror stories from CPU or memory hotplug, and the unexpected changes
> in block devices sound too similar.
> 
> The approach we have is to store some state and validate it when needed,
> but we apparently lack enough information sources to make the validation
> 100% reliable. I think we gained a lot of knowledge just from debugging
> that and prototyping the counterexamples, but I'm observing that we may
> not have the means for fixes.

Agreed with everything you've said. I naively feel like a notification
model can work, but it does have a kind of spooky air to it for the
reasons you gave.

An idea that popped out of our group at Meta discussing this bug was to
use udev to manage state used by a btrfs-specific version of mount. That
btrfs-mount would handle all the "scan before mount" considerations we
currently have.

Ideally, it means we can get rid of the in-kernel cache entirely, so we
aren't afraid of races with user space. Food for thought, at least.

Since a solution like that would be a pretty serious undertaking, I
think we probably do need a more pragmatic short-term fix like Anand's.
What about refactoring it a bit to just trust the cache a lot less? It
would essentially be the same as Anand's patch, just without special
cases for mismatches. Basically, something like "create a new
btrfs_device no matter what".