Message ID | cover.1709231441.git.boris@bur.io (mailing list archive)
---|---
Series | btrfs-progs: forget removed devices
On 3/1/24 00:06, Boris Burkov wrote:
> To fix bugs in multi-dev filesystems not handling a devt changing under
> them when a device gets destroyed and re-created with a different devt,
> we need to forget devices as they get removed.
>
> Modify scan -u to take advantage of the kernel taking unvalidated block
> dev names and modify udev to invoke this scan -u on device remove.

Unless we have a specific bug still present after the patch
"[PATCH] btrfs: validate device maj:min during open," can we
hold off on using the external udev trigger to clean up stale
devices in the btrfs kernel?

IMO, these loopholes have to be fixed in the kernel regardless.

Thanks, Anand

> Boris Burkov (2):
>   btrfs-progs: allow btrfs device scan -u on dead dev
>   btrfs-progs: add udev rule to forget removed device
>
>  64-btrfs-rm.rules | 7 +++++++
>  Makefile          | 2 +-
>  cmds/device.c     | 2 +-
>  3 files changed, 9 insertions(+), 2 deletions(-)
>  create mode 100644 64-btrfs-rm.rules
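The thread never shows the contents of 64-btrfs-rm.rules, only its 7-line diffstat, but the mechanism under discussion is easy to picture. Below is a minimal sketch of a "forget on remove" rule, assuming the usual blkid-provided ID_FS_TYPE property and a /usr/bin/btrfs install path; both are assumptions, not the rule shipped by the series:

    # Sketch only, not the 64-btrfs-rm.rules from this series.
    # On removal of a block device that carried a btrfs filesystem, ask the
    # kernel to drop its cached entry for that device node.
    ACTION=="remove", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", \
      RUN+="/usr/bin/btrfs device scan -u $env{DEVNAME}"

Whether properties such as ID_FS_TYPE are still visible on a remove event depends on what was recorded in the udev database at add time, which is one reason the real rule may match differently.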
On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> On 3/1/24 00:06, Boris Burkov wrote:
> > To fix bugs in multi-dev filesystems not handling a devt changing under
> > them when a device gets destroyed and re-created with a different devt,
> > we need to forget devices as they get removed.
> >
> > Modify scan -u to take advantage of the kernel taking unvalidated block
> > dev names and modify udev to invoke this scan -u on device remove.
>
> Unless we have a specific bug still present after the patch
> "[PATCH] btrfs: validate device maj:min during open," can we
> hold off on using the external udev trigger to clean up stale
> devices in the btrfs kernel?
>
> IMO, these loopholes have to be fixed in the kernel regardless.

Agreed, spreading the solution to user space and depending on async
events would only make things harder to debug in the end.
On Fri, Mar 01, 2024 at 12:54:42PM +0100, David Sterba wrote:
> On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> > On 3/1/24 00:06, Boris Burkov wrote:
> > > To fix bugs in multi-dev filesystems not handling a devt changing under
> > > them when a device gets destroyed and re-created with a different devt,
> > > we need to forget devices as they get removed.
> > >
> > > Modify scan -u to take advantage of the kernel taking unvalidated block
> > > dev names and modify udev to invoke this scan -u on device remove.
> >
> > Unless we have a specific bug still present after the patch
> > "[PATCH] btrfs: validate device maj:min during open," can we
> > hold off on using the external udev trigger to clean up stale
> > devices in the btrfs kernel?
> >
> > IMO, these loopholes have to be fixed in the kernel regardless.
>
> Agreed, spreading the solution to user space and depending on async
> events would only make things harder to debug in the end.

In my opinion, leaving the incorrect cache around after a device is
destroyed is a footgun and a recipe for future error. Patching it up
with Anand's kernel-only fix will fix the problem for now, but I would
not be at all surprised if there is a different, more serious problem
waiting for us down the road. We're just looking at devt and temp_fsid,
and I'm quite focused on breaking single device filesystems. I spent a
week or two hacking on that reproducer. Are we *confident* that I
couldn't do it again if I tried with all kinds of raid arrays, error
injection, races, etc.? What about after we add three more features to
volume management and it's been two years?

Getting a notification on device destruction and removing our state for
that device is the correct fix. If things like this need to be
in-kernel, then why is udev the only mechanism for this kind of
communication? Can we get a callback from the block layer?

Boris
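For readers unfamiliar with what `btrfs device scan -u` (--forget) amounts to at the kernel boundary: btrfs-progs sends a forget request through the BTRFS_IOC_FORGET_DEV ioctl on /dev/btrfs-control. A minimal standalone sketch follows; it is not the actual cmds/device.c code, and error handling is trimmed for brevity:

    /* Sketch of "btrfs device scan -u <path>": ask the btrfs module to drop
     * its cached entry for one device, or for all unmounted devices when
     * the path is empty. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/btrfs.h>  /* BTRFS_IOC_FORGET_DEV, struct btrfs_ioctl_vol_args */

    int forget_device(const char *path)
    {
        struct btrfs_ioctl_vol_args args = { 0 };
        int fd, ret;

        if (path)
            strncpy(args.name, path, BTRFS_PATH_NAME_MAX);

        fd = open("/dev/btrfs-control", O_RDWR);
        if (fd < 0) {
            perror("open /dev/btrfs-control");
            return -1;
        }
        ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args);
        if (ret < 0)
            perror("BTRFS_IOC_FORGET_DEV");
        close(fd);
        return ret;
    }

Since the whole request is a path string handed to an ioctl, the userspace side does not have to validate the path itself before handing it over; that is what the cover letter refers to with the kernel taking unvalidated block dev names, and what the first patch ("allow btrfs device scan -u on dead dev") relies on.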
On Fri, Mar 01, 2024 at 07:44:44AM -0800, Boris Burkov wrote:
> On Fri, Mar 01, 2024 at 12:54:42PM +0100, David Sterba wrote:
> > On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> > > On 3/1/24 00:06, Boris Burkov wrote:
> > > > To fix bugs in multi-dev filesystems not handling a devt changing under
> > > > them when a device gets destroyed and re-created with a different devt,
> > > > we need to forget devices as they get removed.
> > > >
> > > > Modify scan -u to take advantage of the kernel taking unvalidated block
> > > > dev names and modify udev to invoke this scan -u on device remove.
> > >
> > > Unless we have a specific bug still present after the patch
> > > "[PATCH] btrfs: validate device maj:min during open," can we
> > > hold off on using the external udev trigger to clean up stale
> > > devices in the btrfs kernel?
> > >
> > > IMO, these loopholes have to be fixed in the kernel regardless.
> >
> > Agreed, spreading the solution to user space and depending on async
> > events would only make things harder to debug in the end.
>
> In my opinion, leaving the incorrect cache around after a device is
> destroyed is a footgun and a recipe for future error.

I agree here partially: leaving stale cached entries can lead to bad
outcomes, but making userspace part of the solution again makes it
fragile and less reliable. We're not targeting the default and common
use case, which would work with or without the fixes or udev rules.

The udev event is inherently racy, so if something tries to mount before
the remove udev rule runs, nothing changes other than that we now have
yet another factor to take into account when debugging.

> Patching it up with Anand's kernel-only fix will fix the problem for
> now,

This is a general stance of kernel vs. user space: we can't rely on
state correctness in userspace, so the kernel can take hints (like
device scanning) but should refuse to mount if the devices are not
present.

> but I would not be at all surprised if there is a different, more
> serious problem waiting for us down the road. We're just looking at
> devt and temp_fsid, and I'm quite focused on breaking single device
> filesystems. I spent a week or two hacking on that reproducer. Are we
> *confident* that I couldn't do it again if I tried with all kinds of
> raid arrays, error injection, races, etc.? What about after we add
> three more features to volume management and it's been two years?

We can only add more safety checks or enforce some features using flags
or options. Adding features to a layer means we need to revisit how it
interacts with the current state, something we did with temp-fsid, and
now we're chasing out the bugs. We did not have good test coverage
initially, so this is more than what could be expected from
post-release validation.

> Getting a notification on device destruction and removing our state for
> that device is the correct fix. If things like this need to be
> in-kernel, then why is udev the only mechanism for this kind of
> communication? Can we get a callback from the block layer?

I don't know; it sounds simple, but such things can get tricky to get
right once they span more layers and state changes. I've heard too many
horror stories from CPU or memory hotplug, and the unexpected changes in
block devices sound too similar.

The approach we have is to store some state and validate it when needed,
but we apparently lack enough information sources to make the validation
100% reliable. I think we gained a lot of knowledge just from debugging
that and prototyping the counter examples, but I'm observing we may not
have the means for fixes.
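For reference, the state being cached and re-validated here is essentially the maj:min (dev_t) behind a device node, which can silently change when the node is destroyed and re-created (a loop device torn down and set up again, for instance). A tiny illustrative program, not part of the series, for watching that value while reproducing such a scenario:

    /* Print the maj:min of a device node. Running this before and after
     * destroying/re-creating the node shows the devt change that a stale
     * cache entry fails to notice. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    int main(int argc, char **argv)
    {
        struct stat st;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <device node>\n", argv[0]);
            return 1;
        }
        if (stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("%s -> %u:%u\n", argv[1], major(st.st_rdev), minor(st.st_rdev));
        return 0;
    }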
On Mon, Mar 04, 2024 at 07:07:36PM +0100, David Sterba wrote:
> On Fri, Mar 01, 2024 at 07:44:44AM -0800, Boris Burkov wrote:
> > On Fri, Mar 01, 2024 at 12:54:42PM +0100, David Sterba wrote:
> > > On Fri, Mar 01, 2024 at 08:01:44AM +0530, Anand Jain wrote:
> > > > On 3/1/24 00:06, Boris Burkov wrote:
> > > > > To fix bugs in multi-dev filesystems not handling a devt changing under
> > > > > them when a device gets destroyed and re-created with a different devt,
> > > > > we need to forget devices as they get removed.
> > > > >
> > > > > Modify scan -u to take advantage of the kernel taking unvalidated block
> > > > > dev names and modify udev to invoke this scan -u on device remove.
> > > >
> > > > Unless we have a specific bug still present after the patch
> > > > "[PATCH] btrfs: validate device maj:min during open," can we
> > > > hold off on using the external udev trigger to clean up stale
> > > > devices in the btrfs kernel?
> > > >
> > > > IMO, these loopholes have to be fixed in the kernel regardless.
> > >
> > > Agreed, spreading the solution to user space and depending on async
> > > events would only make things harder to debug in the end.
> >
> > In my opinion, leaving the incorrect cache around after a device is
> > destroyed is a footgun and a recipe for future error.
>
> I agree here partially: leaving stale cached entries can lead to bad
> outcomes, but making userspace part of the solution again makes it
> fragile and less reliable. We're not targeting the default and common
> use case, which would work with or without the fixes or udev rules.
>
> The udev event is inherently racy, so if something tries to mount before
> the remove udev rule runs, nothing changes other than that we now have
> yet another factor to take into account when debugging.

That's a good point. I do agree that this is a race we need a solution
for, and it is a good argument for an in-kernel fix.

> > Patching it up with Anand's kernel-only fix will fix the problem for
> > now,
>
> This is a general stance of kernel vs. user space: we can't rely on
> state correctness in userspace, so the kernel can take hints (like
> device scanning) but should refuse to mount if the devices are not
> present.
>
> > but I would not be at all surprised if there is a different, more
> > serious problem waiting for us down the road. We're just looking at
> > devt and temp_fsid, and I'm quite focused on breaking single device
> > filesystems. I spent a week or two hacking on that reproducer. Are we
> > *confident* that I couldn't do it again if I tried with all kinds of
> > raid arrays, error injection, races, etc.? What about after we add
> > three more features to volume management and it's been two years?
>
> We can only add more safety checks or enforce some features using flags
> or options. Adding features to a layer means we need to revisit how it
> interacts with the current state, something we did with temp-fsid, and
> now we're chasing out the bugs. We did not have good test coverage
> initially, so this is more than what could be expected from
> post-release validation.

I think it's fine for new features to hit unexpected bugs, that's pretty
much normal. All I'm saying is that putting bandaids on bandaids is what
leads to a much higher likelihood of bugs any time something changes.

> > Getting a notification on device destruction and removing our state
> > for that device is the correct fix. If things like this need to be
> > in-kernel, then why is udev the only mechanism for this kind of
> > communication? Can we get a callback from the block layer?
>
> I don't know; it sounds simple, but such things can get tricky to get
> right once they span more layers and state changes. I've heard too many
> horror stories from CPU or memory hotplug, and the unexpected changes
> in block devices sound too similar.
>
> The approach we have is to store some state and validate it when
> needed, but we apparently lack enough information sources to make the
> validation 100% reliable. I think we gained a lot of knowledge just
> from debugging that and prototyping the counter examples, but I'm
> observing we may not have the means for fixes.

Agreed with everything you've said. I naively feel like a notification
model can work, but it does have a kind of spooky air to it for the
reasons you gave.

An idea that popped out of our group at Meta discussing this bug was to
use udev to manage state used by a btrfs specific version of mount. That
btrfs-mount would handle all the "scan before mount" considerations we
currently have. Ideally, it means we can get rid of the in-kernel cache
entirely, so we aren't afraid of races with userspace. Food for thought,
at least.

Since a solution like that would be a pretty serious undertaking, I
think we probably do need a more pragmatic short term fix like Anand's.
What about refactoring it a bit to just trust the cache a lot less? It
would essentially be the same as Anand's patch, just without special
cases for mismatches. Basically, something like "create a new
btrfs_device no matter what".
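To make that last suggestion concrete, here is a rough pseudo-C sketch of the "create a new btrfs_device no matter what" idea. Every name in it (the helpers, the scanned_super struct) is hypothetical and only stands in for the real fs/btrfs device-scan path; the point is that any identity mismatch drops the cached entry instead of being handled by a special case:

    /* Hypothetical sketch, not actual fs/btrfs code: on scan, distrust any
     * cached entry whose identity no longer matches what was just read. */
    struct btrfs_device *sketch_device_list_add(struct scanned_super *scan)
    {
        struct btrfs_device *dev;

        dev = cache_lookup_by_uuid(scan->fs_devices, scan->dev_uuid);
        if (dev && (dev->devt != scan->devt ||
                    strcmp(dev->path, scan->path) != 0)) {
            /* Stale: the node was destroyed/re-created underneath us. */
            cache_drop_device(dev);
            dev = NULL;
        }
        if (!dev)
            dev = cache_add_new_device(scan);   /* "no matter what" */
        return dev;
    }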