Message ID | f40e347d5a4b4b28201b1a088d38a3c75dd10ebd.1709251328.git.boris@bur.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] btrfs/311: new test for devt change between mounts | expand |
On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@bur.io> wrote: > > It is possible to confuse the btrfs device cache (fs_devices) by > starting with a multi-device filesystem, then removing and re-adding a > device in a way which changes its dev_t while the filesystem is > unmounted. After this procedure, if we remount, then we are in a funny > state where struct btrfs_device's "devt" field does not match the bd_dev > of the "bdev" field. I would say this is bad enough, as we have violated > a pretty clear invariant. > > But for style points, we can then remove the extra device from the fs, > making it a single device fs, which enables the "temp_fsid" feature, > which permits multiple separate mounts of different devices with the > same fsid. Since btrfs is confused and *thinks* there are different > devices (based on device->devt), it allows a second redundant mount of > the same device (not a bind mount!). This then allows us to corrupt the > original mount by doing stuff to the one that should be a bind mount. > > This is fixed by the combination of the kernel patch: > btrfs: support device name lookup in forget > and the btrfs-progs patches: > btrfs-progs: allow btrfs device scan -u on dead dev > btrfs-progs: add udev rule to forget removed device May I suggest to make this more readable, easier to the eye? My inserting blank lines and indenting the lines with the patch subjects, like for example: """ This is fixed by the combination of the kernel patch: btrfs: support device name lookup in forget and the btrfs-progs patches: btrfs-progs: allow btrfs device scan -u on dead dev btrfs-progs: add udev rule to forget removed device """ And these should be placed in the test case itself with: _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow btrfs device scan -u on dead dev" For btrfs-progs commits, and for the kernel: _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name lookup in forget" I see however that there's discussion for those patches between you, Anand and David, and it seems there's a chance those patches won't be merged to fix this bug, especially the kernel one for which Anand submitted an alternative. If those are added to the test case itself, can always be updated later if needed. Also avoid putting the test number in the subject - there's no guarantee it will end up with that number when merged. > > Signed-off-by: Boris Burkov <boris@bur.io> > --- > common/config | 1 + > common/rc | 4 ++ > tests/btrfs/311 | 101 ++++++++++++++++++++++++++++++++++++++++++++ > tests/btrfs/311.out | 2 + > 4 files changed, 108 insertions(+) > create mode 100755 tests/btrfs/311 > create mode 100644 tests/btrfs/311.out > > diff --git a/common/config b/common/config > index a3b15b96f..43b517fda 100644 > --- a/common/config > +++ b/common/config > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)" > export GZIP_PROG="$(type -P gzip)" > export BTRFS_IMAGE_PROG="$(type -P btrfs-image)" > export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical) > +export PARTED_PROG="$(type -P parted)" > > # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled. > # newer systems have udevadm command but older systems like RHEL5 don't. > diff --git a/common/rc b/common/rc > index 30c44dddd..8e009aca9 100644 > --- a/common/rc > +++ b/common/rc > @@ -5375,6 +5375,10 @@ _require_unshare() { > _notrun "unshare $*: command not found, should be in util-linux" > } > > +_require_parted() { These three last functions from common/rc use that style, the { after the function name in the same line, but everywhere else the { is on a new line, and that's the most common style in fstests. I wish we had some consistency. > + $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found" Why not just call: _require_command "$PARTED_PROG" parted Could even do that in the test and no need for a common function in this file, as we do in many other tests (grep for "'_require_command'" in tests/generic for example). > +} > + > # Return a random file in a directory. A directory is *not* followed > # recursively. > _random_file() { > diff --git a/tests/btrfs/311 b/tests/btrfs/311 > new file mode 100755 > index 000000000..887c46ba0 > --- /dev/null > +++ b/tests/btrfs/311 > @@ -0,0 +1,101 @@ > +#! /bin/bash > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved. > +# > +# FS QA Test 311 > +# > +# Test an edge case of multi device volume management in btrfs. > +# If a device changes devt between mounts of a multi device fs, we can trick > +# btrfs into mounting the same device twice fully (not as a bind mount). From > +# there, it is trivial to induce corruption. > +# > +. ./common/preamble > +_begin_fstest auto quick volume Missing 'scrub' group. > + > +# real QA test starts here > +_supported_fs btrfs > +_require_test > +_require_parted > + > +_cleanup() { > + cd / > + umount $MNT > + umount $BIND > + losetup -d $DEV0 > + losetup -d $DEV1 > + losetup -d $DEV2 > + rm $IMG0 > + rm $IMG1 > + rm $IMG2 > +} > + > +IMG0=$TEST_DIR/$$.img0 > +IMG1=$TEST_DIR/$$.img1 > +IMG2=$TEST_DIR/$$.img2 > +truncate -s 1G $IMG0 > +truncate -s 1G $IMG1 > +truncate -s 1G $IMG2 > +DEV0=$(losetup -f $IMG0 --show) > +DEV1=$(losetup -f $IMG1 --show) > +DEV2=$(losetup -f $IMG2 --show) > +D0P1=$DEV0"p1" > +D1P1=$DEV1"p1" > +MNT=$TEST_DIR/mnt > +BIND=$TEST_DIR/bind > + > +# Setup partition table with one partition on each device. > +$PARTED_PROG $DEV0 'mktable gpt' --script > +$PARTED_PROG $DEV1 'mktable gpt' --script > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > + > +# mkfs with two devices to avoid clearing devices on close > +# single raid to allow removing DEV2. > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null Please redirect to the $seqres.full instead, and _fail if mkfs fails. That's what we do nowadays due to unpleasant surprises in the past. > + > +# Cycle mount the two device fs to populate both devices into the > +# stale device cache. > +mkdir -p $MNT > +mount $D0P1 $MNT Use the _mount() helper. > +umount $MNT We use $UMOUNT_PROG in fstests. > + > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date. > +$PARTED_PROG $DEV0 'rm 1' --script > +$PARTED_PROG $DEV1 'rm 1' --script > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > + > +# Mount with mismatched dev_t! > +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point" Same here. > + > +# Remove the extra device to bring temp-fsid back in the fray. > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT > + > +# Create the would be bind mount. > +mkdir -p $BIND > +mount $D0P1 $BIND Same here. > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT) > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND) > +# If they're different, we are in trouble. > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show" > + > +# Now really prove it by corrupting the first mount with the second. > +for i in $(seq 20); do > + $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1 This is overriding the .full file, use >> > +done > +for i in $(seq 20); do > + $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1 Same here, this is overriding the .full file, use >> > +done > +sync Can we please have a comment mentioning why the sync is needed? > +find $BIND -type f -delete > +sync Same here. > + > +# This should blow up both mounts, if the writes somehow didn't overlap at all. > +$FSTRIM_PROG $BIND Since it's using the fstrim program and needs trim to be supported, the test misses a: _require_batched_discard "$TEST_DIR" at the top. > +echo 3 > /proc/sys/vm/drop_caches Please add a comment mentioning why we are dropping caches. > +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1 The test passes whether scrub fails or succeeds, as it's redirecting stdout and stderr to the .full file and ignoring the exit status of the command. The ideal fstests way is to put the expected output in the golden output file and just call the command without redirecting anything. If that's not doable for some reason, at the very least do a (...) || echo "Scrub failed, check $seqres.full for details." Thanks. > + > +# success, all done > +echo "Silence is golden" > +status=0 > +exit > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out > new file mode 100644 > index 000000000..62f253029 > --- /dev/null > +++ b/tests/btrfs/311.out > @@ -0,0 +1,2 @@ > +QA output created by 311 > +Silence is golden > -- > 2.43.0 > >
On Tue, Mar 05, 2024 at 12:13:07PM +0000, Filipe Manana wrote: > On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@bur.io> wrote: > > > > It is possible to confuse the btrfs device cache (fs_devices) by > > starting with a multi-device filesystem, then removing and re-adding a > > device in a way which changes its dev_t while the filesystem is > > unmounted. After this procedure, if we remount, then we are in a funny > > state where struct btrfs_device's "devt" field does not match the bd_dev > > of the "bdev" field. I would say this is bad enough, as we have violated > > a pretty clear invariant. > > > > But for style points, we can then remove the extra device from the fs, > > making it a single device fs, which enables the "temp_fsid" feature, > > which permits multiple separate mounts of different devices with the > > same fsid. Since btrfs is confused and *thinks* there are different > > devices (based on device->devt), it allows a second redundant mount of > > the same device (not a bind mount!). This then allows us to corrupt the > > original mount by doing stuff to the one that should be a bind mount. > > > > This is fixed by the combination of the kernel patch: > > btrfs: support device name lookup in forget > > and the btrfs-progs patches: > > btrfs-progs: allow btrfs device scan -u on dead dev > > btrfs-progs: add udev rule to forget removed device > > May I suggest to make this more readable, easier to the eye? > My inserting blank lines and indenting the lines with the patch > subjects, like for example: > > """ > This is fixed by the combination of the kernel patch: > > btrfs: support device name lookup in forget > > and the btrfs-progs patches: > > btrfs-progs: allow btrfs device scan -u on dead dev > btrfs-progs: add udev rule to forget removed device > """ > > And these should be placed in the test case itself with: > > _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow > btrfs device scan -u on dead dev" > > For btrfs-progs commits, and for the kernel: > > _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name > lookup in forget" I'll do so going forward, thanks. > > I see however that there's discussion for those patches between you, > Anand and David, and it > seems there's a chance those patches won't be merged to fix this bug, > especially the kernel one > for which Anand submitted an alternative. If those are added to the > test case itself, can always be > updated later if needed. I'm leaning towards omitting this part from the commit message and we can update the test case once we pick a winner. Is that OK with you? > > Also avoid putting the test number in the subject - there's no > guarantee it will end up with that number when merged. > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > --- > > common/config | 1 + > > common/rc | 4 ++ > > tests/btrfs/311 | 101 ++++++++++++++++++++++++++++++++++++++++++++ > > tests/btrfs/311.out | 2 + > > 4 files changed, 108 insertions(+) > > create mode 100755 tests/btrfs/311 > > create mode 100644 tests/btrfs/311.out > > > > diff --git a/common/config b/common/config > > index a3b15b96f..43b517fda 100644 > > --- a/common/config > > +++ b/common/config > > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)" > > export GZIP_PROG="$(type -P gzip)" > > export BTRFS_IMAGE_PROG="$(type -P btrfs-image)" > > export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical) > > +export PARTED_PROG="$(type -P parted)" > > > > # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled. > > # newer systems have udevadm command but older systems like RHEL5 don't. > > diff --git a/common/rc b/common/rc > > index 30c44dddd..8e009aca9 100644 > > --- a/common/rc > > +++ b/common/rc > > @@ -5375,6 +5375,10 @@ _require_unshare() { > > _notrun "unshare $*: command not found, should be in util-linux" > > } > > > > +_require_parted() { > > These three last functions from common/rc use that style, the { after > the function name in the same line, > but everywhere else the { is on a new line, and that's the most common > style in fstests. > I wish we had some consistency. > > > + $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found" > > Why not just call: > > _require_command "$PARTED_PROG" parted > > Could even do that in the test and no need for a common function in > this file, as we do in many other tests > (grep for "'_require_command'" in tests/generic for example). > > > +} > > + > > # Return a random file in a directory. A directory is *not* followed > > # recursively. > > _random_file() { > > diff --git a/tests/btrfs/311 b/tests/btrfs/311 > > new file mode 100755 > > index 000000000..887c46ba0 > > --- /dev/null > > +++ b/tests/btrfs/311 > > @@ -0,0 +1,101 @@ > > +#! /bin/bash > > +# SPDX-License-Identifier: GPL-2.0 > > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved. > > +# > > +# FS QA Test 311 > > +# > > +# Test an edge case of multi device volume management in btrfs. > > +# If a device changes devt between mounts of a multi device fs, we can trick > > +# btrfs into mounting the same device twice fully (not as a bind mount). From > > +# there, it is trivial to induce corruption. > > +# > > +. ./common/preamble > > +_begin_fstest auto quick volume > > Missing 'scrub' group. > > > + > > +# real QA test starts here > > +_supported_fs btrfs > > +_require_test > > +_require_parted > > + > > +_cleanup() { > > + cd / > > + umount $MNT > > + umount $BIND > > + losetup -d $DEV0 > > + losetup -d $DEV1 > > + losetup -d $DEV2 > > + rm $IMG0 > > + rm $IMG1 > > + rm $IMG2 > > +} > > + > > +IMG0=$TEST_DIR/$$.img0 > > +IMG1=$TEST_DIR/$$.img1 > > +IMG2=$TEST_DIR/$$.img2 > > +truncate -s 1G $IMG0 > > +truncate -s 1G $IMG1 > > +truncate -s 1G $IMG2 > > +DEV0=$(losetup -f $IMG0 --show) > > +DEV1=$(losetup -f $IMG1 --show) > > +DEV2=$(losetup -f $IMG2 --show) > > +D0P1=$DEV0"p1" > > +D1P1=$DEV1"p1" > > +MNT=$TEST_DIR/mnt > > +BIND=$TEST_DIR/bind > > + > > +# Setup partition table with one partition on each device. > > +$PARTED_PROG $DEV0 'mktable gpt' --script > > +$PARTED_PROG $DEV1 'mktable gpt' --script > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > + > > +# mkfs with two devices to avoid clearing devices on close > > +# single raid to allow removing DEV2. > > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null > > Please redirect to the $seqres.full instead, and _fail if mkfs fails. > That's what we do nowadays due to unpleasant surprises in the past. > > > + > > +# Cycle mount the two device fs to populate both devices into the > > +# stale device cache. > > +mkdir -p $MNT > > +mount $D0P1 $MNT > > Use the _mount() helper. > > > +umount $MNT > > We use $UMOUNT_PROG in fstests. > > > + > > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date. > > +$PARTED_PROG $DEV0 'rm 1' --script > > +$PARTED_PROG $DEV1 'rm 1' --script > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > + > > +# Mount with mismatched dev_t! > > +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point" > > Same here. > > > + > > +# Remove the extra device to bring temp-fsid back in the fray. > > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT > > + > > +# Create the would be bind mount. > > +mkdir -p $BIND > > +mount $D0P1 $BIND > > Same here. > > > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT) > > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND) > > +# If they're different, we are in trouble. > > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show" > > + > > +# Now really prove it by corrupting the first mount with the second. > > +for i in $(seq 20); do > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1 > > This is overriding the .full file, use >> > > > +done > > +for i in $(seq 20); do > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1 > > Same here, this is overriding the .full file, use >> > > > +done > > +sync > > Can we please have a comment mentioning why the sync is needed? > > > +find $BIND -type f -delete > > +sync > > Same here. > > > + > > +# This should blow up both mounts, if the writes somehow didn't overlap at all. > > +$FSTRIM_PROG $BIND > > Since it's using the fstrim program and needs trim to be supported, > the test misses a: > > _require_batched_discard "$TEST_DIR" > > at the top. > > > +echo 3 > /proc/sys/vm/drop_caches > > Please add a comment mentioning why we are dropping caches. > > > +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1 > > The test passes whether scrub fails or succeeds, as it's redirecting > stdout and stderr to the .full file and ignoring the exit status of > the command. > > The ideal fstests way is to put the expected output in the golden > output file and just call the command without redirecting anything. > If that's not doable for some reason, at the very least do a (...) || > echo "Scrub failed, check $seqres.full for details." > > Thanks. > > > + > > +# success, all done > > +echo "Silence is golden" > > +status=0 > > +exit > > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out > > new file mode 100644 > > index 000000000..62f253029 > > --- /dev/null > > +++ b/tests/btrfs/311.out > > @@ -0,0 +1,2 @@ > > +QA output created by 311 > > +Silence is golden > > -- > > 2.43.0 > > > >
On Thu, Mar 7, 2024 at 8:48 PM Boris Burkov <boris@bur.io> wrote: > > On Tue, Mar 05, 2024 at 12:13:07PM +0000, Filipe Manana wrote: > > On Fri, Mar 1, 2024 at 12:02 AM Boris Burkov <boris@bur.io> wrote: > > > > > > It is possible to confuse the btrfs device cache (fs_devices) by > > > starting with a multi-device filesystem, then removing and re-adding a > > > device in a way which changes its dev_t while the filesystem is > > > unmounted. After this procedure, if we remount, then we are in a funny > > > state where struct btrfs_device's "devt" field does not match the bd_dev > > > of the "bdev" field. I would say this is bad enough, as we have violated > > > a pretty clear invariant. > > > > > > But for style points, we can then remove the extra device from the fs, > > > making it a single device fs, which enables the "temp_fsid" feature, > > > which permits multiple separate mounts of different devices with the > > > same fsid. Since btrfs is confused and *thinks* there are different > > > devices (based on device->devt), it allows a second redundant mount of > > > the same device (not a bind mount!). This then allows us to corrupt the > > > original mount by doing stuff to the one that should be a bind mount. > > > > > > This is fixed by the combination of the kernel patch: > > btrfs: support device name lookup in forget > > > and the btrfs-progs patches: > > > btrfs-progs: allow btrfs device scan -u on dead dev > > > btrfs-progs: add udev rule to forget removed device > > > > May I suggest to make this more readable, easier to the eye? > > My inserting blank lines and indenting the lines with the patch > > subjects, like for example: > > > > """ > > This is fixed by the combination of the kernel patch: > > > > btrfs: support device name lookup in forget > > > > and the btrfs-progs patches: > > > > btrfs-progs: allow btrfs device scan -u on dead dev > > btrfs-progs: add udev rule to forget removed device > > """ > > > > And these should be placed in the test case itself with: > > > > _fixed_by_git_commit btrfs-progs xxxxxxxxxx "btrfs-progs: allow > > btrfs device scan -u on dead dev" > > > > For btrfs-progs commits, and for the kernel: > > > > _fixed_by_kernel_commit xxxxxxxxxxxx "btrfs: support device name > > lookup in forget" > > I'll do so going forward, thanks. > > > > > I see however that there's discussion for those patches between you, > > Anand and David, and it > > seems there's a chance those patches won't be merged to fix this bug, > > especially the kernel one > > for which Anand submitted an alternative. If those are added to the > > test case itself, can always be > > updated later if needed. > > I'm leaning towards omitting this part from the commit message and we > can update the test case once we pick a winner. Is that OK with you? Yes, thanks. > > > > > Also avoid putting the test number in the subject - there's no > > guarantee it will end up with that number when merged. > > > > > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > > --- > > > common/config | 1 + > > > common/rc | 4 ++ > > > tests/btrfs/311 | 101 ++++++++++++++++++++++++++++++++++++++++++++ > > > tests/btrfs/311.out | 2 + > > > 4 files changed, 108 insertions(+) > > > create mode 100755 tests/btrfs/311 > > > create mode 100644 tests/btrfs/311.out > > > > > > diff --git a/common/config b/common/config > > > index a3b15b96f..43b517fda 100644 > > > --- a/common/config > > > +++ b/common/config > > > @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)" > > > export GZIP_PROG="$(type -P gzip)" > > > export BTRFS_IMAGE_PROG="$(type -P btrfs-image)" > > > export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical) > > > +export PARTED_PROG="$(type -P parted)" > > > > > > # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled. > > > # newer systems have udevadm command but older systems like RHEL5 don't. > > > diff --git a/common/rc b/common/rc > > > index 30c44dddd..8e009aca9 100644 > > > --- a/common/rc > > > +++ b/common/rc > > > @@ -5375,6 +5375,10 @@ _require_unshare() { > > > _notrun "unshare $*: command not found, should be in util-linux" > > > } > > > > > > +_require_parted() { > > > > These three last functions from common/rc use that style, the { after > > the function name in the same line, > > but everywhere else the { is on a new line, and that's the most common > > style in fstests. > > I wish we had some consistency. > > > > > + $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found" > > > > Why not just call: > > > > _require_command "$PARTED_PROG" parted > > > > Could even do that in the test and no need for a common function in > > this file, as we do in many other tests > > (grep for "'_require_command'" in tests/generic for example). > > > > > +} > > > + > > > # Return a random file in a directory. A directory is *not* followed > > > # recursively. > > > _random_file() { > > > diff --git a/tests/btrfs/311 b/tests/btrfs/311 > > > new file mode 100755 > > > index 000000000..887c46ba0 > > > --- /dev/null > > > +++ b/tests/btrfs/311 > > > @@ -0,0 +1,101 @@ > > > +#! /bin/bash > > > +# SPDX-License-Identifier: GPL-2.0 > > > +# Copyright (C) 2024 Meta, Inc. All Rights Reserved. > > > +# > > > +# FS QA Test 311 > > > +# > > > +# Test an edge case of multi device volume management in btrfs. > > > +# If a device changes devt between mounts of a multi device fs, we can trick > > > +# btrfs into mounting the same device twice fully (not as a bind mount). From > > > +# there, it is trivial to induce corruption. > > > +# > > > +. ./common/preamble > > > +_begin_fstest auto quick volume > > > > Missing 'scrub' group. > > > > > + > > > +# real QA test starts here > > > +_supported_fs btrfs > > > +_require_test > > > +_require_parted > > > + > > > +_cleanup() { > > > + cd / > > > + umount $MNT > > > + umount $BIND > > > + losetup -d $DEV0 > > > + losetup -d $DEV1 > > > + losetup -d $DEV2 > > > + rm $IMG0 > > > + rm $IMG1 > > > + rm $IMG2 > > > +} > > > + > > > +IMG0=$TEST_DIR/$$.img0 > > > +IMG1=$TEST_DIR/$$.img1 > > > +IMG2=$TEST_DIR/$$.img2 > > > +truncate -s 1G $IMG0 > > > +truncate -s 1G $IMG1 > > > +truncate -s 1G $IMG2 > > > +DEV0=$(losetup -f $IMG0 --show) > > > +DEV1=$(losetup -f $IMG1 --show) > > > +DEV2=$(losetup -f $IMG2 --show) > > > +D0P1=$DEV0"p1" > > > +D1P1=$DEV1"p1" > > > +MNT=$TEST_DIR/mnt > > > +BIND=$TEST_DIR/bind > > > + > > > +# Setup partition table with one partition on each device. > > > +$PARTED_PROG $DEV0 'mktable gpt' --script > > > +$PARTED_PROG $DEV1 'mktable gpt' --script > > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > > + > > > +# mkfs with two devices to avoid clearing devices on close > > > +# single raid to allow removing DEV2. > > > +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null > > > > Please redirect to the $seqres.full instead, and _fail if mkfs fails. > > That's what we do nowadays due to unpleasant surprises in the past. > > > > > + > > > +# Cycle mount the two device fs to populate both devices into the > > > +# stale device cache. > > > +mkdir -p $MNT > > > +mount $D0P1 $MNT > > > > Use the _mount() helper. > > > > > +umount $MNT > > > > We use $UMOUNT_PROG in fstests. > > > > > + > > > +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date. > > > +$PARTED_PROG $DEV0 'rm 1' --script > > > +$PARTED_PROG $DEV1 'rm 1' --script > > > +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script > > > +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script > > > + > > > +# Mount with mismatched dev_t! > > > +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point" > > > > Same here. > > > > > + > > > +# Remove the extra device to bring temp-fsid back in the fray. > > > +$BTRFS_UTIL_PROG device remove $DEV2 $MNT > > > + > > > +# Create the would be bind mount. > > > +mkdir -p $BIND > > > +mount $D0P1 $BIND > > > > Same here. > > > > > +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT) > > > +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND) > > > +# If they're different, we are in trouble. > > > +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show" > > > + > > > +# Now really prove it by corrupting the first mount with the second. > > > +for i in $(seq 20); do > > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1 > > > > This is overriding the .full file, use >> > > > > > +done > > > +for i in $(seq 20); do > > > + $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1 > > > > Same here, this is overriding the .full file, use >> > > > > > +done > > > +sync > > > > Can we please have a comment mentioning why the sync is needed? > > > > > +find $BIND -type f -delete > > > +sync > > > > Same here. > > > > > + > > > +# This should blow up both mounts, if the writes somehow didn't overlap at all. > > > +$FSTRIM_PROG $BIND > > > > Since it's using the fstrim program and needs trim to be supported, > > the test misses a: > > > > _require_batched_discard "$TEST_DIR" > > > > at the top. > > > > > +echo 3 > /proc/sys/vm/drop_caches > > > > Please add a comment mentioning why we are dropping caches. > > > > > +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1 > > > > The test passes whether scrub fails or succeeds, as it's redirecting > > stdout and stderr to the .full file and ignoring the exit status of > > the command. > > > > The ideal fstests way is to put the expected output in the golden > > output file and just call the command without redirecting anything. > > If that's not doable for some reason, at the very least do a (...) || > > echo "Scrub failed, check $seqres.full for details." > > > > Thanks. > > > > > + > > > +# success, all done > > > +echo "Silence is golden" > > > +status=0 > > > +exit > > > diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out > > > new file mode 100644 > > > index 000000000..62f253029 > > > --- /dev/null > > > +++ b/tests/btrfs/311.out > > > @@ -0,0 +1,2 @@ > > > +QA output created by 311 > > > +Silence is golden > > > -- > > > 2.43.0 > > > > > >
diff --git a/common/config b/common/config index a3b15b96f..43b517fda 100644 --- a/common/config +++ b/common/config @@ -235,6 +235,7 @@ export BLKZONE_PROG="$(type -P blkzone)" export GZIP_PROG="$(type -P gzip)" export BTRFS_IMAGE_PROG="$(type -P btrfs-image)" export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical) +export PARTED_PROG="$(type -P parted)" # use 'udevadm settle' or 'udevsettle' to wait for lv to be settled. # newer systems have udevadm command but older systems like RHEL5 don't. diff --git a/common/rc b/common/rc index 30c44dddd..8e009aca9 100644 --- a/common/rc +++ b/common/rc @@ -5375,6 +5375,10 @@ _require_unshare() { _notrun "unshare $*: command not found, should be in util-linux" } +_require_parted() { + $PARTED_PROG --list &>/dev/null || _notrun "parted: command not found" +} + # Return a random file in a directory. A directory is *not* followed # recursively. _random_file() { diff --git a/tests/btrfs/311 b/tests/btrfs/311 new file mode 100755 index 000000000..887c46ba0 --- /dev/null +++ b/tests/btrfs/311 @@ -0,0 +1,101 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2024 Meta, Inc. All Rights Reserved. +# +# FS QA Test 311 +# +# Test an edge case of multi device volume management in btrfs. +# If a device changes devt between mounts of a multi device fs, we can trick +# btrfs into mounting the same device twice fully (not as a bind mount). From +# there, it is trivial to induce corruption. +# +. ./common/preamble +_begin_fstest auto quick volume + +# real QA test starts here +_supported_fs btrfs +_require_test +_require_parted + +_cleanup() { + cd / + umount $MNT + umount $BIND + losetup -d $DEV0 + losetup -d $DEV1 + losetup -d $DEV2 + rm $IMG0 + rm $IMG1 + rm $IMG2 +} + +IMG0=$TEST_DIR/$$.img0 +IMG1=$TEST_DIR/$$.img1 +IMG2=$TEST_DIR/$$.img2 +truncate -s 1G $IMG0 +truncate -s 1G $IMG1 +truncate -s 1G $IMG2 +DEV0=$(losetup -f $IMG0 --show) +DEV1=$(losetup -f $IMG1 --show) +DEV2=$(losetup -f $IMG2 --show) +D0P1=$DEV0"p1" +D1P1=$DEV1"p1" +MNT=$TEST_DIR/mnt +BIND=$TEST_DIR/bind + +# Setup partition table with one partition on each device. +$PARTED_PROG $DEV0 'mktable gpt' --script +$PARTED_PROG $DEV1 'mktable gpt' --script +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script + +# mkfs with two devices to avoid clearing devices on close +# single raid to allow removing DEV2. +$MKFS_BTRFS_PROG -f -msingle -dsingle $D0P1 $DEV2 &>/dev/null + +# Cycle mount the two device fs to populate both devices into the +# stale device cache. +mkdir -p $MNT +mount $D0P1 $MNT +umount $MNT + +# Swap the partition dev_ts. This leaves the dev_t in the cache out of date. +$PARTED_PROG $DEV0 'rm 1' --script +$PARTED_PROG $DEV1 'rm 1' --script +$PARTED_PROG $DEV1 'mkpart mypart 1M 100%' --script +$PARTED_PROG $DEV0 'mkpart mypart 1M 100%' --script + +# Mount with mismatched dev_t! +mount $D0P1 $MNT || _fail "failed to remount; don't proceed and do dangerous stuff on raw mount point" + +# Remove the extra device to bring temp-fsid back in the fray. +$BTRFS_UTIL_PROG device remove $DEV2 $MNT + +# Create the would be bind mount. +mkdir -p $BIND +mount $D0P1 $BIND +mount_show=$($BTRFS_UTIL_PROG filesystem show $MNT) +bind_show=$($BTRFS_UTIL_PROG filesystem show $BIND) +# If they're different, we are in trouble. +[ "$mount_show" = "$bind_show" ] || echo "$mount_show != $bind_show" + +# Now really prove it by corrupting the first mount with the second. +for i in $(seq 20); do + $XFS_IO_PROG -f -c "pwrite 0 50M" $MNT/foo.$i >$seqres.full 2>&1 +done +for i in $(seq 20); do + $XFS_IO_PROG -f -c "pwrite 0 50M" $BIND/foo.$i >$seqres.full 2>&1 +done +sync +find $BIND -type f -delete +sync + +# This should blow up both mounts, if the writes somehow didn't overlap at all. +$FSTRIM_PROG $BIND +echo 3 > /proc/sys/vm/drop_caches +$BTRFS_UTIL_PROG scrub start -B $MNT >>$seqres.full 2>&1 + +# success, all done +echo "Silence is golden" +status=0 +exit diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out new file mode 100644 index 000000000..62f253029 --- /dev/null +++ b/tests/btrfs/311.out @@ -0,0 +1,2 @@ +QA output created by 311 +Silence is golden
It is possible to confuse the btrfs device cache (fs_devices) by starting with a multi-device filesystem, then removing and re-adding a device in a way which changes its dev_t while the filesystem is unmounted. After this procedure, if we remount, then we are in a funny state where struct btrfs_device's "devt" field does not match the bd_dev of the "bdev" field. I would say this is bad enough, as we have violated a pretty clear invariant. But for style points, we can then remove the extra device from the fs, making it a single device fs, which enables the "temp_fsid" feature, which permits multiple separate mounts of different devices with the same fsid. Since btrfs is confused and *thinks* there are different devices (based on device->devt), it allows a second redundant mount of the same device (not a bind mount!). This then allows us to corrupt the original mount by doing stuff to the one that should be a bind mount. This is fixed by the combination of the kernel patch: btrfs: support device name lookup in forget and the btrfs-progs patches: btrfs-progs: allow btrfs device scan -u on dead dev btrfs-progs: add udev rule to forget removed device Signed-off-by: Boris Burkov <boris@bur.io> --- common/config | 1 + common/rc | 4 ++ tests/btrfs/311 | 101 ++++++++++++++++++++++++++++++++++++++++++++ tests/btrfs/311.out | 2 + 4 files changed, 108 insertions(+) create mode 100755 tests/btrfs/311 create mode 100644 tests/btrfs/311.out