Message ID: 59b54cc3-b98b-aff9-14fc-dc25c61111c6@redhat.com (mailing list archive)
State: New, archived
Series: fix writing to the filesystem after unmount
On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote:
> lvm may suspend any logical volume anytime. If lvm suspend races with
> unmount, it may be possible that the kernel writes to the filesystem after
> unmount successfully returned. The problem can be demonstrated with this
> script:
>
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
> mount -t ext4 /dev/vg/lv /mnt/test
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
> umount /mnt/test
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd
>
> The script unmounts the filesystem and runs md5sum twice; the result is
> that these two invocations return different hashes.
>
> What happens:
> * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
>   increments sb->s_active
> * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
>   deactivate_super, deactivate_super sees that sb->s_active is 2, so it
>   decreases it to 1 and does nothing - the umount syscall returns
>   successfully
> * now we have a mounted filesystem despite the fact that umount returned

That can happen for any number of reasons. Other codepaths might very
well still hold active references to the superblock. The same thing can
happen when you have your filesystem pinned in another mount namespace.

If you really want to be absolutely sure that the superblock is
destroyed you must use a mechanism like fanotify which allows you to get
notified on superblock destruction.

> @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
>  	}
>  	fsnotify_vfsmount_delete(&mnt->mnt);
>  	dput(mnt->mnt.mnt_root);
> -	deactivate_super(mnt->mnt.mnt_sb);
> +	wait_and_deactivate_super(mnt->mnt.mnt_sb);

Your patch means that we hang on any umount when the filesystem is
frozen. IOW, you'd also hang on any umount of a bind-mount. IOW, every
single container making use of this filesystem via bind-mounts would
hang on umount and shutdown.

You'd effectively build a deadlock trap for userspace when the
filesystem is frozen. And nothing can make progress until that thing is
thawed. Umount can't block if the block device is frozen.

>  	mnt_free_id(mnt);
>  	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
>  }
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c	2023-09-05 21:09:16.000000000 +0200
> +++ linux-2.6/fs/super.c	2023-09-06 09:52:20.000000000 +0200
> @@ -36,6 +36,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/user_namespace.h>
>  #include <linux/fs_context.h>
> +#include <linux/delay.h>
>  #include <uapi/linux/mount.h>
>  #include "internal.h"
>
> @@ -365,6 +366,25 @@ void deactivate_super(struct super_block
>  EXPORT_SYMBOL(deactivate_super);
>
>  /**
> + * wait_and_deactivate_super - wait for unfreeze and drop a reference
> + * @s: superblock to deactivate
> + *
> + * Variant of deactivate_super(), except that we wait if the filesystem is
> + * frozen. This is required on unmount, to make sure that the filesystem is
> + * really unmounted when this function returns.
> + */
> +void wait_and_deactivate_super(struct super_block *s)
> +{
> +	down_write(&s->s_umount);
> +	while (s->s_writers.frozen != SB_UNFROZEN) {
> +		up_write(&s->s_umount);
> +		msleep(1);
> +		down_write(&s->s_umount);
> +	}
> +	deactivate_locked_super(s);

That msleep(1) alone is a pretty nasty hack. We should definitely not
spin in code like this. That superblock could stay frozen for a long
time without s_umount held. So this is spinning.

Even if we wanted to do this it would need to use a similar wait
mechanism for the filesystem to be thawed like we do in
thaw_super_locked().

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
On Wed, 6 Sep 2023, Christian Brauner wrote:

> > What happens:
> > * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
> >   increments sb->s_active
> > * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
> >   deactivate_super, deactivate_super sees that sb->s_active is 2, so it
> >   decreases it to 1 and does nothing - the umount syscall returns
> >   successfully
> > * now we have a mounted filesystem despite the fact that umount returned
>
> That can happen for any number of reasons. Other codepaths might very
> well still hold active references to the superblock. The same thing can
> happen when you have your filesystem pinned in another mount namespace.
>
> If you really want to be absolutely sure that the superblock is
> destroyed you must use a mechanism like fanotify which allows you to get
> notified on superblock destruction.

If the administrator runs a script that performs unmount and then back-up
of the underlying block device, it may read corrupted data. I think that
this is a problem.

> > @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
> >  	}
> >  	fsnotify_vfsmount_delete(&mnt->mnt);
> >  	dput(mnt->mnt.mnt_root);
> > -	deactivate_super(mnt->mnt.mnt_sb);
> > +	wait_and_deactivate_super(mnt->mnt.mnt_sb);
>
> Your patch means that we hang on any umount when the filesystem is
> frozen.

Currently, if we freeze a filesystem with "fsfreeze" and unmount it, the
mount point is removed, but the filesystem stays active and it is leaked.
You can't unfreeze it with "fsfreeze --unfreeze" because the mount point
is gone. (The only way to recover it is "echo j>/proc/sysrq-trigger".)

> IOW, you'd also hang on any umount of a bind-mount. IOW, every
> single container making use of this filesystem via bind-mounts would
> hang on umount and shutdown.

bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing
in this case. I tried unmounting bind-mounts and there was no deadlock.

> You'd effectively build a deadlock trap for userspace when the
> filesystem is frozen. And nothing can make progress until that thing is
> thawed. Umount can't block if the block device is frozen.

Unmounting a filesystem frozen with "fsfreeze" doesn't work in the current
kernel. We can say that the administrator shouldn't do it. But reading the
block device after umount finishes is something that the administrator may
do.

BTW, what do you think that unmount of a frozen filesystem should properly
do? Fail with -EBUSY? Or unfreeze the filesystem and unmount it? Or
something else?

> That msleep(1) alone is a pretty nasty hack. We should definitely not
> spin in code like this. That superblock could stay frozen for a long
> time without s_umount held. So this is spinning.
>
> Even if we wanted to do this it would need to use a similar wait
> mechanism for the filesystem to be thawed like we do in
> thaw_super_locked().

Yes, it may be possible to rework it using a wait queue.

Mikulas
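The wait-queue rework discussed here (and the wait_var_event-style mechanism Christian alludes to) can be illustrated in userspace terms. The following is only a hedged sketch using POSIX condition variables instead of the kernel's wait queues and the s_umount rwsem; the names `frozen`, `thaw_side` and `wait_for_thaw` are invented for the illustration. The point is that the waiter sleeps until it is woken exactly once by the thawing side, rather than retaking the lock every millisecond the way the msleep(1) loop does:

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative analogue only: "frozen" stands in for
 * s->s_writers.frozen != SB_UNFROZEN, the mutex for s_umount,
 * and the condition variable for the kernel-side wait queue. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t thawed = PTHREAD_COND_INITIALIZER;
static int frozen = 1;

/* Analogue of thaw: clear the flag and wake all waiters. */
static void *thaw_side(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	frozen = 0;
	pthread_cond_broadcast(&thawed);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Analogue of the proposed unmount path: sleep, don't poll.
 * The predicate is rechecked under the lock after every wakeup,
 * as condition-variable waits require. */
static void wait_for_thaw(void)
{
	pthread_mutex_lock(&lock);
	while (frozen)
		pthread_cond_wait(&thawed, &lock);
	/* would proceed with teardown here, still holding the lock */
	pthread_mutex_unlock(&lock);
}
```

Unlike the polling loop, nothing here wakes up until the thaw side signals, which is the property thaw_super_locked()'s wait mechanism provides in the kernel.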
On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote:
> lvm may suspend any logical volume anytime. If lvm suspend races with
> unmount, it may be possible that the kernel writes to the filesystem after
> unmount successfully returned. The problem can be demonstrated with this
> script:
>
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
> mount -t ext4 /dev/vg/lv /mnt/test
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
> umount /mnt/test
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd
>
> The script unmounts the filesystem and runs md5sum twice; the result is
> that these two invocations return different hashes.
>
> What happens:
> * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
>   increments sb->s_active
> * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
>   deactivate_super, deactivate_super sees that sb->s_active is 2, so it
>   decreases it to 1 and does nothing - the umount syscall returns
>   successfully
> * now we have a mounted filesystem despite the fact that umount returned
> * we call md5sum, this waits for the block device being unblocked
> * dmsetup resume unblocks the block device and calls thaw_bdev, that calls
>   thaw_super and thaw_super_locked
> * thaw_super_locked calls deactivate_locked_super, this actually drops the
>   refcount and performs the unmount. The unmount races with md5sum. md5sum
>   wins the race and it returns the hash of the filesystem before it was
>   unmounted
> * the second md5sum returns the hash after the filesystem was unmounted
>
> In order to fix this bug, this patch introduces a new function
> wait_and_deactivate_super that will wait if the filesystem is frozen and
> perform deactivate_locked_super only after that.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
>  fs/namespace.c     |    2 +-
>  fs/super.c         |   20 ++++++++++++++++++++
>  include/linux/fs.h |    1 +
>  3 files changed, 22 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/fs/namespace.c
> ===================================================================
> --- linux-2.6.orig/fs/namespace.c	2023-09-06 09:45:54.000000000 +0200
> +++ linux-2.6/fs/namespace.c	2023-09-06 09:47:15.000000000 +0200
> @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
>  	}
>  	fsnotify_vfsmount_delete(&mnt->mnt);
>  	dput(mnt->mnt.mnt_root);
> -	deactivate_super(mnt->mnt.mnt_sb);
> +	wait_and_deactivate_super(mnt->mnt.mnt_sb);
>  	mnt_free_id(mnt);
>  	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
>  }
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c	2023-09-05 21:09:16.000000000 +0200
> +++ linux-2.6/fs/super.c	2023-09-06 09:52:20.000000000 +0200
> @@ -36,6 +36,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/user_namespace.h>
>  #include <linux/fs_context.h>
> +#include <linux/delay.h>
>  #include <uapi/linux/mount.h>
>  #include "internal.h"
>
> @@ -365,6 +366,25 @@ void deactivate_super(struct super_block
>  EXPORT_SYMBOL(deactivate_super);
>
>  /**
> + * wait_and_deactivate_super - wait for unfreeze and drop a reference
> + * @s: superblock to deactivate
> + *
> + * Variant of deactivate_super(), except that we wait if the filesystem is
> + * frozen. This is required on unmount, to make sure that the filesystem is
> + * really unmounted when this function returns.
> + */
> +void wait_and_deactivate_super(struct super_block *s)
> +{
> +	down_write(&s->s_umount);
> +	while (s->s_writers.frozen != SB_UNFROZEN) {
> +		up_write(&s->s_umount);
> +		msleep(1);
> +		down_write(&s->s_umount);
> +	}

Instead of msleep, could you use wait_var_event_killable like
wait_for_partially_frozen() does?

--D

> +	deactivate_locked_super(s);
> +}
> +
> +/**
>   * grab_super - acquire an active reference
>   * @s: reference we are trying to make active
>   *
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2023-09-05 21:09:16.000000000 +0200
> +++ linux-2.6/include/linux/fs.h	2023-09-06 09:46:56.000000000 +0200
> @@ -2262,6 +2262,7 @@ void kill_anon_super(struct super_block
>  void kill_litter_super(struct super_block *sb);
>  void deactivate_super(struct super_block *sb);
>  void deactivate_locked_super(struct super_block *sb);
> +void wait_and_deactivate_super(struct super_block *sb);
>  int set_anon_super(struct super_block *s, void *data);
>  int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
>  int get_anon_bdev(dev_t *);
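The s_active accounting at the heart of the quoted report can be modeled in a few lines. The following is a plain-C userspace model, not kernel code; the names `mount_fs`, `freeze_fs` and `deactivate` are invented for the illustration. It shows why umount's deactivate_super() returns without tearing anything down while the filesystem is frozen, and why the real teardown happens later, on thaw:

```c
#include <assert.h>
#include <stdio.h>

/* Model of the refcounting described above: mount holds one active
 * reference, freeze_super() (via freeze_bdev) takes a second one. */
static int s_active;   /* models sb->s_active */
static int torn_down;  /* models generic_shutdown_super() having run */

static void mount_fs(void)  { s_active = 1; torn_down = 0; }
static void freeze_fs(void) { s_active++; }

/* Models deactivate_super(): only dropping the LAST active reference
 * actually shuts the filesystem down. */
static void deactivate(const char *who)
{
	if (--s_active > 0) {
		printf("%s: s_active now %d, filesystem stays live\n",
		       who, s_active);
		return;
	}
	torn_down = 1;
	printf("%s: last reference dropped, filesystem torn down\n", who);
}
```

Running `mount_fs(); freeze_fs(); deactivate("umount");` leaves `torn_down == 0`: the umount "succeeded" but the filesystem is still live, and only the later `deactivate("thaw")` performs the real teardown, which is the deferred unmount that races with the script's first md5sum.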
> Currently, if we freeze a filesystem with "fsfreeze" and unmount it, the
> mount point is removed, but the filesystem stays active and it is leaked.
> You can't unfreeze it with "fsfreeze --unfreeze" because the mount point
> is gone. (The only way to recover it is "echo j>/proc/sysrq-trigger".)

You can of course always remount and unfreeze it.

> > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > single container making use of this filesystem via bind-mounts would
> > hang on umount and shutdown.
>
> bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing
> in this case. I tried unmounting bind-mounts and there was no deadlock.

With your patch what happens if you do the following?

#!/bin/sh -ex
modprobe brd rd_size=4194304
vgcreate vg /dev/ram0
lvcreate -L 16M -n lv vg
mkfs.ext4 /dev/vg/lv

mount -t ext4 /dev/vg/lv /mnt/test
mount --bind /mnt/test /opt
mount --make-private /opt

dmsetup suspend /dev/vg/lv
(sleep 1; dmsetup resume /dev/vg/lv) &

umount /opt # I'd expect this to hang

md5sum /dev/vg/lv
md5sum /dev/vg/lv
dmsetup remove_all
rmmod brd

> BTW, what do you think that unmount of a frozen filesystem should properly
> do? Fail with -EBUSY? Or unfreeze the filesystem and unmount it? Or
> something else?

In my opinion we should refuse to unmount frozen filesystems and log an
error that the filesystem is frozen. Waiting forever isn't a good idea
in my opinion.

But this is a significant uapi change afaict, so this would need to be
hidden behind a config option, a sysctl, or it would have to be a new
flag to umount2(), MNT_UNFROZEN, which would allow an administrator to
choose not to unmount a frozen filesystem.
On Wed, Sep 06, 2023 at 08:22:45AM -0700, Darrick J. Wong wrote:
> On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote:
> > lvm may suspend any logical volume anytime. If lvm suspend races with
> > unmount, it may be possible that the kernel writes to the filesystem after
> > unmount successfully returned.
[...]
> > In order to fix this bug, this patch introduces a new function
> > wait_and_deactivate_super that will wait if the filesystem is frozen and
> > perform deactivate_locked_super only after that.
[...]
> > +void wait_and_deactivate_super(struct super_block *s)
> > +{
> > +	down_write(&s->s_umount);
> > +	while (s->s_writers.frozen != SB_UNFROZEN) {
> > +		up_write(&s->s_umount);
> > +		msleep(1);
> > +		down_write(&s->s_umount);
> > +	}
>
> Instead of msleep, could you use wait_var_event_killable like
> wait_for_partially_frozen() does?

I said the same thing, but I think the patch in this form isn't a good
idea and is technically also uapi breakage. Anyway, can you take a look
at my third response here?
https://lore.kernel.org/lkml/20230906-aufheben-hagel-9925501b7822@brauner

(I forgot you worked on freezing as well. I'm currently moving the
freezing bits to fs_holder_ops:
https://gitlab.com/brauner/linux/-/commits/vfs.super.freeze)
On Wed, Sep 06, 2023 at 05:33:32PM +0200, Christian Brauner wrote:
[...]
> > BTW, what do you think that unmount of a frozen filesystem should properly
> > do? Fail with -EBUSY? Or unfreeze the filesystem and unmount it? Or
> > something else?
>
> In my opinion we should refuse to unmount frozen filesystems and log an
> error that the filesystem is frozen. Waiting forever isn't a good idea
> in my opinion.
>
> But this is a significant uapi change afaict, so this would need to be
> hidden behind a config option, a sysctl, or it would have to be a new
> flag to umount2(), MNT_UNFROZEN, which would allow an administrator to
> choose not to unmount a frozen filesystem.

That's probably too careful. I think we could risk starting to return an
error when trying to unmount a frozen filesystem. And if that causes
regressions we could go and look at another option like MNT_UNFROZEN or
whatever.
On Wed, 6 Sep 2023, Christian Brauner wrote:

> > > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > > single container making use of this filesystem via bind-mounts would
> > > hang on umount and shutdown.
> >
> > bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing
> > in this case. I tried unmounting bind-mounts and there was no deadlock.
>
> With your patch what happens if you do the following?
>
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
>
> mount -t ext4 /dev/vg/lv /mnt/test
> mount --bind /mnt/test /opt
> mount --make-private /opt
>
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
>
> umount /opt # I'd expect this to hang
>
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd

"umount /opt" doesn't hang. It waits one second (until dmsetup resume is
called) and then proceeds.

Then, it fails with "rmmod: ERROR: Module brd is in use" because the
script didn't unmount /mnt/test.

> > BTW, what do you think that unmount of a frozen filesystem should properly
> > do? Fail with -EBUSY? Or unfreeze the filesystem and unmount it? Or
> > something else?
>
> In my opinion we should refuse to unmount frozen filesystems and log an
> error that the filesystem is frozen. Waiting forever isn't a good idea
> in my opinion.

But lvm may freeze filesystems anytime - so we'd get randomly returned
errors then.

> But this is a significant uapi change afaict, so this would need to be
> hidden behind a config option, a sysctl, or it would have to be a new
> flag to umount2(), MNT_UNFROZEN, which would allow an administrator to
> choose not to unmount a frozen filesystem.

The kernel currently distinguishes between kernel-initiated freeze (that
is used by the XFS scrub) and userspace-initiated freeze (that is used by
the FIFREEZE ioctl and by device-mapper initiated freeze through
freeze_bdev).

Perhaps we could distinguish between FIFREEZE-initiated freezes and
device-mapper initiated freezes as well. And we could change the logic to
return -EBUSY if the freeze was initiated by FIFREEZE and to wait for
unfreeze if it was initiated by the device-mapper.

Mikulas
On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:
> On Wed, 6 Sep 2023, Christian Brauner wrote:
> > With your patch what happens if you do the following?
[...]
> "umount /opt" doesn't hang. It waits one second (until dmsetup resume is
> called) and then proceeds.

So unless I'm really misreading the code - entirely possible - the
umount of the bind-mount now waits until the filesystem is resumed with
your patch. And if that's the case that's a bug.

If at all, then only the last umount, the one that destroys the
superblock, should wait for the filesystem to become unfrozen. A
bind-mount shouldn't, as there are still active mounts of the filesystem
(e.g., /mnt/test).

So you should see this with (unless I really misread things):

#!/bin/sh -ex
modprobe brd rd_size=4194304
vgcreate vg /dev/ram0
lvcreate -L 16M -n lv vg
mkfs.ext4 /dev/vg/lv

mount -t ext4 /dev/vg/lv /mnt/test
mount --bind /mnt/test /opt
mount --make-private /opt

dmsetup suspend /dev/vg/lv
umount /opt # This will hang with your patch?

> Then, it fails with "rmmod: ERROR: Module brd is in use" because the
> script didn't unmount /mnt/test.
>
> > > BTW, what do you think that unmount of a frozen filesystem should
> > > properly do? Fail with -EBUSY? Or unfreeze the filesystem and
> > > unmount it? Or something else?
> >
> > In my opinion we should refuse to unmount frozen filesystems and log an
> > error that the filesystem is frozen. Waiting forever isn't a good idea
> > in my opinion.
>
> But lvm may freeze filesystems anytime - so we'd get randomly returned
> errors then.

So? Or you might hang at any time.

> > But this is a significant uapi change afaict, so this would need to be
> > hidden behind a config option, a sysctl, or it would have to be a new
> > flag to umount2(), MNT_UNFROZEN, which would allow an administrator to
> > choose not to unmount a frozen filesystem.
>
> The kernel currently distinguishes between kernel-initiated freeze (that
> is used by the XFS scrub) and userspace-initiated freeze (that is used by
> the FIFREEZE ioctl and by device-mapper initiated freeze through
> freeze_bdev).

Yes, I'm aware.

> Perhaps we could distinguish between FIFREEZE-initiated freezes and
> device-mapper initiated freezes as well. And we could change the logic to
> return -EBUSY if the freeze was initiated by FIFREEZE and to wait for
> unfreeze if it was initiated by the device-mapper.

For device mapper initiated freezes you can unfreeze independent of any
filesystem mountpoint via dm ioctls.
On Wed, 6 Sep 2023, Christian Brauner wrote:

> On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:
[...]
> > "umount /opt" doesn't hang. It waits one second (until dmsetup resume is
> > called) and then proceeds.
>
> So unless I'm really misreading the code - entirely possible - the
> umount of the bind-mount now waits until the filesystem is resumed with
> your patch. And if that's the case that's a bug.

Yes. It can be fixed by changing wait_and_deactivate_super to this:

void wait_and_deactivate_super(struct super_block *s)
{
	down_write(&s->s_umount);
	while (s->s_writers.frozen != SB_UNFROZEN &&
	       atomic_read(&s->s_active) == 2) {
		up_write(&s->s_umount);
		msleep(1);
		down_write(&s->s_umount);
	}
	deactivate_locked_super(s);
}

> > > > BTW, what do you think that unmount of a frozen filesystem should
> > > > properly do? Fail with -EBUSY? Or unfreeze the filesystem and
> > > > unmount it? Or something else?
> > >
> > > In my opinion we should refuse to unmount frozen filesystems and log an
> > > error that the filesystem is frozen. Waiting forever isn't a good idea
> > > in my opinion.
> >
> > But lvm may freeze filesystems anytime - so we'd get randomly returned
> > errors then.
>
> So? Or you might hang at any time.

lvm doesn't keep logical volumes suspended for a prolonged amount of time.
It will unfreeze them after it has made updates to the dm table and to the
metadata. So, it won't hang forever.

I think it's better to sleep for a short time in umount than to return an
error.

Mikulas
On Wed, Sep 06, 2023 at 05:03:34PM +0200, Mikulas Patocka wrote:

> > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > single container making use of this filesystem via bind-mounts would
> > hang on umount and shutdown.
>
> bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing
> in this case. I tried unmounting bind-mounts and there was no deadlock.

You are making *any* mount destruction hang if the sucker is frozen.
Which includes the things like exit(2) of the last process within a
namespace, etc. And it does include things like

	mount --bind /usr/bin/gcc /tmp/cc; umount /tmp/cc

if /usr happened to be frozen at the moment. This is really not an
option.

> BTW, what do you think that unmount of a frozen filesystem should properly
> do? Fail with -EBUSY? Or unfreeze the filesystem and unmount it? Or
> something else?

It's not just umount(2). It's exit(2). And close(2). And the AF_UNIX
garbage collector taking out an undeliverable SCM_RIGHTS datagram that
happens to contain a reference to the last opened file on a lazy-umounted
fs, etc.
On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:

> Perhaps we could distinguish between FIFREEZE-initiated freezes and
> device-mapper initiated freezes as well. And we could change the logic to
> return -EBUSY if the freeze was initiated by FIFREEZE and to wait for
> unfreeze if it was initiated by the device-mapper.

By the time you get to cleanup_mnt() it's far too late to return -EBUSY.
On Wed 06-09-23 18:52:39, Mikulas Patocka wrote: > On Wed, 6 Sep 2023, Christian Brauner wrote: > > On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote: > > > > > BTW. what do you think that unmount of a frozen filesystem should properly > > > > > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or > > > > > something else? > > > > > > > > In my opinion we should refuse to unmount frozen filesystems and log an > > > > error that the filesystem is frozen. Waiting forever isn't a good idea > > > > in my opinion. > > > > > > But lvm may freeze filesystems anytime - so we'd get randomly returned > > > errors then. > > > > So? Or you might hang at anytime. > > lvm doesn't keep logical volumes suspended for a prolonged amount of time. > It will unfreeze them after it made updates to the dm table and to the > metadata. So, it won't hang forever. > > I think it's better to sleep for a short time in umount than to return an > error. I think we've got too deep down into "how to fix things" but I'm not 100% sure what the "bug" actually is. In the initial posting Mikulas writes "the kernel writes to the filesystem after unmount successfully returned" - is that really such a big issue? Anybody else can open the device and write to it as well. Or even mount the device again. So userspace that relies on this is kind of flaky anyway (and always has been). I understand the need for userspace to make sure the device is not being modified to do its thing - but then it should perhaps freeze the bdev if it wants to be certain? Or at least open it O_EXCL to make sure it's not mounted? WRT the umount behavior for frozen filesystem - as Al writes it's a tricky issue and we've been discussing that several times over the years and it never went anywhere because of nasty corner-cases (which current behavior also has but trading one nasty case for another isn't really a win). 
Now that we distinguish between kernel-initiated freeze (with well defined freeze owner and lifetime) and userspace-initiated freeze, I can imagine we'd make the last unmount of the superblock wait for the kernel-initiated freeze to thaw. But as Al writes, with lazy unmounts, bind mounts in multiple namespaces etc. I'm not sure such behavior brings much value... Honza
> I think we've got too deep down into "how to fix things" but I'm not 100% We did. > sure what the "bug" actually is. In the initial posting Mikulas writes "the > kernel writes to the filesystem after unmount successfully returned" - is > that really such a big issue? Anybody else can open the device and write to > it as well. Or even mount the device again. So userspace that relies on > this is kind of flaky anyway (and always has been). Yeah, agreed. > namespaces etc. I'm not sure such behavior brings much value... It would in any case mean complicating our code for little gain imho. And as I showed in my initial reply the current patch would hang on any bind-mount unmount. IOW, any container. And Al correctly points out issues with exit(), close() and friends on top of that. But I also hate the idea of waiting on the last umount because that can also lead to new unexpected behavior when e.g., the system is shut down and systemd goes on to unmount all things and then suddenly just hangs where before it was able to make progress. And returning EBUSY is tricky as well, as we somehow would need to have a way to refcount in a manner that lets us differentiate between the last "user-visible" superblock reference and the last active superblock reference, which would complicate things even more. I propose we clearly document that unmounting a frozen filesystem means that the superblock stays active at least until the filesystem is unfrozen. And if userspace wants to make sure not to recycle such a frozen superblock, it can now use FSCONFIG_CMD_CREATE_EXCL to detect that. What might be useful is to extend fanotify. Right now we have fsnotify_sb_delete() which lets you detect that a superblock has been destroyed (generic_shutdown_super()). It could be useful to also get notified when a superblock is frozen and unfrozen?
On Thu, 7 Sep 2023, Christian Brauner wrote: > > I think we've got too deep down into "how to fix things" but I'm not 100% > > We did. > > > sure what the "bug" actually is. In the initial posting Mikulas writes "the > > kernel writes to the filesystem after unmount successfully returned" - is > > that really such a big issue? I think it's an issue if the administrator writes a script that unmounts a filesystem and then copies the underlying block device somewhere. Or a script that unmounts a filesystem and runs fsck afterwards. Or a script that unmounts a filesystem and runs mkfs on the same block device. > > Anybody else can open the device and write to it as well. Or even > > mount the device again. So userspace that relies on this is kind of > > flaky anyway (and always has been). It's the admin's responsibility to make sure that the filesystem is not mounted multiple times when he touches the underlying block device after unmount. > Yeah, agreed. > > > namespaces etc. I'm not sure such behavior brings much value... > > It would in any case mean complicating our code for little gain imho. > And as I showed in my initial reply the current patch would hang on any > bind-mount unmount. IOW, any container. And Al correctly points out > issues with exit(), close() and friends on top of that. > > But I also hate the idea of waiting on the last umount because that can > also lead to new unexpected behavior when e.g., the system is shutdown > and systemd goes on to unmount all things and then suddenly just hangs > when before it was able to make progress. Would you agree to waiting on the last umount only if the freeze was initiated by lvm? (there would be no hang in this case because lvm thaws the block device in a timely manner) Mikulas
On Thu 07-09-23 14:04:51, Mikulas Patocka wrote: > > > On Thu, 7 Sep 2023, Christian Brauner wrote: > > > > I think we've got too deep down into "how to fix things" but I'm not 100% > > > > We did. > > > > > sure what the "bug" actually is. In the initial posting Mikulas writes "the > > > kernel writes to the filesystem after unmount successfully returned" - is > > > that really such a big issue? > > I think it's an issue if the administrator writes a script that unmounts a > filesystem and then copies the underyling block device somewhere. Or a > script that unmounts a filesystem and runs fsck afterwards. Or a script > that unmounts a filesystem and runs mkfs on the same block device. Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem hasn't been unmounted properly and complain. Which is exactly what should IMHO happen. > > > Anybody else can open the device and write to it as well. Or even > > > mount the device again. So userspace that relies on this is kind of > > > flaky anyway (and always has been). > > It's admin's responsibility to make sure that the filesystem is not > mounted multiple times when he touches the underlying block device after > unmount. What I wanted to suggest is that we should provide means how to make sure block device is not being modified and educate admins and tool authors about them. Because just doing "umount /dev/sda1" and thinking this means that /dev/sda1 is unused now simply is not enough in today's world for multiple reasons and we cannot solve it just in the kernel. Honza
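Jan's O_EXCL point can be sketched from userspace. A minimal illustrative probe (Python for brevity; the helper name is invented, and this is not what e2fsprogs itself does — the tools simply rely on the same open(O_RDONLY|O_EXCL) semantics):

```python
import errno
import os

def blockdev_has_exclusive_user(path):
    """Return True if an O_EXCL open of a block device fails with EBUSY,
    i.e. some other exclusive opener (a mounted filesystem, a running
    fsck, ...) still holds it. Note: on Linux, O_EXCL without O_CREAT
    only has this meaning for block device nodes; on regular files it
    is simply ignored, so the probe reports False for them."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_EXCL)
    except OSError as e:
        if e.errno == errno.EBUSY:
            return True
        raise
    os.close(fd)
    return False
```

Run against e.g. /dev/sda1, this returns True while the device is mounted (or otherwise exclusively claimed) and False once the superblock has really let go of it — which is exactly the check Jan suggests instead of trusting the return of umount(2).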
On 08. 09. 23 at 9:32, Jan Kara wrote: > On Thu 07-09-23 14:04:51, Mikulas Patocka wrote: >> >> On Thu, 7 Sep 2023, Christian Brauner wrote: >> >>>> I think we've got too deep down into "how to fix things" but I'm not 100% >>> We did. >>> >>>> sure what the "bug" actually is. In the initial posting Mikulas writes "the >>>> kernel writes to the filesystem after unmount successfully returned" - is >>>> that really such a big issue? >> I think it's an issue if the administrator writes a script that unmounts a >> filesystem and then copies the underyling block device somewhere. Or a >> script that unmounts a filesystem and runs fsck afterwards. Or a script >> that unmounts a filesystem and runs mkfs on the same block device. > Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem > hasn't been unmounted properly and complain. Which is exactly what should > IMHO happen. > >>>> Anybody else can open the device and write to it as well. Or even >>>> mount the device again. So userspace that relies on this is kind of >>>> flaky anyway (and always has been). >> It's admin's responsibility to make sure that the filesystem is not >> mounted multiple times when he touches the underlying block device after >> unmount. > What I wanted to suggest is that we should provide means how to make sure > block device is not being modified and educate admins and tool authors > about them. Because just doing "umount /dev/sda1" and thinking this means > that /dev/sda1 is unused now simply is not enough in today's world for > multiple reasons and we cannot solve it just in the kernel. > Hi /me is just wondering how you then imagine e.g. safe removal of a USB drive, when the user can no longer expect that unmount really unmounts the filesystem? IMHO - unmount should detect such a very suspicious state of the block device if it cannot correctly proceed - i.e. report a 'warning/error' on such commands...
Main problem is - if the 'unmount' is successful in this case - the last connection userspace had to this filesystem is lost - and the user cannot get rid of such a filesystem anymore from the system. I'd likely propose in this particular state of unmounting of a frozen filesystem to just proceed - and drop the frozen state together with releasing the filesystem and never issue any ioctl from such a filesystem to the device below - so it would not be a 100% valid unmount - but since the freeze should be nearly equivalent to having a proper 'unmount' being done - it shouldn't be causing any harm either - and all resources associated could be 'released'. IMHO it's correct to 'drop' the frozen state for a filesystem that is not going to exist anymore (assuming it's the last such user) Regards Zdenek
On Fri 08-09-23 11:29:40, Zdenek Kabelac wrote: > Dne 08. 09. 23 v 9:32 Jan Kara napsal(a): > > On Thu 07-09-23 14:04:51, Mikulas Patocka wrote: > > > > > > On Thu, 7 Sep 2023, Christian Brauner wrote: > > > > > > > > I think we've got too deep down into "how to fix things" but I'm not 100% > > > > We did. > > > > > > > > > sure what the "bug" actually is. In the initial posting Mikulas writes "the > > > > > kernel writes to the filesystem after unmount successfully returned" - is > > > > > that really such a big issue? > > > I think it's an issue if the administrator writes a script that unmounts a > > > filesystem and then copies the underyling block device somewhere. Or a > > > script that unmounts a filesystem and runs fsck afterwards. Or a script > > > that unmounts a filesystem and runs mkfs on the same block device. > > Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem > > hasn't been unmounted properly and complain. Which is exactly what should > > IMHO happen. > > > > > > > Anybody else can open the device and write to it as well. Or even > > > > > mount the device again. So userspace that relies on this is kind of > > > > > flaky anyway (and always has been). > > > It's admin's responsibility to make sure that the filesystem is not > > > mounted multiple times when he touches the underlying block device after > > > unmount. > > What I wanted to suggest is that we should provide means how to make sure > > block device is not being modified and educate admins and tool authors > > about them. Because just doing "umount /dev/sda1" and thinking this means > > that /dev/sda1 is unused now simply is not enough in today's world for > > multiple reasons and we cannot solve it just in the kernel. > > > > /me just wondering how do you then imagine i.e. safe removal of USB drive > when user shall not expect unmount really unmounts filesystem? 
Well, currently you click some "Eject / safely remove / whatever" button and then you get a "wait" dialog until everything is done after which you're told the stick is safe to remove. What I imagine is that the "wait" dialog needs to be there while there are any (or exclusive at minimum) openers of the device. Not until umount(2) syscall has returned. And yes, the kernel doesn't quite make that easy - the best you can currently probably do is to try opening the device with O_EXCL and if that fails, you know there's some other exclusive open. > IMHO - unmount should detect some very suspicious state of block device if > it cannot correctly proceed - i.e. reporting 'warning/error' on such > commands... You seem to be concentrated too much on the simple case of a desktop with an USB stick you just copy data to & from. :) The trouble is, as Al wrote elsewhere in this thread that filesystem unmount can be for example a result of exit(2) or close(2) system call if you setup things in a nasty way. Do you want exit(2) to fail because the block device is frozen? Umount(2) has to work for all its users and changing the behavior has nasty corner-cases. So does the current behavior, I agree, but improving situation for one usecase while breaking another usecase isn't really a way forward... > Main problem is - if the 'unmount' is successful in this case - the last > connection userspace had to this fileystem is lost - and user cannot get rid > of such filesystem anymore for a system. Well, the filesystem (struct superblock to be exact) is invisible in /proc/mounts (or whatever), that is true. But it is still very much associated with that block device and if you do 'mount <device> <mntpoint>', you'll get it back. But yes, the filesystem will not go away until all references to it are dropped and you cannot easily find who holds those references and how to get rid of them. 
> I'd likely propose in this particular state of unmounting of a frozen > filesystem to just proceed - and drop the frozen state together with release > filesystem and never issue any ioctl from such filelsystem to the device > below - so it would not be a 100% valid unmount - but since the freeze > should be nearly equivalent of having a proper 'unmount' being done - it > shoudn't be causing any harm either - and all resources associated could > be 'released. IMHO it's correct to 'drop' frozen state for filesystem > that is not going to exist anymore (assuming it's the last such user) This option was also discussed in the past and it has nasty consequences as well. Cleanly shutting down a filesystem usually needs to write to the underlying device so either you allow the filesystem to write to the device on umount breaking assumptions of the user who froze the fs or you'd have to implement a special handling for this case for every filesystem to avoid the writes (and put up with the fact that the filesystem will appear as uncleanly shutdown on the next mount). Not particularly nice either... Honza
Dne 08. 09. 23 v 12:20 Jan Kara napsal(a): > On Fri 08-09-23 11:29:40, Zdenek Kabelac wrote: >> Dne 08. 09. 23 v 9:32 Jan Kara napsal(a): >>> On Thu 07-09-23 14:04:51, Mikulas Patocka wrote: >>>> On Thu, 7 Sep 2023, Christian Brauner wrote: >>>> >>>>>> I think we've got too deep down into "how to fix things" but I'm not 100% >>>>> We did. >>>>> >>>>>> sure what the "bug" actually is. In the initial posting Mikulas writes "the >>>>>> kernel writes to the filesystem after unmount successfully returned" - is >>>>>> that really such a big issue? >>>> I think it's an issue if the administrator writes a script that unmounts a >>>> filesystem and then copies the underyling block device somewhere. Or a >>>> script that unmounts a filesystem and runs fsck afterwards. Or a script >>>> that unmounts a filesystem and runs mkfs on the same block device. >>> Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem >>> hasn't been unmounted properly and complain. Which is exactly what should >>> IMHO happen. >> I'd likely propose in this particular state of unmounting of a frozen >> filesystem to just proceed - and drop the frozen state together with release >> filesystem and never issue any ioctl from such filelsystem to the device >> below - so it would not be a 100% valid unmount - but since the freeze >> should be nearly equivalent of having a proper 'unmount' being done - it >> shoudn't be causing any harm either - and all resources associated could >> be 'released. IMHO it's correct to 'drop' frozen state for filesystem >> that is not going to exist anymore (assuming it's the last such user) > This option was also discussed in the past and it has nasty consequences as > well. 
Cleanly shutting down a filesystem usually needs to write to the > underlying device so either you allow the filesystem to write to the device > on umount breaking assumptions of the user who froze the fs or you'd have > to implement a special handling for this case for every filesystem to avoid > the writes (and put up with the fact that the filesystem will appear as > uncleanly shutdown on the next mount). Not particularly nice either... I'd say there are several options and we should aim towards the variant which is most usable by normal users. Making hyper-complex unmount rule logic that basically no user-space tools around Gnome/KDE... are able to handle well, and getting it to the position where only the core kernel developers have all the 'wisdom' to detect and decode the system state and then 'know what's going on', isn't the favourite goal here. Freeze should be getting the filesystem into a 'consistent' state - the filesystem should be able to 'easily' recover and finish all the ongoing 'unfinished' processes with the next mount without requiring a full 'fsck' - otherwise it would be useless for e.g. snapshots. So to me this looks like the win-win strategy where we basically do not lose any information and we also do not leak kernel resources - since e.g. in case of DM devices - the underlying DM device might have already changed disk characteristics anyway. If the developers then believe that 'more variants' of complex behavior are necessary - then the kernel could have some sysfs parameter to configure some 'more advanced' logic i.e. keep the 'fs mounted' for those skilled admins who are able to go through the deepest corners here - but other than that plain 'umount' should really go with the meaning of a) manages to umount and release a device b) in the other case reports to the user there is still something holding the device.... Regards Zdenek
> I'd say there are several options and we should aim towards the variant > which is most usable by normal users. None of the options is sufficiently satisfying to risk intricate behavioral changes with unknown consequences for existing workloads as far as I'm concerned.
On Thu 2023-09-07 11:44:57, Jan Kara wrote: > On Wed 06-09-23 18:52:39, Mikulas Patocka wrote: > > On Wed, 6 Sep 2023, Christian Brauner wrote: > > > On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote: > > > > > > BTW. what do you think that unmount of a frozen filesystem should properly > > > > > > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or > > > > > > something else? > > > > > > > > > > In my opinion we should refuse to unmount frozen filesystems and log an > > > > > error that the filesystem is frozen. Waiting forever isn't a good idea > > > > > in my opinion. > > > > > > > > But lvm may freeze filesystems anytime - so we'd get randomly returned > > > > errors then. > > > > > > So? Or you might hang at anytime. > > > > lvm doesn't keep logical volumes suspended for a prolonged amount of time. > > It will unfreeze them after it made updates to the dm table and to the > > metadata. So, it won't hang forever. > > > > I think it's better to sleep for a short time in umount than to return an > > error. > > I think we've got too deep down into "how to fix things" but I'm not 100% > sure what the "bug" actually is. In the initial posting Mikulas writes "the > kernel writes to the filesystem after unmount successfully returned" - is > that really such a big issue? Anybody else can open the device and write to > it as well. Or even mount the device again. So userspace that relies on > this is kind of flaky anyway (and always has been). Umm. No? I admin my own systems; I'm responsible for my userspace. Maybe I'm in single user mode. No one writes to my block devices without my permission. By mount, I give such permission to the kernel. By umount, I take such permission away. There's nothing flaky about that. Kernel is simply buggy. Fix it. [Remember that "you should umount before disconnecting USB devices to prevent data corruption"? How is that working with kernel writing to devices after umount?] Best regards, Pavel
Hi! > What I wanted to suggest is that we should provide means how to make sure > block device is not being modified and educate admins and tool authors > about them. Because just doing "umount /dev/sda1" and thinking this means > that /dev/sda1 is unused now simply is not enough in today's world for > multiple reasons and we cannot solve it just in the kernel. It better be enough. And I'm pretty sure it is true in single-user mode, or for usb sticks, or... Simply fix the kernel. No need to re-educate anyone. Pavel
> Well, currently you click some "Eject / safely remove / whatever" button > and then you get a "wait" dialog until everything is done after which > you're told the stick is safe to remove. What I imagine is that the "wait" > dialog needs to be there while there are any (or exclusive at minimum) openers > of the device. Not until umount(2) syscall has returned. And yes, the Agreed. umount(2) doesn't give guarantees about a filesystem being really gone once it has returned. And it really shouldn't. There's too many cases where that doesn't work and it's not a commitment we should make. And there are ways to wait until superblock shutdown that I've mentioned before in other places where it somehow really matters. inotify's IN_UNMOUNT will notify about superblock shutdown. IOW, when it really hits generic_shutdown_super() which can only be hit after unfreezing as that requires active references. So this really can be used to wait for a filesystem to go away across all namespaces, and across filesystem freezing, and it's available to unprivileged users. Simple example:

  # shell 1
  sudo mount -t xfs /dev/sda /mnt
  sudo mount --bind /mnt /opt
  inotifywait -e unmount /mnt

  # shell 2
  sudo umount /opt # nothing happens in shell 1
  sudo umount /mnt # shell 1 gets woken

> corner-cases. So does the current behavior, I agree, but improving > situation for one usecase while breaking another usecase isn't really a way > forward... Agreed. > Well, the filesystem (struct superblock to be exact) is invisible in > /proc/mounts (or whatever), that is true. But it is still very much > associated with that block device and if you do 'mount <device> > <mntpoint>', you'll get it back. But yes, the filesystem will not go away And now we at least have an api to detect that case and refuse to reuse the superblock. > until all references to it are dropped and you cannot easily find who holds > those references and how to get rid of them. Namespaces make this even messier.
You have no easy way of knowing whether the filesystem isn't pinned somewhere else through an explicit bind-mount or when it was copied during mount namespace creation.
On 08. 09. 23 at 13:32, Christian Brauner wrote: >> I'd say there are several options and we should aim towards the variant >> which is most usable by normal users. > > None of the options is sufficiently satisfying to risk intricate > behavioral changes with unknown consequences for existing workloads as > far as I'm concerned. > I'm not convinced anyone has 'fsfreeze' + 'unmount' as a regular workload. Thus solving this unusual race case shouldn't be breaking anyone else's existing workload. ATM there is no good solution for this particular case. So can you please elaborate which new risks we are going to introduce by fixing this resource hole? Zdenek
> So can you please elaborate which new risks are we going to introduce by > fixing this resource hole ? I'm not quite sure why you need a personal summary of the various reasons different people brought together in the thread.
>>>>> "Christian" == Christian Brauner <brauner@kernel.org> writes: >> Well, currently you click some "Eject / safely remove / whatever" button >> and then you get a "wait" dialog until everything is done after which >> you're told the stick is safe to remove. What I imagine is that the "wait" >> dialog needs to be there while there are any (or exclusive at minimum) openers >> of the device. Not until umount(2) syscall has returned. And yes, the > Agreed. umount(2) doesn't give guarantees about a filesystem being > really gone once it has returned. And it really shouldn't. There's > too many cases where that doesn't work and it's not a commitment we > should make. So how the heck is someone supposed to know, from userspace, that a filesystem is unmounted? Just wearing my SysAdmin hat, this strikes me as really potentially painful and annoying. But then again, so are bind mounts from alot of views too. Don't people remember how bad it can be when you are trying to shutdown and system and it hangs because a remote NFS server is down and not responding? And your client system hangs completely? > And there are ways to wait until superblock shutdown that I've > mentioned before in other places where it somehow really > matters. inotify's IN_UMOUNT will notify about superblock > shutdown. IOW, when it really hits generic_shutdown_super() which > can only be hit after unfreezing as that requires active references. Can we maybe invert this discussion and think about it from the userspace side of things? How does the user(space) mount/unmount devices cleanly and reliably? > So this really can be used to wait for a filesystem to go away across > all namespaces, and across filesytem freezing and it's available to > unprivileged users. 
Simple example: > # shell 1 > sudo mount -t xfs /dev/sda /mnt > sudo mount --bind /mnt /opt > inotifywait -e unmount /mnt > #shell 2 > sudo umount /opt # nothing happens in shell 1 > sudo umount /mnt # shell 1 gets woken So what makes this *useful* to anyone? Why doesn't the bind mount A) lock /mnt into place, but B) give you some way of seeing that there's a bind mount in place? >> corner-cases. So does the current behavior, I agree, but improving >> situation for one usecase while breaking another usecase isn't really a way >> forward... > Agreed. >> Well, the filesystem (struct superblock to be exact) is invisible >> in /proc/mounts (or whatever), that is true. But it is still very >> much associated with that block device and if you do 'mount >> <device> <mntpoint>', you'll get it back. But yes, the filesystem >> will not go away Then should it be unmountable in the first place? I mean yes, we always need a way to force an unmount no matter what, even if that breaks some other process on the system, but for regular use, shouldn't it just give back an error like: /mnt in use by bind mount /opt or something like that? Give the poor sysadmin some information on what's going on here. > And now we at least have an api to detect that case and refuse to reuse > the superblock. >> until all references to it are dropped and you cannot easily find >> who holds those references and how to get rid of them. ding ding ding!!!! I don't want to have to run 'lsof' or something like that. > Namespaces make this even messier. You have no easy way of knowing > whether the filesystem isn't pinned somewhere else through an > explicit bind-mount or when it was copied during mount namespace > creation. This is the biggest downside of namespaces and bind mounts in my mind. The lack of traceability back to the source.
On Fri, Sep 08, 2023 at 12:20:14PM +0200, Jan Kara wrote: > Well, currently you click some "Eject / safely remove / whatever" button > and then you get a "wait" dialog until everything is done after which > you're told the stick is safe to remove. What I imagine is that the "wait" > dialog needs to be there while there are any (or exclusive at minimum) openers > of the device. Not until umount(2) syscall has returned. And yes, the > kernel doesn't quite make that easy - the best you can currently probably > do is to try opening the device with O_EXCL and if that fails, you know > there's some other exclusive open. ... and the simple answer to the problem is to have an event notification for when the super_block has finally been destroyed. That way the application gets this notification directly instead of having to make a process synchronous that fundamentally isn't as explained in this thread.
On Fri 08-09-23 12:51:03, Zdenek Kabelac wrote: > Dne 08. 09. 23 v 12:20 Jan Kara napsal(a): > > On Fri 08-09-23 11:29:40, Zdenek Kabelac wrote: > > > Dne 08. 09. 23 v 9:32 Jan Kara napsal(a): > > > > On Thu 07-09-23 14:04:51, Mikulas Patocka wrote: > > > > > On Thu, 7 Sep 2023, Christian Brauner wrote: > > > > > > > > > > > > I think we've got too deep down into "how to fix things" but I'm not 100% > > > > > > We did. > > > > > > > > > > > > > sure what the "bug" actually is. In the initial posting Mikulas writes "the > > > > > > > kernel writes to the filesystem after unmount successfully returned" - is > > > > > > > that really such a big issue? > > > > > I think it's an issue if the administrator writes a script that unmounts a > > > > > filesystem and then copies the underyling block device somewhere. Or a > > > > > script that unmounts a filesystem and runs fsck afterwards. Or a script > > > > > that unmounts a filesystem and runs mkfs on the same block device. > > > > Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem > > > > hasn't been unmounted properly and complain. Which is exactly what should > > > > IMHO happen. > > > I'd likely propose in this particular state of unmounting of a frozen > > > filesystem to just proceed - and drop the frozen state together with release > > > filesystem and never issue any ioctl from such filelsystem to the device > > > below - so it would not be a 100% valid unmount - but since the freeze > > > should be nearly equivalent of having a proper 'unmount' being done - it > > > shoudn't be causing any harm either - and all resources associated could > > > be 'released. IMHO it's correct to 'drop' frozen state for filesystem > > > that is not going to exist anymore (assuming it's the last such user) > > This option was also discussed in the past and it has nasty consequences as > > well. 
Cleanly shutting down a filesystem usually needs to write to the > > underlying device so either you allow the filesystem to write to the device > > on umount breaking assumptions of the user who froze the fs or you'd have > > to implement a special handling for this case for every filesystem to avoid > > the writes (and put up with the fact that the filesystem will appear as > > uncleanly shutdown on the next mount). Not particularly nice either... > > > I'd say there are several options and we should aim towards the variant > which is most usable by normal users. > > Making hyper complex unmount rule logic that basically no user-space tools > around Gnome/KDE... are able to handle well and getting it to the position > where only the core kernel developer have all the 'wisdom' to detect and > decode system state and then 'know what's going on' isn't the favourite > goal here. I don't think we are really making forward progress in the argument which behavior is more or less correct or useful. But maybe when we cannot agree on the general solution we could still improve the situation that practically matters? E.g. disputing Gnome apps telling you you can safely remove the USB stick when you actually cannot because the filesystem on it is frozen is actually kind of weak argument because users that freeze filesystem on their USB stick are practically non-existent. So is there a usecase where users are hitting these problems in practice? Maybe some user report that triggered original Mikulas' patch? Or was that mostly a theoretical concern? > Freeze should be getting the filesystem into 'consistent' state - filesystem > should be able to 'easily' recover and finish all the ongoing 'unfinished' > process with the next mount without requiring full 'fsck' - otherwise it > would be useless for i.e. snapshot. More or less yes but e.g. grub2 isn't able to reliably read images of just frozen filesystem because it ignores journal contents. 
So if this was the root filesystem, it could result in an unbootable system. Honza
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c	2023-09-06 09:45:54.000000000 +0200
+++ linux-2.6/fs/namespace.c	2023-09-06 09:47:15.000000000 +0200
@@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
 	}
 	fsnotify_vfsmount_delete(&mnt->mnt);
 	dput(mnt->mnt.mnt_root);
-	deactivate_super(mnt->mnt.mnt_sb);
+	wait_and_deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
 }
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c	2023-09-05 21:09:16.000000000 +0200
+++ linux-2.6/fs/super.c	2023-09-06 09:52:20.000000000 +0200
@@ -36,6 +36,7 @@
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_context.h>
+#include <linux/delay.h>
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
@@ -365,6 +366,25 @@ void deactivate_super(struct super_block
 EXPORT_SYMBOL(deactivate_super);
 
 /**
+ * wait_and_deactivate_super - wait for unfreeze and drop a reference
+ * @s: superblock to deactivate
+ *
+ * Variant of deactivate_super(), except that we wait if the filesystem is
+ * frozen. This is required on unmount, to make sure that the filesystem is
+ * really unmounted when this function returns.
+ */
+void wait_and_deactivate_super(struct super_block *s)
+{
+	down_write(&s->s_umount);
+	while (s->s_writers.frozen != SB_UNFROZEN) {
+		up_write(&s->s_umount);
+		msleep(1);
+		down_write(&s->s_umount);
+	}
+	deactivate_locked_super(s);
+}
+
+/**
  * grab_super - acquire an active reference
  * @s: reference we are trying to make active
  *
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2023-09-05 21:09:16.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2023-09-06 09:46:56.000000000 +0200
@@ -2262,6 +2262,7 @@ void kill_anon_super(struct super_block
 void kill_litter_super(struct super_block *sb);
 void deactivate_super(struct super_block *sb);
 void deactivate_locked_super(struct super_block *sb);
+void wait_and_deactivate_super(struct super_block *sb);
 int set_anon_super(struct super_block *s, void *data);
 int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
 int get_anon_bdev(dev_t *);
lvm may suspend any logical volume at any time. If an lvm suspend races with
unmount, it is possible that the kernel writes to the filesystem after
unmount has successfully returned. The problem can be demonstrated with this
script:

#!/bin/sh -ex
modprobe brd rd_size=4194304
vgcreate vg /dev/ram0
lvcreate -L 16M -n lv vg
mkfs.ext4 /dev/vg/lv
mount -t ext4 /dev/vg/lv /mnt/test
dmsetup suspend /dev/vg/lv
(sleep 1; dmsetup resume /dev/vg/lv) &
umount /mnt/test
md5sum /dev/vg/lv
md5sum /dev/vg/lv
dmsetup remove_all
rmmod brd

The script unmounts the filesystem and runs md5sum twice; the two
invocations return different hashes.

What happens:
* dmsetup suspend calls freeze_bdev, which goes to freeze_super, and it
  increments sb->s_active
* then we unmount the filesystem: we go to cleanup_mnt, cleanup_mnt calls
  deactivate_super, deactivate_super sees that sb->s_active is 2, so it
  decreases it to 1 and does nothing - the umount syscall returns
  successfully
* now we have a mounted filesystem despite the fact that umount returned
* we call md5sum, which waits for the block device to be unblocked
* dmsetup resume unblocks the block device and calls thaw_bdev, which calls
  thaw_super and thaw_super_locked
* thaw_super_locked calls deactivate_locked_super; this actually drops the
  refcount and performs the unmount. The unmount races with md5sum; md5sum
  wins the race and returns the hash of the filesystem before it was
  unmounted
* the second md5sum returns the hash after the filesystem was unmounted

In order to fix this bug, this patch introduces a new function,
wait_and_deactivate_super, that waits if the filesystem is frozen and
performs deactivate_locked_super only after that.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 fs/namespace.c     |    2 +-
 fs/super.c         |   20 ++++++++++++++++++++
 include/linux/fs.h |    1 +
 3 files changed, 22 insertions(+), 1 deletion(-)

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel